This blog post covers setting up an Nvidia A100 GPU with an Nvidia AI Enterprise (NVAIE) license, so the driver installation procedure described here is for Linux guests only*.
For vGPU/GRID licenses, which support Windows guests in addition to Linux, please refer to the vGPU User Guide here.
To see which guest/VM operating systems are supported with each type of Nvidia software license, please look up this post.
*Update: After checking with Nvidia Support, I learned that Windows guests also work with the Nvidia AI Enterprise license and C-series vGPU types, provided the Multi-Instance GPU (MIG) feature is disabled.
High-level steps:
1. Set up an Nvidia licensing server, either a Delegated License Server (DLS) or a Cloud License Service (CLS), via the Nvidia Licensing Portal (NLP).
2. Generate and obtain the Client Configuration Token (see).
3. Perform the physical installation of the GPU card into the server.
   Follow the server manufacturer’s hardware install guide for instructions. Make sure to use the correct slot and PCI riser for your card. Most of the time, the GPU card is shipped separately from the server. Ensure the server is running compatible firmware.
4. Verify the GPU shows up in your server manufacturer’s hardware management tool/utility (Cisco UCSM/CIMC, HPE SMH, Dell OMSA etc.).
5. Install the drivers for the physical server’s OS and verify them using the nvidia-smi utility.
   If it’s a bare-metal/physical server, skip to step 7. For virtualized/hypervisor environments, continue to step 6.
6. Install the drivers for the Linux guest/VM OS and verify them using the nvidia-smi utility.
7. Activate the Nvidia client license (applicable to both bare-metal and hypervisor environments) and verify the licensed features are active.
   For hypervisor environments, license activation is needed only for the VMs.
I chose CLS for the licensing infrastructure, primarily because it removes the need to set up and maintain an on-premises connectivity appliance, which DLS requires.
A Cloud License Service (CLS) instance is created to act as the license server. It is a free-of-charge (FOC, Ref) service hosted on the Nvidia Licensing Portal (NLP).
To enable communication between a licensed client (guest/VM) and a CLS instance, the ports 443 and 80 must be open in your firewall or proxy (Ref).
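As a quick sanity check of that firewall requirement, here is a minimal sketch (not taken from Nvidia's documentation) that probes both ports from a guest. It assumes bash and the coreutils timeout command are available. check_port and report_cls_ports are helper names made up for this post, and the CLS hostname is deliberately left as an argument because it depends on your licensing setup.

```shell
# check_port HOST PORT: succeed if a TCP connection opens within 5 seconds.
# Relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
check_port() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# report_cls_ports HOST: print one status line for each port the post lists.
report_cls_ports() {
    for port in 443 80; do
        if check_port "$1" "$port"; then
            echo "port ${port}: reachable"
        else
            echo "port ${port}: BLOCKED"
        fi
    done
}
```

Run it as report_cls_ports followed by your CLS hostname. A BLOCKED line points at the firewall rule to revisit. If the guest reaches CLS only through a proxy, a direct probe like this will report BLOCKED even though licensing may still work.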
DLS is an excellent option for dark sites and air-gapped environments where Internet access is unavailable or restricted, but you are free to choose it for Internet-connected environments as well.
A “Client Configuration Token” file (.tok) is generated from NLP and supplied to the supported Linux guests/VMs per the support matrix. This activates the AI Enterprise license and, with it, the licensed features.
Let’s get going with the installation flow (except steps 3-4 related to the hardware setup).
To enable Nvidia AI Enterprise (NVAIE):
- Create a CLS instance on the NLP
- Generate a Client Configuration Token file
- Download the drivers based on the below combination:
o Product Family (Options: vGPU or NVAIE, filter “NVAIE”)
o Platform (Hypervisor – vSphere, Hyper-V etc.)
o Platform Version (Hypervisor Version – 6.7, 7.0, 8.0, 2019, 2022 etc.)
o Product Version (NVAIE Version)
- Install the 2 zip files in the Host folder (from the downloaded content) for the vCS profile, as AI Enterprise only supports Linux guests
(The single vib file is for the vWS profile, but it is of no use with AI Enterprise because only C-series vGPU types will be available, not Q- and B-series vGPU types).
- Enable the Multi-Instance GPU (MIG) feature, which the A100 hardware supports, by running the command:
nvidia-smi -mig 1
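To confirm the mode change took effect, nvidia-smi can also be queried for the current MIG mode. The sketch below assumes the mig.mode.current query field prints Enabled or Disabled once per GPU, and that awk is available in the shell you run it from. mig_mode_enabled is a helper name invented for this post.

```shell
# mig_mode_enabled: read the output of
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
# on stdin and succeed only if at least one GPU is listed and every
# listed GPU reports "Enabled".
mig_mode_enabled() {
    awk '
        { gsub(/^[ \t]+|[ \t]+$/, "") }        # trim surrounding whitespace
        $0 == "" { next }                      # ignore blank lines
        { total++; if ($0 != "Enabled") bad++ }
        END { exit (total > 0 && bad == 0) ? 0 : 1 }
    '
}
```

Usage: nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | mig_mode_enabled && echo "MIG is on". If it still reports Disabled right after the change, the new mode may be pending until the GPU is reset or the host is rebooted.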
- Add New Device > PCI Device to a supported Linux VM by using “Edit Settings” in the vSphere Web Client for the specific VM
- Install/verify VMware Tools are present on the VM
- Verify Nvidia Device/hardware is present in VM by running the command:
lspci | grep -Ei "3d|VGA"
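If the lspci listing is long, a small filter helps narrow it down to the GPU. This is only a sketch: list_nvidia_devices is a made-up helper, and it assumes the device description contains the string NVIDIA, which may not hold if the local PCI ID database is missing or outdated.

```shell
# list_nvidia_devices: read `lspci` output on stdin and print only the
# 3D/VGA controller lines that mention NVIDIA (case-insensitive).
# Exits non-zero when no such line is found.
list_nvidia_devices() {
    grep -Ei '3d controller|vga' | grep -i 'nvidia'
}
```

Usage: lspci | list_nvidia_devices || echo "no Nvidia device visible in this VM".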
- Disable the Nouveau driver for CentOS, Debian, Red Hat and Ubuntu Linux distributions
o Check if the Nouveau driver is present: lsmod | grep nouveau
o If present, create a file – /etc/modprobe.d/blacklist-nouveau.conf – with the following contents:
blacklist nouveau
options nouveau modeset=0
o Regenerate the kernel initial RAM file system (initramfs) by running:
– CentOS, RHEL: sudo dracut --force
– Debian, Ubuntu: sudo update-initramfs -u
o Reboot the VM
- On RHEL 8 and above, disable the Wayland display server protocol
o Edit the existing file /etc/gdm/custom.conf and uncomment the following option:
WaylandEnable=false
o Save the file and reboot the VM
- Install the Nvidia guest driver
o CentOS, RHEL: sudo rpm -iv ./nvidia-linux-grid-.rpm
o Debian, Ubuntu: sudo apt-get install ./nvidia-linux-grid-.deb
o Reboot the VM
- Verify the Nvidia driver is installed and operational
o All Linux distributions: nvidia-smi
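For repeat builds, the Nouveau steps above could be scripted roughly as follows. The function names and the rhel/debian family labels are invented for this sketch, while the file contents and rebuild commands are the ones listed earlier. The blacklist path is passed in as an argument so the function can be tried against a scratch file before touching /etc/modprobe.d.

```shell
# nouveau_loaded: read `lsmod` output on stdin; succeed if nouveau is listed.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}

# write_nouveau_blacklist PATH: write the two blacklist directives to PATH.
# On a real VM, PATH is /etc/modprobe.d/blacklist-nouveau.conf (needs root).
write_nouveau_blacklist() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$1"
}

# initramfs_command FAMILY: print the initramfs rebuild command.
# FAMILY is a label chosen for this sketch:
#   rhel   -> CentOS, RHEL
#   debian -> Debian, Ubuntu
initramfs_command() {
    case "$1" in
        rhel)   echo 'dracut --force' ;;
        debian) echo 'update-initramfs -u' ;;
        *)      echo "unknown distro family: $1" >&2; return 1 ;;
    esac
}
```

On a VM, check with lsmod | nouveau_loaded, call write_nouveau_blacklist with the real path as root, run the command printed by initramfs_command under sudo, and reboot.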
- Activate Nvidia client license (for all Linux distributions)
o Place the Client Configuration Token file at the default location
/etc/nvidia/ClientConfigToken
o Ensure token file has read, write and execute permissions for the owner and read permission for group and others
chmod 744 client-configuration-token-directory/client_configuration_token_*.tok
o Create the gridd.conf file from gridd.conf.template kept at /etc/nvidia
o Edit the gridd.conf file with
FeatureType=1
o If using a proxy server to reach CLS, add the following lines as well
ProxyServerAddress=address
ProxyServerPort=port
ProxyUserName=domain\username
ProxyCredentialsFilePath=path
o Restart the nvidia-gridd service
- Verify licensed features are active and a license has been leased by running “nvidia-smi -q”
- License leasing can also be verified from the NLP
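
The license activation steps above can be sketched as a few shell helpers. Everything here is an illustration under assumptions: the function names are invented, set_gridd_option only handles simple values such as FeatureType=1 (values containing a backslash or | would need escaping first), and license_is_active assumes nvidia-smi -q prints a License Status line whose value starts with Licensed when a lease is held.

```shell
# install_token SRC DEST_DIR: copy the .tok file into DEST_DIR and apply the
# 744 permissions described above. On a real guest DEST_DIR is
# /etc/nvidia/ClientConfigToken and the call needs root.
install_token() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" &&
        cp "$src" "$dest_dir/" &&
        chmod 744 "$dest_dir/$(basename "$src")"
}

# set_gridd_option FILE KEY VALUE: set KEY=VALUE in FILE. An existing KEY line
# (commented out or not) is rewritten; otherwise the setting is appended.
set_gridd_option() {
    file=$1
    key=$2
    value=$3
    if grep -Eq "^#?[[:space:]]*${key}=" "$file"; then
        sed -E "s|^#?[[:space:]]*${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# license_is_active: read `nvidia-smi -q` output on stdin and succeed only if
# a "License Status" line reports Licensed (and not Unlicensed).
license_is_active() {
    grep -E 'License Status[[:space:]]*:[[:space:]]*Licensed' > /dev/null
}
```

A typical sequence on a guest would be set_gridd_option /etc/nvidia/gridd.conf FeatureType 1, a restart of the nvidia-gridd service, and then nvidia-smi -q | license_is_active && echo "license leased".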