Graphics

Nvidia Data Center GPU Installation and Configuration

This blog post covers Nvidia GPU setup with an A100 card and an NVAIE license, so it covers the driver installation procedure for Linux guests only*.
For vGPU/GRID licenses, which support Windows guests in addition to Linux, please refer to the vGPU User Guide here.
To understand which guest/VM OSes are supported with a specific type of Nvidia software license, please look up this post.

*Update: After checking with Nvidia Support, I learned that Windows guests also work with the Nvidia AI Enterprise license using C-series vGPU types, provided the Multi-Instance GPU (MIG) feature is disabled.

High-level steps:

  1. Set up an Nvidia Licensing Server – either Delegated License Server (DLS) or Cloud License Service (CLS) – via the Nvidia Licensing Portal (NLP).
  2. Generate and obtain the Client Configuration Token.
  3. Perform the physical installation of the GPU card into the server.
    Follow the server manufacturer’s hardware install guide for instructions. Make sure to use the correct slot and PCI riser to populate your card. Most of the time, the GPU card is shipped separately from the server. Ensure the server is running compatible firmware.
  4. Verify the GPU shows up in your server manufacturer’s respective hardware management tool/utility (Cisco UCSM/CIMC, HPE SMH, Dell OMSA etc).
  5. Install the drivers for the physical server’s OS and verify the same using nvidia-smi utility.
    If it’s a bare-metal/physical server, skip to step 7. For virtualized/hypervisor environments, continue to step 6.
  6. Install the drivers for the Linux guests/VM’s OS and verify the same using nvidia-smi utility.
  7. Activate the Nvidia client license – applicable to both bare-metal and hypervisor environments – and verify the licensed features are active.
    For hypervisor environments, license activation is needed only for the VMs.
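For the verification in steps 5–6, the driver name and version reported by nvidia-smi can be parsed as below. This is a sketch only: the sample string and its values are made up for illustration, and a live system would call nvidia-smi itself.

```shell
# Sample string stands in for live output; on a real system run:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
sample='NVIDIA A100 80GB PCIe, 535.104.05'   # made-up values for illustration
summary=$(echo "$sample" | awk -F', ' '{print "GPU: " $1 " / driver: " $2}')
echo "$summary"
```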

I chose CLS for the licensing infrastructure portion, primarily because it removes the need to set up and maintain an on-premises connectivity appliance, which DLS requires.
A Cloud License Service (CLS) instance is created to use as a License Server. It is a free-of-charge (FOC) service hosted on the Nvidia Licensing Portal (NLP) (Ref).
To enable communication between a licensed client (guest/VM) and a CLS instance, the ports 443 and 80 must be open in your firewall or proxy (Ref).
DLS is an excellent option for dark sites and air-gapped environments where Internet access is restricted or completely unavailable, but you are free to choose it for Internet-connected environments as well.
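A minimal reachability check from a guest toward the CLS can be sketched with bash's /dev/tcp pseudo-device. The CLS endpoint hostname is not shown here because it comes from your own NLP instance details; the localhost line is only a demo of the failure branch.

```shell
# check_port tries a TCP connect via bash's /dev/tcp pseudo-device.
# Replace the placeholder host with the CLS endpoint shown on your NLP instance.
check_port() {   # usage: check_port <host> <port>
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 blocked or filtered"
  fi
}
check_port 127.0.0.1 1    # demo: port 1 on localhost is almost certainly closed
# check_port <your-cls-endpoint> 443
# check_port <your-cls-endpoint> 80
```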

A “Client Configuration Token” file (.tok) is generated from the NLP and supplied to supported Linux guests/VMs per the support matrix. This activates the AI Enterprise license and hence the features.

Let’s get going with the installation flow (except steps 3-4 related to the hardware setup).

To enable Nvidia AI Enterprise (NVAIE):

  • Create a CLS instance on the NLP
  • Generate a Client Configuration Token file
  • Download the drivers based on the below combination:
    o Product Family (Options: vGPU or NVAIE, filter “NVAIE”)
    o Platform (Hypervisor – vSphere, Hyper-V etc.)
    o Platform Version (Hypervisor Version – 6.7, 7.0, 8.0, 2019, 2022 etc.)
    o Product Version (NVAIE Version)
  • Install the 2 zip files in the Host folder (from the downloaded content) for the vCS profile, since AI Enterprise only supports Linux guests.
    (The single vib file is for the vWS profile, but it is of no use with AI Enterprise, as only C-series vGPU types will be available, not Q- and B-series vGPU types.)
  • Enable the Multi-Instance GPU (MIG) feature, which the A100 hardware supports, by running the command: nvidia-smi -mig 1
  • Add New Device > PCI Device to a supported Linux VM by using “Edit Settings” in the vSphere Web Client for the specific VM
  • Install/verify VMware Tools are present on the VM
  • Verify Nvidia Device/hardware is present in VM by running the command:
    lspci | egrep "3d|VGA"
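The lspci check above can be exercised against a canned sample line; the PCI address and device string below are illustrative stand-ins for what a passed-through A100 typically reports.

```shell
# The sample line mimics what lspci prints for a passed-through Nvidia GPU;
# on a live VM, run: lspci | egrep "3d|VGA"
sample='02:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB]'   # illustrative
match=$(echo "$sample" | grep -Ei '3d|vga')
echo "$match"
```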
  • Disable the Nouveau driver for CentOS, Debian, Red Hat and Ubuntu Linux distributions
    o Check if Nouveau driver is present – lsmod | grep nouveau
    o If present, create a file – /etc/modprobe.d/blacklist-nouveau.conf – with the following contents:
    blacklist nouveau
    options nouveau modeset=0
    o Regenerate the kernel initial RAM file system (initramfs) by running:
    – CentOS, RHEL: sudo dracut --force
    – Debian, Ubuntu: sudo update-initramfs -u
    o Reboot the VM
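The Nouveau blacklist step above can be sketched as follows. A temp directory stands in for /etc/modprobe.d so the snippet runs without root; on the actual VM the file goes to /etc/modprobe.d/blacklist-nouveau.conf, followed by the initramfs rebuild and reboot.

```shell
# A temp dir stands in for /etc/modprobe.d so this runs without root;
# on the VM, write the same file to /etc/modprobe.d/blacklist-nouveau.conf.
conf_dir=$(mktemp -d)
cat > "${conf_dir}/blacklist-nouveau.conf" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
cat "${conf_dir}/blacklist-nouveau.conf"
```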
  • On RHEL 8 and above, disable Wayland display server protocol
    o Edit an existing file – /etc/gdm/custom.conf – and uncomment the following option
    WaylandEnable=false
    o Save the file and reboot the VM
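Uncommenting the Wayland option can be done with a one-line sed, sketched here against a temp copy (the two-line stand-in file is an assumption; a real /etc/gdm/custom.conf has more content).

```shell
# A temp copy stands in for /etc/gdm/custom.conf so this runs without root.
cfg=$(mktemp)
printf '[daemon]\n#WaylandEnable=false\n' > "$cfg"   # minimal stand-in for custom.conf
sed -i 's/^#WaylandEnable=false/WaylandEnable=false/' "$cfg"
grep WaylandEnable "$cfg"
```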
  • Install Nvidia guest driver
    o CentOS, RHEL: rpm -iv ./nvidia-linux-grid-&lt;version&gt;.rpm
    o Debian, Ubuntu: sudo apt-get install ./nvidia-linux-grid-&lt;version&gt;.deb
    o Reboot the VM
  • Verify the Nvidia driver is installed and operational
    o All Linux distributions: nvidia-smi
  • Activate Nvidia client license (for all Linux distributions)
    o Place the Client Configuration Token file at the default location
    /etc/nvidia/ClientConfigToken
    o Ensure token file has read, write and execute permissions for the owner and read permission for group and others
    chmod 744 client-configuration-token-directory/client_configuration_token_*.tok
    o Create the gridd.conf file from the gridd.conf.template file kept at /etc/nvidia
    o Edit the gridd.conf file with
    FeatureType=1
    o If using proxy server to reach CLS, add the following lines as well
    ProxyServerAddress=address
    ProxyServerPort=port
    ProxyUserName=domain\username
    ProxyCredentialsFilePath=path
    o Restart the nvidia-gridd service
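The license-activation file handling above can be sketched as below. A temp directory stands in for /etc/nvidia so this runs without root, and the template content and token filename are placeholders, not the real files.

```shell
# A temp dir stands in for /etc/nvidia so this runs without root; the
# template content and token filename below are placeholders.
nv=$(mktemp -d)
mkdir -p "${nv}/ClientConfigToken"
touch "${nv}/ClientConfigToken/client_configuration_token_example.tok"   # placeholder token
chmod 744 "${nv}/ClientConfigToken/"client_configuration_token_*.tok
printf '# FeatureType=0\n' > "${nv}/gridd.conf.template"                 # stand-in template
cp "${nv}/gridd.conf.template" "${nv}/gridd.conf"
echo 'FeatureType=1' >> "${nv}/gridd.conf"
grep '^FeatureType' "${nv}/gridd.conf"
```

On the real VM, the same sequence runs against /etc/nvidia, followed by a restart of the nvidia-gridd service.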
  • Verify licensed features are active and a license has been leased by running “nvidia-smi -q”
  • License leasing can also be verified from the NLP
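The license check in the last two bullets can be scripted by grepping nvidia-smi -q output for the license status field. The excerpt below is illustrative only; exact field names and layout can vary by driver version.

```shell
# Illustrative excerpt of `nvidia-smi -q` on a licensed client; field
# names and layout can vary by driver version.
sample='vGPU Software Licensed Product
    Product Name  : NVIDIA AI Enterprise
    License Status: Licensed'
status=$(echo "$sample" | grep 'License Status' | awk -F': ' '{print $2}')
echo "$status"
```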

Ref: Cisco CVD, Nvidia AI Enterprise User Guide

Nvidia Data Center Licensing

With Nvidia entering the $1 trillion club and everyone trying to get on the AI bandwagon, perhaps it’s the right time to post a write-up on their DC licensing.

While working on the delivery of an HCI (Cisco HyperFlex) solution containing Nvidia A100 80GB PCIe cards (these are beefy ones! – x16 double-wide [PCIe slot width] – see the pic below, with the card sitting on the PCI riser), I stumbled upon a customer question about vGPU support for different guest/VM OSes.

With some assistance from the local OEM (Nvidia) partner support team and a little digging into Nvidia documents, I arrived at the information below, which is tested and verified at least for the A100 (hardware) + Nvidia AI Enterprise [NVAIE] (software license) combination.

Nvidia provides different profiles and vGPU types for use with different OSes and platforms (bare-metal/physical and virtualized/guest/VM).

Profiles are a result of the combination of the GPU hardware and software licenses bought. Below are the different profiles and vGPU types available with them.

In my case, even though the GPU hardware (A100) is capable of supporting both the vWS and vCS profiles, we can only truly use vCS.
This is due to the fact stated above:
profiles eventually available = combination of GPU hardware model + software license.
So the vWS profile is of no use here with AI Enterprise, as only C-series vGPU types will be available, not Q- and B-series vGPU types, and C-series is in any case available with the vCS profile.

Profile/License Edition – vGPU Type – OS Supported – Use case

vGPU (formerly GRID) Products
  • vApps (Virtual Applications) – A-series – Windows, Linux – VDI: Streaming Apps, Citrix XenApp/Apps, MS RDSH, Horizon Apps
  • vPC (Virtual PC “Business”) – B-series – Windows, Linux – VDI: Desktops, Citrix XenDesktop/CVAD, Horizon
  • vCS (Virtual Compute Server) – C-series – Linux – AI, DL, HPC, Advanced Data Science
  • vDWS or RTX vWS (Virtual Data Center Workstation “Quadro”):
    o Q-series – Windows, Linux – 3D Apps
    o C-series – Linux – Advanced Data Science
    o B-series – Windows, Linux – VDI
Compute-intensive Workload Product
  • AI Enterprise – C-series – Linux* – AI, DL, HPC, Advanced Data Science

Table: Nvidia Data Center License Categories

Note 1: vCS is now only available through AI Enterprise and not through vGPU category.
Ref: Nvidia vGPU Licensing Guide, Nvidia AI Enterprise Licensing Guide

Note 2: NVAIE licenses bought prior to April 2023 with the old SKU (731-AIE001+P2CMI60) cover 1 GPU and 10 VMs. The newer SKU covers 1 GPU and 20 VMs (screen grab from the NVAIE licensing guide linked above).

*Update: After checking with Nvidia Support, I learned that Windows guests also work with the Nvidia AI Enterprise license using C-series vGPU types, provided the Multi-Instance GPU (MIG) feature is disabled.