Month: June 2023

Quick Note: VDI External Access and Connectivity

When I started working with VDI solutions, I often looked for the term “Public IP” in the solution’s documentation when setting up external access. Having since designed and implemented multiple Digital Workspace/End-User Computing (EUC) solutions for Virtual Desktop Infrastructure (VDI) and Desktop-as-a-Service (DaaS), I have picked up a fair amount of knowledge on this topic. I figured I’d share it for others starting out in the EUC space, to give them an idea of what to look for and how external access works.

“External address” is the term you should be looking for in this case. With VMware Horizon, the Unified Access Gateway (UAG) facilitates external access for end-users to consume their apps and desktops; with Citrix Virtual Apps and Desktops (CVAD), it is the Citrix Gateway (CGW). In production environments, there will typically be 2 of these sitting in a DMZ, forwarding traffic to the Connection Servers (Horizon) or StoreFront (Citrix). A Load Balancer (LB) appliance usually sits in front of the UAGs or CGWs.

UAG in recent Horizon versions (at least Horizon 7 and above) has an HA feature that can provide L4 LB functionality. If full L4-L7 LB capabilities are needed, then an external LB such as NSX Advanced Load Balancer (Avi), Citrix ADC, F5 BIG-IP, etc. should be used.

Horizon

If the HA feature is used, and say you have 2 UAGs, then 3 Public IPs will be needed – 1 for each UAG’s DMZ Internet/external-facing IP and 1 for the HA VIP, which is also from the DMZ segment. You then create a public DNS record with your public DNS hosting provider for the VDI access URL, say vdi.company.com, pointing to the Public IP of the UAG HA VIP. On your network firewall, that Public IP is NAT’d to the UAG HA VIP’s private IP in the DMZ.
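
To make this concrete, here is a minimal, purely illustrative addressing sketch (all IPs, names and ports below are hypothetical; the ports listed are the usual Horizon/Blast/PCoIP ones):

  # UAG1 DMZ/external IP:  192.168.100.11   (Public IP 1: 203.0.113.11)
  # UAG2 DMZ/external IP:  192.168.100.12   (Public IP 2: 203.0.113.12)
  # UAG HA VIP (DMZ):      192.168.100.10   (Public IP 3: 203.0.113.10)
  # Public DNS A record:   vdi.company.com -> 203.0.113.10
  # Firewall NAT:          203.0.113.10 <-> 192.168.100.10
  #                        (typically TCP 443, TCP/UDP 8443 [Blast], TCP/UDP 4172 [PCoIP])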

If an LB is used, the process is similar except that the UAG LB VIP is used instead of the UAG HA VIP. With an LB, there are options to perform PAT instead of NAT, which reduces the Public IP count to 1. Ref VMware KB 2146312.

Citrix

Here, an LB is required to load balance the CGW appliances. For the last few years, Citrix has shipped ADC and CGW as a single appliance image, and you enable the functionality of the required component – ADC, CGW or both. On an ADC/CGW there are mainly 3 IPs – NSIP (NetScaler IP: appliance management IP), SNIP (Subnet IP: back-end communication to the Citrix infrastructure, existing services and components) and the VIP (Virtual IP: the Internet-facing interface in the DMZ). This VIP is NAT’d to the Public IP.
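
As a purely illustrative sketch (all addresses are hypothetical), the three IPs and the NAT might look like this:

  # NSIP (appliance management):        192.168.10.5
  # SNIP (back-end to Citrix infra):    192.168.20.5
  # Gateway VIP (Internet-facing, DMZ): 192.168.100.20
  # Firewall NAT:                       203.0.113.20 <-> 192.168.100.20 (TCP 443)
  # Public DNS A record:                vdi.company.com -> 203.0.113.20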

Split DNS

If you wish to keep the same VDI access URL for both internal and external users, you need to make use of split DNS. For this, you create a zone, e.g. company.com, on both the internal and the external/public DNS servers. On the public DNS server, a host (A) record, e.g. vdi.company.com, is created to point to the Public IP of the LB, UAG or CGW as the case may be. On the internal DNS server, a host (A) record, e.g. vdi.company.com, is created to point to the Connection Server(s) (or their LB) or StoreFront server(s) (or their LB) as the case may be.
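
A quick way to sanity-check a split DNS setup is to resolve the same name against both DNS servers; a hedged example with hypothetical IPs (203.0.113.10 standing in for the Public IP and 10.0.10.20 for the internal Connection Server/StoreFront LB VIP):

  # From an external client (or against a public resolver):
  dig +short vdi.company.com @8.8.8.8       # expect the Public IP, e.g. 203.0.113.10
  # From an internal client, against the internal DNS server (e.g. 10.0.0.53):
  dig +short vdi.company.com @10.0.0.53     # expect the internal VIP, e.g. 10.0.10.20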

Tools and Nuances of Cisco HyperFlex (HX)

HX Sizer: https://hyperflexsizer.cloudapps.cisco.com (CCO ID required)

HX Preinstall Tool: https://hxpreinstall.cloudapps.cisco.com (CCO ID required)

HX Hypercheck (Health Check) Tool for HXDP 4.0 and earlier: https://github.com/CiscoDevNet/Hyperflex-Hypercheck
Hypercheck is available as a built-in utility in HXDP 4.5+.

A lesser-known trick/fact: the HXDP Installer can be run from an HX node (HX ESXi host) that will be part of the cluster being formed – call it a Nested HX Installer method.
This is extremely useful in ROBO/edge greenfield scenarios where there is no computing system (PC or laptop) on which to run the HXDP Installer via a Type-2 hosted hypervisor (Oracle VirtualBox, VMware Workstation/Player).

Tech Note Nested vCenter on HX: https://www.cisco.com/c/en/us/td/docs/hyperconverged_systems/HyperFlex_HX_DataPlatformSoftware/TechNotes/Nested_vcenter_on_hyperflex.html

Tip: Fixing Cisco UCS Fabric Interconnect CLI/SSH Access

After upgrading the UCSM firmware and/or installing an internal Certificate Authority (CA)-issued certificate on the UCSM, direct SSH via the PuTTY client might get rejected with the error “Remote side unexpectedly closed network connection”. This is due to the cipher updates that happen with the firmware upgrade or certificate installation.

The solution is to use either an OpenSSH client or the latest version of the PuTTY client (0.78) (Ref 1, 2).

OpenSSH access format: ssh username@IPaddress
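
If you want to see why an older PuTTY build fails, the OpenSSH client can show what it supports and what the Fabric Interconnect offers during negotiation (standard OpenSSH options; the username/IP are placeholders):

  ssh -Q kex                    # key-exchange algorithms the client supports
  ssh -Q cipher                 # ciphers the client supports
  ssh -vv admin@<FI-mgmt-IP>    # verbose output shows the algorithms offered/negotiated by the FI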

Tip: vCenter High Availability (vCHA) with Hyperconverged Infrastructure (HCI)

Quick tip on circumventing a known limitation around vCHA on HCI solutions that only offer NFS v3 (not NFS v4) datastores.

vCenter Server Appliance (vCSA) in High Availability mode is made up of 3 instances – Active, Passive and Witness nodes.

At least 2 of the 3 instances (active, passive or witness) must be running all the time for vCenter Server HA to work.

To achieve true HA of the vCenter Server, it is recommended to place at least the Witness instance on a separate ESXi host or cluster from the Active and Passive instances. The reason: consider there are 2 clusters; the 3 instances will most probably be split as 1 of 3 in the 1st cluster and the remaining 2 of 3 in the 2nd cluster. Now if the 2nd cluster goes bad, there will be no vCenter Server access, let alone HA!

Ideally, all 3 instances should be in 3 different clusters.

In smaller environments with 1 or 2 clusters, you may need to place 2 or all 3 of the instances in the same cluster. In such scenarios, you might encounter errors at the cloning stage (the Active instance is deployed first and then used as a base template for cloning the Passive and Witness instances with the appropriate network settings [2 and 1 vNICs/adapters respectively]; all of this is done by the vCHA automation process, or can be done manually by creating a Guest OS customization specification). This is due to a known issue related to NFS v3 datastores (Ref).

The solution is to use an NFS v4 datastore or to place the Witness instance on a different datastore than the Active instance. Pick the latter for HCI solutions that are only able to create NFS v3 datastores.
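
Before picking the workaround, you can confirm how a datastore is actually mounted on an ESXi host with the standard esxcli namespaces (datastore names will differ in your environment):

  esxcli storage nfs list      # datastores mounted over NFS v3
  esxcli storage nfs41 list    # datastores mounted over NFS v4.1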

This has been tested with Cisco HyperFlex (Ref), but the same applies to other HCI OEM vendors.

Tip: UCS Standalone Server and Nexus Configuration

Quick tip about the network configuration required on the Nexus end with UCS standalone servers.

Cisco UCS standalone servers with VIC converged network adapter (CNA) cards require the Forward Error Correction (FEC) mode on the upstream switch ports (the ones connected to the server’s VIC ports) to match the mode set on the server.

“cl91” is the default FEC mode set on Cisco UCS servers. This configuration is present in CIMC > Networking > Adapter Card MLOM > External Ethernet Interfaces > Admin FEC Mode.

On the upstream switches (Nexus), the FEC mode parameter must be included in the port configuration as:
fec rs-fec

This is over and above the standard port configuration parameters like port-channel, VLAN mode (trunk or access) etc.
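
Putting it together, a hypothetical Nexus port configuration for a server-facing trunk port could look like the below (interface ID, description and VLANs are placeholders):

  interface Ethernet1/10
    description UCS C220 M6 VIC port
    switchport mode trunk
    switchport trunk allowed vlan 10,20
    fec rs-fec
    no shutdown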

If the FEC mode parameter is missing, or configured but not with the same mode at both ends of the link, the ports at both ends won’t come up (green port link status LED) and will stay down (red port link status LED).

This has been tested with the below configuration.

  • Compute: Server
    • Make-Model: Cisco UCS C220 M6
    • Firmware/HUU: 4.2(3d)
    • OS: WS 2022 Std
  • Network: ToR/Upstream Switch
    • Make-Model: Cisco N9K-C93180YC-FX
    • OS: NX-OS

Nvidia Data Center GPU Installation and Configuration

This blog post covers the Nvidia GPU setup with an A100 card and an NVAIE license, so it covers the driver installation procedure for Linux guests only*.
For vGPU/GRID licenses, which support Windows guests as well as Linux, please refer to the vGPU User Guide here.
To understand which guest/VM OSes are supported with a specific type of Nvidia software license, please look up this post.

*Update: After checking with Nvidia Support, I learned that Windows guest OSes also work with the Nvidia AI Enterprise license, along with C-series vGPU types, by disabling the Multi-Instance GPU (MIG) feature.

High-level steps:

  1. Setup a Nvidia Licensing Server – either Delegated License Server (DLS) or Cloud License Service (CLS) – via the Nvidia Licensing Portal (NLP).
  2. Generate and obtain the Client Configuration Token (see).
  3. Perform the physical installation of the GPU card into the server.
    Follow the server manufacturer’s hardware install guide for instructions. Make sure to use the correct slot and PCI riser to populate your card. Most of the time, the GPU card is shipped separately from the server. Ensure the server is running compatible firmware.
  4. Verify the GPU shows up in your server manufacturer’s respective hardware management tool/utility (Cisco UCSM/CIMC, HPE SMH, Dell OMSA etc).
  5. Install the drivers for the physical server’s OS and verify the same using the nvidia-smi utility (a host-side verification sketch follows this list).
    If it’s a bare-metal/physical server, skip to step 7. For virtualized/hypervisor environments, continue to step 6.
  6. Install the drivers for the Linux guests/VM’s OS and verify the same using nvidia-smi utility.
  7. Activate the Nvidia client license – applicable to both bare-metal and hypervisor environments – and verify the licensed features are active.
    For hypervisor environments, license activation is needed only for the VMs.
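
For step 5 on a vSphere host, verification might look like the sketch below (the grep filter is only an assumption; actual Nvidia VIB names vary by NVAIE/vGPU release):

  esxcli software vib list | grep -iE "nvd|nvidia"   # assumed filter; look for the Nvidia host driver/manager VIBs
  nvidia-smi                                         # run on the ESXi host; should list the A100 and the host driver version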

I chose CLS for the licensing infrastructure portion, primarily because it takes away the setup and maintenance of an on-premises connectivity appliance, which is required in the case of DLS.
A Cloud License Service (CLS) instance is created to use as the License Server. It is a free-of-charge (FOC) service (Ref) hosted on the Nvidia Licensing Portal (NLP).
To enable communication between a licensed client (guest/VM) and a CLS instance, the ports 443 and 80 must be open in your firewall or proxy (Ref).
DLS is an excellent option for dark sites and air-gapped environments where Internet access is either completely unavailable or restricted, but you are free to choose it for Internet-connected environments as well.

A “Client Configuration Token” file (.tok) is generated from the NLP and supplied to supported Linux guests/VMs per the support matrix. This activates the AI Enterprise license and hence the features.

Let’s get going with the installation flow (except steps 3-4 related to the hardware setup).

To enable Nvidia AI Enterprise (NVAIE):

  • Create a CLS instance on the NLP
  • Generate a Client Configuration Token file
  • Download the drivers based on the below combination:
    o Product Family (Options: vGPU or NVAIE, filter “NVAIE”)
    o Platform (Hypervisor – vSphere, Hyper-V etc.)
    o Platform Version (Hypervisor Version – 6.7, 7.0, 8.0, 2019, 2022 etc.)
    o Product Version (NVAIE Version)
  • Install the 2 zip files in the Host folder (from the downloaded content) for the vCS profile, as AI Enterprise only supports Linux guests
    (the single vib file is for the vWS profile but is of no use with AI Enterprise, as only C-series vGPU types will be available and not Q- and B-series vGPU types).
  • Enable Multi-Instance GPU (MIG) feature as A100 hardware supports it by running the command: nvidia-smi -mig 1
  • Add New Device > PCI Device to a supported Linux VM by using “Edit Settings” in the vSphere Web Client for the specific VM
  • Install/verify VMware Tools are present on the VM
  • Verify Nvidia Device/hardware is present in VM by running the command:
    lspci | egrep "3d|VGA"
  • Disable the Nouveau driver for CentOS, Debian, Red Hat and Ubuntu Linux distributions
    o Check if Nouveau driver is present – lsmod | grep nouveau
    o If present, create a file – /etc/modprobe.d/blacklist-nouveau.conf – with the following contents:
    blacklist nouveau
    options nouveau modeset=0
    o Regenerate the kernel initial RAM file system (initramfs) by running:
    – CentOS, RHEL: sudo dracut --force
    – Debian, Ubuntu: sudo update-initramfs -u
    o Reboot the VM
  • On RHEL 8 and above, disable Wayland display server protocol
    o Edit an existing file – /etc/gdm/custom.conf – and uncomment the following option
    WaylandEnable=false
    o Save the file and reboot the VM
  • Install Nvidia guest driver
    o CentOS, RHEL: rpm -iv ./nvidia-linux-grid-.rpm
    o Debian, Ubuntu: sudo apt-get install ./nvidia-linux-grid-.deb
    o Reboot the VM
  • Verify the Nvidia driver is installed and operational
    o All Linux distributions: nvidia-smi
  • Activate Nvidia client license (for all Linux distributions)
    o Place the Client Configuration Token file at the default location
    /etc/nvidia/ClientConfigToken
    o Ensure token file has read, write and execute permissions for the owner and read permission for group and others
    chmod 744 client-configuration-token-directory/client_configuration_token_*.tok
    o Create the gridd.conf file from the gridd.conf.template file kept at /etc/nvidia
    o Edit the gridd.conf file with
    FeatureType=1
    o If using proxy server to reach CLS, add the following lines as well
    ProxyServerAddress=address
    ProxyServerPort=port
    ProxyUserName=domain\username
    ProxyCredentialsFilePath=path
    o Restart the nvidia-gridd service
  • Verify licensed features are active and a license has been leased by running “nvidia-smi -q” (see the sketch after this list)
  • License leasing can also be verified from the NLP
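
A consolidated sketch for the last few verification steps (nvidia-gridd and nvidia-smi come with the guest driver; the grep filter and exact output wording are assumptions and may vary by driver version):

  sudo systemctl restart nvidia-gridd
  sudo systemctl status nvidia-gridd --no-pager
  nvidia-smi -q | grep -i -A 2 "licensed product"    # expect License Status: Licensed, with a lease expiry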

Ref: Cisco CVD, Nvidia AI Enterprise User Guide

Nvidia Data Center Licensing

With Nvidia entering the $1 trillion club and everyone trying to get on the AI bandwagon, perhaps it’s the right time to post a write-up on their DC licensing.

While working on the delivery of an HCI (Cisco HyperFlex) solution containing Nvidia A100 80GB PCIe cards (these are beefy ones! – x16 double-wide [PCIe slot width] – see the pic below, with the card sitting on the PCI riser), I stumbled upon a customer question about vGPU support for different guest/VM OSes.

Some assistance from the local OEM (Nvidia) partner support team and a little digging into Nvidia documents led me to the information below, which is tested and verified at least for the A100 (hardware) + Nvidia AI Enterprise [NVAIE] (software license) combination.

Nvidia provides different profiles and vGPU types for use with different OSes and platforms (bare-metal/physical and virtualized/guest/VM).

Profiles are a result of the combination of the GPU hardware and software licenses bought. Below are the different profiles and vGPU types available with them.

In my case, even though the GPU hardware (A100) is capable of supporting both the vWS and vCS profiles, we can only truly use vCS.
This is due to the fact stated above –
profiles available = combination of GPU hardware model + software license.
So the vWS profile is of no use here with AI Enterprise, as only C-series vGPU types will be available and not Q- and B-series vGPU types. C-series is available with the vCS profile anyway.
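
If you want to confirm this on the host itself, nvidia-smi’s vgpu subcommand can list what the hardware + license combination actually exposes (a sketch; assumes the vGPU/NVAIE host manager is installed, and flag support may vary by release):

  nvidia-smi vgpu -s    # vGPU types supported on this GPU
  nvidia-smi vgpu -c    # vGPU types that can currently be created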

Profile/License Edition | vGPU Type | OS Supported | Use case

vGPU (formerly GRID) Products:
  vApps (Virtual Applications) | A-series | Windows, Linux | VDI – Streaming Apps, Citrix XenApp/Apps, MS RDSH, Horizon Apps
  vPC (Virtual PC “Business”) | B-series | Windows, Linux | VDI – Desktops, Citrix XenDesktop/CVAD, Horizon
  vCS (Virtual Compute Server) | C-series | Linux | AI, DL, HPC, Advanced Data Science
  vDWS or RTX vWS (Virtual Data Center Workstation “Quadro”) | Q-series | Windows, Linux | 3D Apps
  vDWS or RTX vWS (Virtual Data Center Workstation “Quadro”) | C-series | Linux | Advanced Data Science
  vDWS or RTX vWS (Virtual Data Center Workstation “Quadro”) | B-series | Windows, Linux | VDI

Compute-intensive Workload Product:
  AI Enterprise | C-series | Linux* | AI, DL, HPC, Advanced Data Science

Table: Nvidia Data Center License Categories

Note 1: vCS is now only available through AI Enterprise and not through vGPU category.
Ref: Nvidia vGPU Licensing Guide, Nvidia AI Enterprise Licensing Guide

Note 2: NVAIE licenses bought prior to April 2023 with the old SKU (731-AIE001+P2CMI60) cover 1 GPU and 10 VMs. The newer SKU covers 1 GPU and 20 VMs (screen grab from the NVAIE licensing guide linked above).

*Update: After checking with Nvidia Support, I learned that Windows guest OSes also work with the Nvidia AI Enterprise license, along with C-series vGPU types, by disabling the Multi-Instance GPU (MIG) feature.