Hi,
I am testing NVIDIA Virtual GPU (vGPU) support on OpenNebula 6.4, specifically with an NVIDIA Tesla T4, but I have found that either these cards do not support SR-IOV or SR-IOV does not work as expected.
Inspecting the PCI device, I can see that it advertises SR-IOV support:
lspci -v -s 4b:00.0
4b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Subsystem: NVIDIA Corporation Device 12a2
Physical Slot: 1
Flags: bus master, fast devsel, latency 0, IRQ 18, NUMA node 1, IOMMU group 40
Memory at de000000 (32-bit, non-prefetchable) [size=16M]
Memory at 23fc0000000 (64-bit, prefetchable) [size=256M]
Memory at 23ff0000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] Null
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
On the other hand, after installing the NVIDIA vGPU driver, nvidia-smi reports Host VGPU Mode: Non SR-IOV:
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Fri Jan 26 14:57:49 2024
Driver Version : 535.129.03
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported
Attached GPUs : 1
GPU 00000000:4B:00.0
Product Name : Tesla T4
Product Brand : NVIDIA
Product Architecture : Turing
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
....
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : Non SR-IOV
The documentation says that if your GPU supports SR-IOV you should use the sriov-manage command to enable the virtual functions, and otherwise you should check the NVIDIA documentation.
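For reference, on an SR-IOV-capable board the documented flow would be a sketch like the one below (the sriov-manage path and address format are as I understand them from NVIDIA's vGPU docs; they do not apply to my T4, which reports Non SR-IOV):

```shell
# Enable the virtual functions on an SR-IOV-capable GPU (e.g. an Ampere card).
# Note: the Tesla T4 reports "Host VGPU Mode: Non SR-IOV", so this is not my case.
/usr/lib/nvidia/sriov-manage -e 0000:4b:00.0

# The virtual functions would then show up as additional NVIDIA PCI functions:
lspci -d 10de:
```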
Since my card is non-SR-IOV, I followed this guide to create the virtual GPUs.
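In short, the non-SR-IOV path I used goes through the standard mdev sysfs interface; roughly (the nvidia-222 type is from my setup, and the UUID is freshly generated):

```shell
# List the vGPU types the card exposes through the mediated-device interface
ls /sys/class/mdev_bus/0000:4b:00.0/mdev_supported_types/

# Create one mediated device of type nvidia-222 with a fresh UUID
UUID=$(uuidgen)
echo "$UUID" > /sys/class/mdev_bus/0000:4b:00.0/mdev_supported_types/nvidia-222/create

# Or, equivalently, define it persistently with mdevctl and start it
mdevctl define --auto --uuid "$UUID" --parent 0000:4b:00.0 --type nvidia-222
mdevctl start --uuid "$UUID"
```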
I can confirm that the /sys/bus/mdev/devices/ directory contains the mdev device files for the vGPUs, and I can list them:
21bceb0c-c284-4db5-b8f9-807608e21fe5 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/21bceb0c-c284-4db5-b8f9-807608e21fe5
87dd2ff0-2624-42f2-ba87-ba7a38bfce78 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/87dd2ff0-2624-42f2-ba87-ba7a38bfce78
904f8f79-52e3-443c-a0cc-014aca508c1e -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/904f8f79-52e3-443c-a0cc-014aca508c1e
c6f8d48d-7282-5d65-bd0c-dbe46602b734 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/c6f8d48d-7282-5d65-bd0c-dbe46602b734
d000b06f-3a7a-4b29-baea-abda99443030 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/d000b06f-3a7a-4b29-baea-abda99443030
fd67e9b8-d360-4792-80fe-a4fa62e56eea -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/fd67e9b8-d360-4792-80fe-a4fa62e56eea
If I list the active mediated devices on the hypervisor host
mdevctl list
21bceb0c-c284-4db5-b8f9-807608e21fe5 0000:4b:00.0 nvidia-222 auto (defined)
87dd2ff0-2624-42f2-ba87-ba7a38bfce78 0000:4b:00.0 nvidia-230 auto (defined)
904f8f79-52e3-443c-a0cc-014aca508c1e 0000:4b:00.0 nvidia-222 auto (defined)
c6f8d48d-7282-5d65-bd0c-dbe46602b734 0000:4b:00.0 nvidia-222 manual
d000b06f-3a7a-4b29-baea-abda99443030 0000:4b:00.0 nvidia-222 auto (defined)
fd67e9b8-d360-4792-80fe-a4fa62e56eea 0000:4b:00.0 nvidia-222 auto (defined)
However, when I show the node information I only see one PCI device, and it is only possible to create one virtual machine:
onehost show node18111-1
PCI DEVICES
VM ADDR TYPE NAME
01:00.0 19a2:0120:0604 x1 PCIe Gen2 Bridge[Pilot4]
31:00.0 15b3:101b:0207 MT28908 Family [ConnectX-6]
4b:00.0 10de:1eb8:0302 NVIDIA Corporation TU104GL [Tesla T4]
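As far as I can tell, onehost show only lists the physical PCI functions found by the KVM probe, which here is the single T4 at 4b:00.0. Requesting it in a VM template with the standard OpenNebula PCI passthrough attribute (vendor/device/class values taken from the listing above) therefore targets the one physical device, which would explain why only one VM can use it:

```
PCI = [
  VENDOR = "10de",
  DEVICE = "1eb8",
  CLASS  = "0302"
]
```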
My questions are:
Is it possible to use cards without SR-IOV support in OpenNebula >= 6.4?
If it is possible, how can I use more than one vGPU, or how can I add the vGPUs to OpenNebula? I tried to add them manually with virsh, but it does not work.
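For the manual virsh attempt, what I tried is libvirt's hostdev element of type mdev; a minimal sketch of that test (the domain name is a placeholder, and the UUID is one of the mdevs listed above):

```shell
# Hypothetical sketch: hand one of the existing mdevs to a running guest.
cat > vgpu.xml <<'EOF'
<hostdev mode='subsystem' type='mdev' model='vfio-pci' display='off'>
  <source>
    <address uuid='c6f8d48d-7282-5d65-bd0c-dbe46602b734'/>
  </source>
</hostdev>
EOF
virsh attach-device my-vm vgpu.xml --live
```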
Thanks in advance.