Problems with NVIDIA vGPU

Hello,

I’m trying to share an NVIDIA L40S GPU among several VMs (up to 32, depending on the vGPU profile; I’m applying L40S-1Q) using vGPU. First of all, I read the NVIDIA documentation and, after requesting a 90-day free license, installed the “nvidia-vgpu” driver on my Ubuntu 22.04 host.
Then, the NVIDIA driver created some “virtual functions”: according to lspci, the operating system now sees 32 vGPUs.
After that, I read and followed the “NVIDIA vGPU & MIG” and “PCI Passthrough” guides. However, I still don’t get any available PCI device.
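For reference, this is roughly how I created the virtual functions, following the NVIDIA vGPU documentation (a sketch of my steps; the PCI address 0000:c3:00.0 is from my host, adjust it to yours):

```
# Enable SR-IOV virtual functions on the physical L40S
# (sriov-manage is shipped with the NVIDIA vGPU host driver):
sudo /usr/lib/nvidia/sriov-manage -e 0000:c3:00.0

# Check that the virtual functions are now visible:
lspci -d 10de: -nn
```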

My /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf is:

:filter:
  - '10de:*'

:short_address:
  - 'c3:00.4'
  - 'c3:00.5'
  - 'c3:00.6'
  - 'c3:00.7'
  - 'c3:01.0'
  - 'c3:01.2'
  - 'c3:01.3'
  - 'c3:01.4'
  - 'c3:01.5'
  - 'c3:01.6'
  - 'c3:01.7'
  - 'c3:02.0'
  - 'c3:02.1'
  - 'c3:02.2'
  - 'c3:02.3'
  - 'c3:02.4'
  - 'c3:02.5'
  - 'c3:02.6'
  - 'c3:02.7'
  - 'c3:03.0'
  - 'c3:03.1'
  - 'c3:03.2'
  - 'c3:03.3'
  - 'c3:03.4'
  - 'c3:03.5'
  - 'c3:03.6'
  - 'c3:03.7'
  - 'c3:04.0'
  - 'c3:04.1'
  - 'c3:04.2'
  - 'c3:04.3'

:device_name:
  - 'NVIDIA L40S'

:nvidia_vendors:
  - '10de'
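As a side note, since every virtual function shares the same vendor ID, a minimal configuration that relies only on the vendor filter (dropping the per-address list) should also work, if I understand the stock kvm-probes pci.conf semantics correctly (a sketch, not tested on my side):

```
# /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf
# Match every NVIDIA device (vendor 10de), any address, any name:
:filter:
  - '10de:*'

:short_address: []

:device_name: []

:nvidia_vendors:
  - '10de'
```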

From CLI:

virsh nodedev-dumpxml pci_0000_c3_00_0 | egrep 'domain|bus|slot|function'
    <domain>0</domain>
    <bus>195</bus>
    <slot>0</slot>
    <function>0</function>
    <capability type='virt_functions' maxCount='32'>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x0'/>

It seems the whole process from the NVIDIA documentation finished correctly, but on the OpenNebula side something is not working: after modifying /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf and running “onehost sync --force”, “onehost show -j 0” shows this:

[...]
        "MAX_DISK": "1800515",
        "USED_DISK": "74082"
      },
      "PCI_DEVICES": {},
      "NUMA_NODES": {
        "NODE": {
          "CORE": [
            {
              "CPUS": "21:-1,45:-1",
              "DEDICATED": "NO",
              "FREE": "2",
[...]

There is no PCI_DEVICES entry, but:

oneadmin@test-gpu:~$ onehost show 0 -j | grep PCI
      "PCI_DEVICES": {},
      "PCI_FILTER": "10de:26b9",

Then, from Sunstone, if I modify the host in the Infrastructure menu, the PCI tab shows no PCI devices:


So I can’t attach any PCI device (any NVIDIA vGPU device) to any VM.
Also, if I modify the template and, under “PCI Devices”, add a PCI device as “Specific device” with value “c3:01.3”, the VM does not start and the scheduler logs this error in rank_sched.log:
rank_sched.log:Fri Dec 19 11:39:56 2025 [Z0][SCHED][DD]: Host 0 discarded for VM 19. Unavailable PCI device.
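In case it helps, the raw template equivalent of the “Specific device” setting I used is (a sketch; SHORT_ADDRESS is the attribute OpenNebula uses for pinning a fixed PCI address):

```
PCI = [
  SHORT_ADDRESS = "c3:01.3"
]
```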

What am I doing wrong? Am I missing something?

Please, I need help. If someone wants more detailed information about what is configured on my system, please tell me.

Thanks!!!

Hello @Daniel_Ruiz_Molina,

Happy holidays, and apologies for the late reply.

Please, could you share the output of:

  • lspci -nn | grep -i nvidia
  • ls /sys/class/mdev_bus/
  • onehost show <host-id> -j | jq .PCI_FILTER
  • Confirm the NVIDIA vGPU driver version and the host OS version.

I’ve also asked the engineering team for more help.

Regards,

Hello @FrancJP

Happy new year 2026!!!

Outputs:

  • lspci -nn | grep -i nvidia

c3:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
  • ls /sys/class/mdev_bus/
0000:c3:00.4  0000:c3:00.7  0000:c3:01.2  0000:c3:01.5  0000:c3:02.0  0000:c3:02.3  0000:c3:02.6  0000:c3:03.1  0000:c3:03.4  0000:c3:03.7  0000:c3:04.2
0000:c3:00.5  0000:c3:01.0  0000:c3:01.3  0000:c3:01.6  0000:c3:02.1  0000:c3:02.4  0000:c3:02.7  0000:c3:03.2  0000:c3:03.5  0000:c3:04.0  0000:c3:04.3
0000:c3:00.6  0000:c3:01.1  0000:c3:01.4  0000:c3:01.7  0000:c3:02.2  0000:c3:02.5  0000:c3:03.0  0000:c3:03.3  0000:c3:03.6  0000:c3:04.1
  • onehost show <host-id> -j | jq .PCI_FILTER
oneadmin@test-gpu:~$ onehost show 0 -j | jq .PCI_FILTER
null
  • NVIDIA vGPU driver version and the host OS version
root@test-gpu:~# dpkg -l | grep -i nvidia
ii  nvidia-vgpu-ubuntu-580                 580.95.02                               amd64        NVIDIA vGPU driver - version 580.95.02
root@test-gpu:~# cat /etc/debian_version
bookworm/sid
root@test-gpu:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Thanks.

Hi @FrancJP ,

after debugging the /var/tmp/one/im/kvm-probes.d/host/system/pci.rb file with RubyMine, I found that the problem generating the “PCI list” is on line #174, where the script says “next if matched != true”. With that line enabled, the script skips every device and “puts ' PCI = ['” returns nothing; however, with that line commented out, the script outputs a “PCI = [” array entry for every vGPU.
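For anyone following along, I reproduced this by running the probe by hand as the oneadmin user (the path is how it is deployed on my host; it may differ on yours):

```
su - oneadmin -c 'ruby /var/tmp/one/im/kvm-probes.d/host/system/pci.rb'
```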

However, I don’t know how to assign a “default” GPU in my template so that each of my 32 VMs (depending on the L40S profile) can automatically take one of the 32 vGPUs. What I want to avoid is creating 32 different templates.
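If I understand the OpenNebula PCI passthrough docs correctly, instead of pinning a fixed address you can select the device by vendor/device/class and let the scheduler pick any free matching one (a sketch based on the IDs reported on my host):

```
PCI = [
  VENDOR = "10de",
  DEVICE = "26b9",
  CLASS  = "0302"
]
```

With this in a single template, each instantiated VM should grab whichever matching vGPU is still unassigned.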

Thanks.

Hi,

Also, after several tests, I have seen that “onehost show 0” now shows this:

PCI DEVICES

   VM ADDR    TYPE           NAME
      c3:01.6 10de:26b9:0302 NVIDIA vGPU device
      c3:00.4 10de:26b9:0302 NVIDIA vGPU device
      c3:00.5 10de:26b9:0302 NVIDIA vGPU device
      c3:00.6 10de:26b9:0302 NVIDIA vGPU device
      c3:00.7 10de:26b9:0302 NVIDIA vGPU device
      c3:01.0 10de:26b9:0302 NVIDIA vGPU device
      c3:01.2 10de:26b9:0302 NVIDIA vGPU device
      c3:01.3 10de:26b9:0302 NVIDIA vGPU device
      c3:01.4 10de:26b9:0302 NVIDIA vGPU device
      c3:01.5 10de:26b9:0302 NVIDIA vGPU device
      c3:01.7 10de:26b9:0302 NVIDIA vGPU device
      c3:02.0 10de:26b9:0302 NVIDIA vGPU device
      c3:02.1 10de:26b9:0302 NVIDIA vGPU device
      c3:02.2 10de:26b9:0302 NVIDIA vGPU device
      c3:02.3 10de:26b9:0302 NVIDIA vGPU device
      c3:02.4 10de:26b9:0302 NVIDIA vGPU device
      c3:02.5 10de:26b9:0302 NVIDIA vGPU device
      c3:02.6 10de:26b9:0302 NVIDIA vGPU device
      c3:02.7 10de:26b9:0302 NVIDIA vGPU device
      c3:03.0 10de:26b9:0302 NVIDIA vGPU device
      c3:03.1 10de:26b9:0302 NVIDIA vGPU device
      c3:03.2 10de:26b9:0302 NVIDIA vGPU device
      c3:03.3 10de:26b9:0302 NVIDIA vGPU device
      c3:03.4 10de:26b9:0302 NVIDIA vGPU device
      c3:03.5 10de:26b9:0302 NVIDIA vGPU device
      c3:03.6 10de:26b9:0302 NVIDIA vGPU device
      c3:03.7 10de:26b9:0302 NVIDIA vGPU device
      c3:04.0 10de:26b9:0302 NVIDIA vGPU device
      c3:04.1 10de:26b9:0302 NVIDIA vGPU device
      c3:04.2 10de:26b9:0302 NVIDIA vGPU device
      c3:04.3 10de:26b9:0302 NVIDIA vGPU device

However, when I configure a generic template with a vGPU, I have some doubts:

  1. If I use neither “Specific device” nor a “profile”, I see this:

  2. If I don’t use “Specific device” but select a “profile”, I see this:

  3. If I use “Specific device”, I only see 5 devices, no more.

I have created several templates with different configurations, but I don’t get consistent behaviour. Sometimes I can run two VMs with a (supposed) vGPU and, at other times, I can’t instantiate a VM because the scheduler reports the PCI device as unavailable, or because of “Not enough capacity in Host or System DS, dispatch limit reached, or limit of free leases reached”.

How can I get the correct configuration to allow 32 VMs with one vGPU each? (I want to use the nvidia-1147 profile; after reading the NVIDIA documentation, nvidia-1147 gives a good trade-off between resources and the maximum number of vGPUs allowed.)
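What I am aiming at would be something like this in a single template, if the PROFILE attribute works as I understand it from the OpenNebula vGPU docs (a sketch; nvidia-1147 is the profile name reported on my host, and the vendor/device/class IDs are from my lspci output):

```
PCI = [
  VENDOR  = "10de",
  DEVICE  = "26b9",
  CLASS   = "0302",
  PROFILE = "nvidia-1147"
]
```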

Thanks in advance.

Hello @Daniel_Ruiz_Molina

Very sorry for the late reply; the issue is very particular. I’ve asked the engineering team for more help, and they mentioned two things:

  • Issue 7420 is very similar to what you describe here.
  • A fix is expected for 7.2.

However, we are trying to replicate this, and might give you more information soon.

Best regards,

Hello @FrancJP ,

After some days running more and more tests, I have reached a new state. Now I can instantiate different vGPUs, but I have noticed that new VMs take a long time to enter the scheduling process and, once they enter scheduling, they sometimes cannot be dispatched (because of resources or other causes)… until the monitor re-executes the monitoring script on the host and lists all PCI devices again and, I don’t know why, the VM then starts booting…

Really, I don’t know why.

Thanks.

Hi @Daniel_Ruiz_Molina,

Looks like this is similar to what has been fixed here:
7392

This has been fixed in 7.0.1 (CE), so if you are using an earlier version, you will be hitting this bug.

Cheers,