Problems with NVIDIA vGPU

Hello,

I’m trying to share an NVIDIA L40S GPU among several VMs (up to 32, depending on the vGPU profile; I’m applying L40S-1Q) using vGPU. First of all, I read the NVIDIA documentation and, after requesting a 90-day free license, installed the “nvidia-vgpu” driver on my Ubuntu 22.04 host.
Then, the NVIDIA driver created some “virtual functions”: according to lspci, the operating system now sees 32 vGPUs.
After that, I read and followed the “NVIDIA vGPU & MIG” and “PCI Passthrough” guides. However, I still don’t get any available PCI device.
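For reference, this is roughly how I created the virtual functions, following the NVIDIA vGPU documentation (a sketch of my steps; the PCI address 0000:c3:00.0 is from my host, adjust it to yours):

```
# Enable SR-IOV virtual functions on the physical L40S
# (sriov-manage is shipped with the NVIDIA vGPU host driver):
sudo /usr/lib/nvidia/sriov-manage -e 0000:c3:00.0

# Check that the virtual functions are now visible:
lspci -d 10de: -nn
```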

My /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf is:

:filter:
  - '10de:*'

:short_address:
  - 'c3:00.4'
  - 'c3:00.5'
  - 'c3:00.6'
  - 'c3:00.7'
  - 'c3:01.0'
  - 'c3:01.2'
  - 'c3:01.3'
  - 'c3:01.4'
  - 'c3:01.5'
  - 'c3:01.6'
  - 'c3:01.7'
  - 'c3:02.0'
  - 'c3:02.1'
  - 'c3:02.2'
  - 'c3:02.3'
  - 'c3:02.4'
  - 'c3:02.5'
  - 'c3:02.6'
  - 'c3:02.7'
  - 'c3:03.0'
  - 'c3:03.1'
  - 'c3:03.2'
  - 'c3:03.3'
  - 'c3:03.4'
  - 'c3:03.5'
  - 'c3:03.6'
  - 'c3:03.7'
  - 'c3:04.0'
  - 'c3:04.1'
  - 'c3:04.2'
  - 'c3:04.3'

:device_name:
  - 'NVIDIA L40S'

:nvidia_vendors:
  - '10de'
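As a side note, since every virtual function shares the same vendor ID, a minimal configuration that relies only on the vendor filter (dropping the per-address list) should also work, if I understand the stock kvm-probes pci.conf semantics correctly (a sketch, not tested on my side):

```
# /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf
# Match every NVIDIA device (vendor 10de), any address, any name:
:filter:
  - '10de:*'

:short_address: []

:device_name: []

:nvidia_vendors:
  - '10de'
```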

From CLI:

virsh nodedev-dumpxml pci_0000_c3_00_0 | egrep 'domain|bus|slot|function'
    <domain>0</domain>
    <bus>195</bus>
    <slot>0</slot>
    <function>0</function>
    <capability type='virt_functions' maxCount='32'>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x01' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x02' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x4'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x5'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x6'/>
      <address domain='0x0000' bus='0xc3' slot='0x03' function='0x7'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x0'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x1'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x2'/>
      <address domain='0x0000' bus='0xc3' slot='0x04' function='0x3'/>
      <address domain='0x0000' bus='0xc3' slot='0x00' function='0x0'/>

It seems the whole process from the NVIDIA documentation finished correctly, but on the OpenNebula side something is not working: after modifying /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf and running “onehost sync --force”, “onehost show -j 0” shows this:

[...]
        "MAX_DISK": "1800515",
        "USED_DISK": "74082"
      },
      "PCI_DEVICES": {},
      "NUMA_NODES": {
        "NODE": {
          "CORE": [
            {
              "CPUS": "21:-1,45:-1",
              "DEDICATED": "NO",
              "FREE": "2",
[...]

There is no PCI_DEVICES entry, but:

oneadmin@test-gpu:~$ onehost show 0 -j | grep PCI
      "PCI_DEVICES": {},
      "PCI_FILTER": "10de:26b9",

Then, from Sunstone, if I modify the host in the Infrastructure menu, the PCI tab shows no PCI devices:


So I can’t attach any PCI device (any NVIDIA vGPU device) to any VM.
Also, if I modify the template and, under “PCI Devices”, add a PCI device as “Specific device” with value “c3:01.3”, the VM does not start and the scheduler logs this error in rank_sched.log:
rank_sched.log:Fri Dec 19 11:39:56 2025 [Z0][SCHED][DD]: Host 0 discarded for VM 19. Unavailable PCI device.
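In case it helps, the raw template equivalent of the “Specific device” setting I used is (a sketch; SHORT_ADDRESS is the attribute OpenNebula uses for pinning a fixed PCI address):

```
PCI = [
  SHORT_ADDRESS = "c3:01.3"
]
```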

What am I doing wrong? Am I missing something?

Please, I need help. If someone wants more detailed information about what is configured on my system, please tell me.

Thanks!!!

Hello @Daniel_Ruiz_Molina,

Happy holidays, and apologies for the late reply.

Please, could you share the output of:

  • lspci -nn | grep -i nvidia
  • ls /sys/class/mdev_bus/
  • onehost show <host-id> -j | jq .PCI_FILTER
  • Confirm the NVIDIA vGPU driver version and the host OS version.

I’ve also asked the engineering team for more help.

Regards,

Hello @FrancJP

Happy new year 2026!!!

Outputs:

  • lspci -nn | grep -i nvidia

c3:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:00.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:01.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:02.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.4 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.6 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:03.7 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.0 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.1 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.2 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
c3:04.3 3D controller [0302]: NVIDIA Corporation Device [10de:26b9] (rev a1)
  • ls /sys/class/mdev_bus/
0000:c3:00.4  0000:c3:00.7  0000:c3:01.2  0000:c3:01.5  0000:c3:02.0  0000:c3:02.3  0000:c3:02.6  0000:c3:03.1  0000:c3:03.4  0000:c3:03.7  0000:c3:04.2
0000:c3:00.5  0000:c3:01.0  0000:c3:01.3  0000:c3:01.6  0000:c3:02.1  0000:c3:02.4  0000:c3:02.7  0000:c3:03.2  0000:c3:03.5  0000:c3:04.0  0000:c3:04.3
0000:c3:00.6  0000:c3:01.1  0000:c3:01.4  0000:c3:01.7  0000:c3:02.2  0000:c3:02.5  0000:c3:03.0  0000:c3:03.3  0000:c3:03.6  0000:c3:04.1
  • onehost show <host-id> -j | jq .PCI_FILTER
oneadmin@test-gpu:~$ onehost show 0 -j | jq .PCI_FILTER
null
  • NVIDIA vGPU driver version and the host OS version
root@test-gpu:~# dpkg -l | grep -i nvidia
ii  nvidia-vgpu-ubuntu-580                 580.95.02                               amd64        NVIDIA vGPU driver - version 580.95.02
root@test-gpu:~# cat /etc/debian_version
bookworm/sid
root@test-gpu:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Thanks.

Hi @FrancJP ,

after debugging the /var/tmp/one/im/kvm-probes.d/host/system/pci.rb file with RubyMine, I found that the problem generating the “PCI list” is on line #174, where the script says “next if matched != true”. With that line enabled, the script skips every device and “puts ' PCI = ['” returns nothing; however, with that line commented out, the script outputs a “PCI = [” array entry for every vGPU.
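For anyone following along, I reproduced this by running the probe by hand as the oneadmin user (the path is how it is deployed on my host; it may differ on yours):

```
su - oneadmin -c 'ruby /var/tmp/one/im/kvm-probes.d/host/system/pci.rb'
```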

However, I don’t know how to assign a “default” GPU in my template so that each of my 32 VMs (depending on the L40S profile) can automatically take one of the 32 vGPUs. What I want to avoid is creating 32 different templates.
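If I understand the OpenNebula PCI passthrough docs correctly, instead of pinning a fixed address you can select the device by vendor/device/class and let the scheduler pick any free matching one (a sketch based on the IDs reported on my host):

```
PCI = [
  VENDOR = "10de",
  DEVICE = "26b9",
  CLASS  = "0302"
]
```

With this in a single template, each instantiated VM should grab whichever matching vGPU is still unassigned.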

Thanks.

Hi,

Also, after several tests, I have seen that “onehost show 0” now shows this:

PCI DEVICES

   VM ADDR    TYPE           NAME
      c3:01.6 10de:26b9:0302 NVIDIA vGPU device
      c3:00.4 10de:26b9:0302 NVIDIA vGPU device
      c3:00.5 10de:26b9:0302 NVIDIA vGPU device
      c3:00.6 10de:26b9:0302 NVIDIA vGPU device
      c3:00.7 10de:26b9:0302 NVIDIA vGPU device
      c3:01.0 10de:26b9:0302 NVIDIA vGPU device
      c3:01.2 10de:26b9:0302 NVIDIA vGPU device
      c3:01.3 10de:26b9:0302 NVIDIA vGPU device
      c3:01.4 10de:26b9:0302 NVIDIA vGPU device
      c3:01.5 10de:26b9:0302 NVIDIA vGPU device
      c3:01.7 10de:26b9:0302 NVIDIA vGPU device
      c3:02.0 10de:26b9:0302 NVIDIA vGPU device
      c3:02.1 10de:26b9:0302 NVIDIA vGPU device
      c3:02.2 10de:26b9:0302 NVIDIA vGPU device
      c3:02.3 10de:26b9:0302 NVIDIA vGPU device
      c3:02.4 10de:26b9:0302 NVIDIA vGPU device
      c3:02.5 10de:26b9:0302 NVIDIA vGPU device
      c3:02.6 10de:26b9:0302 NVIDIA vGPU device
      c3:02.7 10de:26b9:0302 NVIDIA vGPU device
      c3:03.0 10de:26b9:0302 NVIDIA vGPU device
      c3:03.1 10de:26b9:0302 NVIDIA vGPU device
      c3:03.2 10de:26b9:0302 NVIDIA vGPU device
      c3:03.3 10de:26b9:0302 NVIDIA vGPU device
      c3:03.4 10de:26b9:0302 NVIDIA vGPU device
      c3:03.5 10de:26b9:0302 NVIDIA vGPU device
      c3:03.6 10de:26b9:0302 NVIDIA vGPU device
      c3:03.7 10de:26b9:0302 NVIDIA vGPU device
      c3:04.0 10de:26b9:0302 NVIDIA vGPU device
      c3:04.1 10de:26b9:0302 NVIDIA vGPU device
      c3:04.2 10de:26b9:0302 NVIDIA vGPU device
      c3:04.3 10de:26b9:0302 NVIDIA vGPU device

However, when I configure a generic template with a vGPU, I have some doubts:

  1. If I use neither “Specific device” nor a “profile”, I see this:

  2. If I don’t use “Specific device” but select a “profile”, I see this:

  3. If I use “Specific device”, I only see 5 devices, no more.

I have created several templates with different configurations, but I don’t get consistent behaviour. Sometimes I can run two VMs with a (supposed) vGPU and, at other times, I can’t instantiate a VM because the scheduler reports the PCI device as unavailable, or because of “Not enough capacity in Host or System DS, dispatch limit reached, or limit of free leases reached”.

How can I get the correct configuration to allow 32 VMs with one vGPU each? (I want to use the nvidia-1147 profile; after reading the NVIDIA documentation, nvidia-1147 gives a good trade-off between resources and the maximum number of vGPUs allowed.)
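What I am aiming at would be something like this in a single template, if the PROFILE attribute works as I understand it from the OpenNebula vGPU docs (a sketch; nvidia-1147 is the profile name reported on my host, and the vendor/device/class IDs are from my lspci output):

```
PCI = [
  VENDOR  = "10de",
  DEVICE  = "26b9",
  CLASS   = "0302",
  PROFILE = "nvidia-1147"
]
```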

Thanks in advance.

Hello @Daniel_Ruiz_Molina

Very sorry for the late reply; the issue is very particular. I’ve asked the engineering team for more help, and they mentioned two things:

  • Issue 7420 is very similar to what you describe here.
  • A fix is expected for 7.2.

However, we are trying to replicate this, and might give you more information soon.

Best regards,

Hello @FrancJP ,

After some days running more and more tests, I have reached a new state. Now I can instantiate different vGPUs, but I have noticed that new VMs take a long time to enter the scheduling process and, once they enter scheduling, they sometimes cannot be dispatched (because of resources or other causes)… until the monitor re-executes the monitoring script on the host and lists all PCI devices again and, I don’t know why, the VM then starts booting…

Really, I don’t know why.

Thanks.

Hi @Daniel_Ruiz_Molina,

Looks like this is similar to what has been fixed here:
7392

This has been fixed in 7.0.1 (CE), so if you are using an earlier version, you will be hitting this bug.

Cheers,