OpenNebula 5.2.1
using PCI GPU passthrough.
There are 4 GPU cards in each (GPU) host.
ONE is not aware that the PCI device is already in use by a VM on a particular compute node, so the scheduler keeps trying to deploy a new VM there and configure it to use that PCI device. As a workaround I disabled the node so that no new VMs are deployed on it.
Is there a way to “tell” ONE that this PCI device is in use and by what VM?
I am new to OpenNebula, and I could not find a way to update the information for a particular compute node (host). I also could not find relevant issues while searching the forum, documentation or other sources on the internet.
Thanks in advance,
Hans Feringa
Some additional information:
When retrieving information about a VM (onevm show vm-id), the output does show that the PCI device is in use.
This is not shown in the host information (onehost show host-id), and it is this host information that the scheduler seems to use.
I ran onedb fsck recently and the mismatch is still there, so I think the onedb fsck procedure overlooks this case.
I also noticed that if ONE is not aware that the first address (or one of the first addresses) is in use, it never tries any of the other addresses that are available for the VM on that host. The VM then ends up in the failure state. That opens the way for the next VM to be scheduled onto the next available (unused) resource on this host. We assumed this was the case, and our tests confirmed it. The annoying thing is that a failed VM is never retried on another host; it stays stuck on the node where ONE thinks the allocated device/resource is still available.
The information for the host is stored in the body field of the host_pool table. There, the XML tag VMID for this PCI device has the value <![CDATA[-1]]> while it should (in this case) be <![CDATA[25400]]>.
The body field of the vm_pool table (the information for the VM) holds an XML blob in which the usage of the PCI device is recorded, with the correct address, bus, etc. So the information in the two tables is clearly out of sync.
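To make the mismatch easier to spot, here is a minimal Python sketch of the comparison I have in mind. It is not an OpenNebula tool and it only reads XML; it does not change anything in the database. The XML layout is an assumption based on what I see in my own body fields (PCI entries with an ADDRESS, plus a VMID on the host side), and the sample documents, the host ID 12 and the PCI addresses are made up for illustration. Only the VM ID 25400 comes from my case. The function names are my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down copies of the "body" XML of one host and one VM.
# Assumption: each PCI element carries an ADDRESS, and on the host also a VMID.
HOST_BODY = """
<HOST>
  <ID>12</ID>
  <HOST_SHARE>
    <PCI_DEVICES>
      <PCI><ADDRESS><![CDATA[0000:02:00:0]]></ADDRESS><VMID><![CDATA[-1]]></VMID></PCI>
      <PCI><ADDRESS><![CDATA[0000:03:00:0]]></ADDRESS><VMID><![CDATA[-1]]></VMID></PCI>
    </PCI_DEVICES>
  </HOST_SHARE>
</HOST>
""".strip()

VM_BODY = """
<VM>
  <ID>25400</ID>
  <TEMPLATE>
    <PCI><ADDRESS><![CDATA[0000:02:00:0]]></ADDRESS></PCI>
  </TEMPLATE>
</VM>
""".strip()


def host_pci_owners(host_xml):
    """Map each PCI address on the host to the VMID the host has recorded."""
    owners = {}
    for pci in ET.fromstring(host_xml).iter("PCI"):
        address = (pci.findtext("ADDRESS") or "").strip()
        vmid = (pci.findtext("VMID") or "-1").strip()
        if address:
            owners[address] = int(vmid)
    return owners


def vm_pci_addresses(vm_xml):
    """Return the VM ID and the PCI addresses listed in the VM's XML."""
    root = ET.fromstring(vm_xml)
    vm_id = int(root.findtext("ID"))
    addresses = [(pci.findtext("ADDRESS") or "").strip() for pci in root.iter("PCI")]
    return vm_id, [a for a in addresses if a]


def find_mismatches(host_xml, vm_xmls):
    """List (address, vm_id, recorded_vmid) where host and VM disagree.

    The caller is expected to pass only VMs that actually run on this host.
    recorded_vmid is None when the host does not list the address at all.
    """
    owners = host_pci_owners(host_xml)
    mismatches = []
    for vm_xml in vm_xmls:
        vm_id, addresses = vm_pci_addresses(vm_xml)
        for address in addresses:
            recorded = owners.get(address)
            if recorded != vm_id:
                mismatches.append((address, vm_id, recorded))
    return mismatches


if __name__ == "__main__":
    for address, vm_id, recorded in find_mismatches(HOST_BODY, [VM_BODY]):
        print(f"{address}: used by VM {vm_id}, host records VMID {recorded}")
```

With the real body contents pasted in place of the samples, any line printed would point at a device that the VM claims but the host record does not attribute to it, which is the situation described above.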