I created a script that runs “onehost list --list=NAME,ZVM --csv” periodically and gets the number of zombie VMs we have. If this zombie count stays greater than 0 for 5 minutes, I raise an alarm to Teams so that we get notified that there’s a problem. There’s just one snag: that command sometimes returns zombies that aren’t actually present, at least not at that moment.
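For context, the check is roughly along these lines (a simplified sketch: the webhook URL and state-file path are placeholders, and the sum assumes the two-column NAME,ZVM CSV layout):

```bash
#!/usr/bin/env bash
# Sketch of the zombie check. WEBHOOK_URL and STATE_FILE are placeholders.

WEBHOOK_URL="https://example.webhook.office.com/..."   # placeholder Teams webhook
STATE_FILE="/var/tmp/zombie_first_seen"                # placeholder state file

# Sum the ZVM column over all hosts, skipping the CSV header line.
zombies=$(onehost list --list=NAME,ZVM --csv | awk -F, 'NR > 1 {sum += $2} END {print sum + 0}')

if [ "$zombies" -gt 0 ]; then
    now=$(date +%s)
    [ -f "$STATE_FILE" ] || echo "$now" > "$STATE_FILE"
    first_seen=$(cat "$STATE_FILE")
    # Alert only once the zombie count has been non-zero for 5 minutes.
    if [ $((now - first_seen)) -ge 300 ]; then
        curl -s -H 'Content-Type: application/json' \
             -d "{\"text\": \"OpenNebula reports ${zombies} zombie VM(s) for over 5 minutes\"}" \
             "$WEBHOOK_URL"
    fi
else
    rm -f "$STATE_FILE"
fi
```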
We keep spawning thousands of virtual machines per day, which get deleted soon after. So could there be a situation where OpenNebula has told virsh to delete a VM, the VM is still in the process of being deleted and therefore still present on the host, but OpenNebula has already marked it as non-existent? And at that precise point in time my script runs the zombie query, which returns that VM just before virsh finishes deleting it?
On top of that, the poller doesn’t seem to query zombies very frequently. I had time to log in to the host, run “virsh list”, log in to the OpenNebula GUI, look at the VMs present on that host, and compare that to “onehost list” on the command line. virsh and the GUI both showed the same number of VMs, but “onehost list” still reported zombies. What’s the poll interval of this query? And even if I adjust the interval, it’s still possible that the query just happens to run at the precise moment something is being deleted but is still present according to virsh.
Hello @tosaraja,
I’ve just tested it myself and you are correct: there is quite a big delay between the time OpenNebula removes the VM from the host and the time the monitoring reports the VM as poweroff. If the monitoring sends SYNC_STATE during this period, the VM is reported as a zombie.
And it remains a zombie until the next SYNC_STATE; this period can be set in /etc/one/monitord.conf, the default value is 180 seconds.
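For reference, on recent releases this should be the SYNC_STATE_VM attribute inside the PROBES_PERIOD section of monitord.conf (check the file shipped with your version, the exact layout may differ):

```
PROBES_PERIOD = [
  ...
  SYNC_STATE_VM = 180   # seconds between full VM state syncs
]
```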
Even if OpenNebula receives an update from the monitoring that the VM is in poweroff, it doesn’t remove it from the zombies; it waits for the next SYNC_STATE. At least this part should be an easy fix.
Right now, I’m not sure if we are able to fix the first case: not putting the VM into zombies while it’s shutting down.
We have to be careful here; in the past we had some false poweroff states from virsh, which could trigger unnecessary hooks. That’s why we check several times before reporting the VM as poweroff.
If it’s a serious issue for you, feel free to create a GitHub issue to get more attention.
Could we end up in this situation? All CPUs on a host are taken by VMs, and on top of that we have pending VMs that won’t fit on the host. Then, as a VM is deleted, OpenNebula frees up its resources (CPU and RAM), but virsh hasn’t destroyed the VM yet. OpenNebula now sees free resources and tries to boot up a pending VM, which would fail since virsh still has the old VM running.