I created a script that runs “onehost list --list=NAME,ZVM --csv” periodically and gets the number of zombie VMs we have. If this zombie count stays greater than 0 for 5 minutes, I raise an alarm to Teams so that we get notified that there’s a problem. There’s just one snag: that command sometimes returns zombies that aren’t actually present, at least not at that moment.
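For context, the check is roughly along these lines (a simplified sketch: the webhook URL and state-file path are placeholders, and the sum assumes the two-column NAME,ZVM CSV layout):

```bash
#!/usr/bin/env bash
# Sketch of the zombie check. WEBHOOK_URL and STATE_FILE are placeholders.

WEBHOOK_URL="https://example.webhook.office.com/..."   # placeholder Teams webhook
STATE_FILE="/var/tmp/zombie_first_seen"                # placeholder state file

# Sum the ZVM column over all hosts, skipping the CSV header line.
zombies=$(onehost list --list=NAME,ZVM --csv | awk -F, 'NR > 1 {sum += $2} END {print sum + 0}')

if [ "$zombies" -gt 0 ]; then
    now=$(date +%s)
    [ -f "$STATE_FILE" ] || echo "$now" > "$STATE_FILE"
    first_seen=$(cat "$STATE_FILE")
    # Alert only once the zombie count has been non-zero for 5 minutes.
    if [ $((now - first_seen)) -ge 300 ]; then
        curl -s -H 'Content-Type: application/json' \
             -d "{\"text\": \"OpenNebula reports ${zombies} zombie VM(s) for over 5 minutes\"}" \
             "$WEBHOOK_URL"
    fi
else
    rm -f "$STATE_FILE"
fi
```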
We keep spawning thousands of virtual machines per day, which get deleted soon after. So could there be a situation where OpenNebula has told virsh to delete a VM, the VM is still in the process of being deleted and therefore still present on the host, but OpenNebula has already marked it as non-existent? And at that precise point in time my script runs the zombie query, which returns that VM just before virsh finishes deleting it?
On top of that, the poller doesn’t seem to query zombies very frequently. I had time to log in to the host, run “virsh list”, log in to the OpenNebula GUI, look at the VMs present on that host, and compare that to “onehost list” on the command line. virsh and the GUI both showed the same number of VMs, but “onehost list” still reported zombies. What’s the poll interval of this query? And even if I adjust the interval, it’s still possible that the query just happens to run at the precise moment something is being deleted but is still present according to virsh.
Hello @tosaraja,
I’ve just tested it myself and you are correct: there is quite a big delay between the time OpenNebula removes the VM from the host and the time the monitoring reports the VM as poweroff. If the monitoring sends SYNC_STATE during this period, the VM is reported as a zombie.
And it remains a zombie until the next SYNC_STATE; this period can be set in /etc/one/monitord.conf, the default value is 180 seconds.
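For reference, on recent releases this should be the SYNC_STATE_VM attribute inside the PROBES_PERIOD section of monitord.conf (check the file shipped with your version, the exact layout may differ):

```
PROBES_PERIOD = [
  ...
  SYNC_STATE_VM = 180   # seconds between full VM state syncs
]
```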
Even if OpenNebula receives an update from the monitoring that the VM is in poweroff, it doesn’t remove it from the zombies; it waits for the next SYNC_STATE. At least this part should be an easy fix.
Right now, I’m not sure if we are able to fix the first case: not putting the VM into zombies while it’s shutting down.
We have to be careful here; in the past we had some false poweroff states from virsh, which could trigger unnecessary hooks. That’s why we check several times before reporting the VM as poweroff.
If it’s a serious issue for you, feel free to create a GitHub issue to get more attention.
Could we end up in this situation? All CPUs on a host are taken by VMs, and on top of that we have pending VMs that won’t fit on the host. Then, as a VM is deleted, OpenNebula frees up its resources (CPU and RAM), but virsh hasn’t destroyed the VM yet. OpenNebula now sees free resources and tries to boot up a pending VM, which would fail since virsh still has the old VM running.