Random VMs incorrectly in POWEROFF state after upgrade to 5.12

Hello,

After upgrading from 5.10 to 5.12.0.4, many VMs on different hosts are shown in the POWEROFF state with the infamous message:

Tue Aug 31 15:47:17 2021 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Tue Aug 31 15:47:17 2021 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Tue Aug 31 15:47:19 2021 [Z0][VM][I]: New state is POWEROFF
Tue Aug 31 15:47:19 2021 [Z0][VM][I]: New LCM state is LCM_INIT

virsh list shows all the VMs within about 1 s; however, when I run poll.rb, I only see 2-3 VMs in the output.

[14:43:20] server10.place6:/var/tmp/one/im/kvm-probes.d/vm$ ruby monitor/poll.rb  | wc -l                                                                                   
1
[14:43:25] server10.place6:/var/tmp/one/im/kvm-probes.d/vm$
[14:43:29] server10.place6:/var/tmp/one/im/kvm-probes.d/vm$ ruby monitor/poll.rb  | wc -l                                                                                   
3
[14:45:24] server10.place6:/var/tmp/one/im/kvm-probes.d/vm$ virsh --connect qemu:///system list | wc -l                                                                      
188
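
A rough way to see which running domains are missing from the probe output (this assumes each VM record printed by poll.rb contains its libvirt domain name, e.g. one-123; adjust the grep if your output looks different):

cd /var/tmp/one/im/kvm-probes.d/vm
probe_out=$(ruby monitor/poll.rb)
for dom in $(virsh --connect qemu:///system list --name); do
    # report every running domain that the probe does not mention
    echo "$probe_out" | grep -q "$dom" || echo "missing from probe output: $dom"
done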

I already ran onedb fsck, which corrected some quotas and complained about incorrect leases, but aside from that it did not report any issues.
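
For anyone following along, the fsck sequence looks roughly like this (service name and options may differ per installation; onedb may also need explicit DB connection options such as -S/-u/-p/-d if it cannot read them from oned.conf):

systemctl stop opennebula      # oned must not be running during the check
onedb backup -v                # take a database backup first
onedb fsck -v                  # fixes quotas, reports lease inconsistencies
systemctl start opennebula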

What’s the next best step for debugging this?

The monitoring system changed between those two versions. The best approach is to remove the /var/tmp/one directory on the hosts and execute onehost sync -f to force the update of the probes on the hosts.
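
In practice that looks roughly like this (server10.place6 is just the example host from above; repeat the first step for every affected host):

ssh oneadmin@server10.place6 'rm -rf /var/tmp/one'   # clear the old probes on the host
onehost sync -f                                      # push the 5.12 probes from the frontend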

Then you can disable/enable the hosts to trigger a restart of the monitor agent (or wait a couple of minutes for the automatic recovery…).
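
Cycling a host to restart its monitor agent would be something like:

onehost disable server10.place6
onehost enable server10.place6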