Monitoring falsely reports VMs in "poff" state

I have been recovering my cloud from a hard power fail this morning. All the VMs on the nodes were
in "unkn" state. I executed onevm boot on them and they have all come back up. Looking at a single
host, the problem is the following:

  ID USER     GROUP    NAME            STAT UCPU UMEM HOST       TIME
1920 matyas   users    cce-sl6h        runn    0 1.9G fcl005 31d 19h54
1931 matyas   users    sw1790-sl6h     runn    0 1.9G fcl005 27d 19h50
2026 zvada    users    CLI_DynamicIP_S poff    0   0K fcl005 23d 19h53
2038 blin     users    CLI_DynamicIP_S poff    0   0K fcl005 21d 02h19
3246 tlevshin users    gratiaweb       poff    0   0K fcl005  9d 18h12
3255 oneadmin oneadmin deswn           poff    0   0K fcl005  8d 23h02
3264 oneadmin oneadmin deswn           poff    0   0K fcl005  8d 22h58

onevm list shows that there are 7 VMs on the host, 2 in runn and 5 in poff.
But in fact all 7 VMs are running:

[root@fcl005 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 1     one-1920                       running
 2     one-1931                       running
 3     one-2026                       running
 4     one-2038                       running
 5     one-3246                       running
 6     one-3255                       running
 7     one-3264                       running

I can log in to all of them, and they are all pingable.
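
In case it helps anyone else chasing the same mismatch, here is a rough sketch of how to see
the two views side by side (assuming the domains are named one-<vmid> as above, that the onevm
CLI is reachable from wherever you run this, and that the awk field positions match your output):

# Rough sketch: for each one-<vmid> domain libvirt knows about on this host,
# print the libvirt state next to the state OpenNebula reports for that VM.
for dom in $(virsh list --all | awk '/one-/ {print $2}'); do
    id=${dom#one-}
    libvirt_state=$(virsh domstate "$dom")
    one_state=$(onevm show "$id" | awk -F: '/^STATE/ {gsub(/ /, "", $2); print $2}')
    echo "$dom  libvirt=$libvirt_state  opennebula=$one_state"
done

In my case every line shows libvirt=running while OpenNebula reports POWEROFF for five of them.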

The monitoring Ruby script seems to be running OK and all remote files appear to be in order.

What might cause this? This node has been in this state for a couple of hours now, so I don't think
it is transient. How do I reset it?

I am running OpenNebula 4.8.

Steve Timm

I see this is related to bug 3212, which is supposedly fixed in OpenNebula 4.10.2.
I do not have the human resources available to upgrade to the latest release at the moment.
Any advice on how to reset a running OpenNebula 4.8.0 installation so that it correctly reflects
the state of the VMs? We need to get this fixed.

PS: this is most likely to happen when you run onevm boot on all the VMs on a large node at once, or
nearly at once. There is a transient state where libvirt reports the VM in poweroff state
just before it starts, and this can last longer than expected if you are booting a bunch of them at once.
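
If that is what triggers it, one way to reduce the risk might be to stagger the boots instead of
firing them all back to back, roughly like this (sketch only; column 5 is STAT and column 8 is HOST
in my onevm list output above, and the sleep interval is a guess to tune locally):

# Sketch: boot the unkn VMs on one node one at a time with a pause in between,
# rather than issuing all the onevm boot commands at once.
onevm list | awk -v host=fcl005 '$5 == "unkn" && $8 == host {print $1}' |
while read -r id; do
    onevm boot "$id"
    sleep 30   # arbitrary pause so libvirt is past the transient poweroff state
done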

Hi,

If you are on 4.8, I think the safest thing to do is to shut down the guests and then try the onevm resume command.
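
Roughly something like this for each of the poff VMs (a sketch, using VM 2026 from your listing and
the one-<id> domain naming; shut the guest down cleanly on the host, wait until it is really off,
then resume it from the front end):

# On the host (fcl005): ask the guest to shut down cleanly
virsh shutdown one-2026

# Wait until this reports "shut off"
virsh domstate one-2026

# Then, from the OpenNebula front end, bring it back:
onevm resume 2026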

PS: we've been working hard on master to improve these manual recovery situations. 4.14 will be able to handle your problem with a 'recover --success' action from Sunstone.
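
The same action should also be available from the CLI; from memory it will look roughly like this
(double check the exact syntax once 4.14 is out):

# 4.14+ only (not available in 4.8): tell OpenNebula the last operation
# succeeded so it re-syncs the VM state without touching the guest
onevm recover <vmid> --success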