Shrödinger's virtual machine: simultaneously in RUNNING and POWEROFF states

Hi all,

I’ve managed to get myself into a pickle with OpenNebula 5.4.13… I had to evacuate one of my hosts as I had a hardware failure in my storage cluster (Ceph) and thus needed the motherboard from my second compute node to use to get the storage cluster back up properly.

I shut both VMs that were running on this host down cleanly, then attempted to re-deploy them on the remaining compute node. I first attempted this with migration (the VM host at this point had already been powered off and re-purposed), then when that failed, told it to delete and re-create the VMs (they run on persistent disks in Ceph, so data loss was not a concern). This seemed to work, they appeared on the remaining compute node and started to boot.

Then they got marked as POWEROFF.
Then OpenNebula tried booting them up a second time.
libvirt on the host rightly told OpenNebula to bugger off because the instance was already there.
OpenNebula created another deployment file and tries to re-create the VMs yet again.
libvirt again tells OpenNebula to go take a long walk.
OpenNebula fails to take the hint and tries re-creating the VMs yet again.

Meanwhile the VMs in question are happily running on the host in the midst of this VM creation war between the OpenNebula front-end and the VM host.

Fri May 17 20:40:10 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri May 17 20:40:10 2019 [Z0][VM][I]: New state is PENDING
Fri May 17 20:44:03 2019 [Z0][VM][I]: New state is ACTIVE
Fri May 17 20:44:17 2019 [Z0][VM][I]: New LCM state is PROLOG
Fri May 17 20:46:10 2019 [Z0][VM][I]: New LCM state is BOOT
Fri May 17 20:46:22 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/42/deployment.15
Fri May 17 20:46:22 2019 [Z0][VM][I]: Virtual Machine has no context
Fri May 17 20:46:29 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 20:46:29 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri May 17 20:46:29 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 20:46:29 2019 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Fri May 17 20:46:29 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 20:46:29 2019 [Z0][VMM][I]: Successfully execute network driver operation: post.
Fri May 17 20:46:38 2019 [Z0][VM][I]: New LCM state is RUNNING
Fri May 17 20:52:46 2019 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Fri May 17 20:52:46 2019 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Fri May 17 20:53:16 2019 [Z0][VM][I]: New state is POWEROFF
Fri May 17 20:53:16 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri May 17 20:55:36 2019 [Z0][VM][I]: New state is ACTIVE
Fri May 17 20:55:36 2019 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Fri May 17 20:55:56 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/42/deployment.16
Fri May 17 20:55:56 2019 [Z0][VM][I]: Virtual Machine has no context
Fri May 17 20:56:02 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 20:56:02 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri May 17 20:56:02 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/0/42/deployment.16' 'beryllium.vhost.lan' 42 beryllium.vhost.lan
Fri May 17 20:56:02 2019 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/0/42/deployment.16
Fri May 17 20:56:02 2019 [Z0][VMM][I]: error: operation failed: domain 'one-42' already exists with uuid fc9fadaa-9028-4371-860c-8105672fb5cb
Fri May 17 20:56:02 2019 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/0/42/deployment.16
Fri May 17 20:56:02 2019 [Z0][VMM][I]: ExitCode: 255
Fri May 17 20:56:03 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Fri May 17 20:56:03 2019 [Z0][VMM][E]: Error deploying virtual machine: Could not create domain from /var/lib/one//datastores/0/42/deployment.16
Fri May 17 20:56:13 2019 [Z0][VM][I]: New state is POWEROFF
Fri May 17 20:56:13 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri May 17 21:00:16 2019 [Z0][VM][I]: New state is ACTIVE
Fri May 17 21:00:16 2019 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Fri May 17 21:00:36 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/42/deployment.17
Fri May 17 21:00:36 2019 [Z0][VM][I]: Virtual Machine has no context
Fri May 17 21:00:47 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 21:00:49 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri May 17 21:00:49 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/0/42/deployment.17' 'beryllium.vhost.lan' 42 beryllium.vhost.lan
Fri May 17 21:00:49 2019 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/0/42/deployment.17
Fri May 17 21:00:49 2019 [Z0][VMM][I]: error: operation failed: domain 'one-42' already exists with uuid fc9fadaa-9028-4371-860c-8105672fb5cb
Fri May 17 21:00:49 2019 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/0/42/deployment.17
Fri May 17 21:00:49 2019 [Z0][VMM][I]: ExitCode: 255
Fri May 17 21:00:49 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Fri May 17 21:00:49 2019 [Z0][VMM][E]: Error deploying virtual machine: Could not create domain from /var/lib/one//datastores/0/42/deployment.17
Fri May 17 21:00:59 2019 [Z0][VM][I]: New state is POWEROFF
Fri May 17 21:00:59 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri May 17 21:02:35 2019 [Z0][VM][I]: New state is ACTIVE
Fri May 17 21:02:35 2019 [Z0][VM][I]: New LCM state is HOTPLUG_EPILOG_POWEROFF
Fri May 17 21:03:29 2019 [Z0][LCM][I]: VM Disk successfully detached.
Fri May 17 21:03:37 2019 [Z0][VM][I]: New state is POWEROFF
Fri May 17 21:04:39 2019 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Fri May 17 21:04:59 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/42/deployment.18
Fri May 17 21:04:59 2019 [Z0][VM][I]: Virtual Machine has no context
Fri May 17 21:05:14 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 21:05:32 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri May 17 21:05:32 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 21:05:32 2019 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Fri May 17 21:05:32 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 21:05:32 2019 [Z0][VMM][I]: Successfully execute network driver operation: post.
Fri May 17 21:05:44 2019 [Z0][VM][I]: New LCM state is RUNNING
Fri May 17 21:11:25 2019 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Fri May 17 21:11:25 2019 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Fri May 17 21:11:54 2019 [Z0][VM][I]: New state is POWEROFF
Fri May 17 21:11:54 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri May 17 21:15:06 2019 [Z0][VM][I]: New state is ACTIVE
Fri May 17 21:15:06 2019 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Fri May 17 21:15:35 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/42/deployment.19
Fri May 17 21:15:35 2019 [Z0][VM][I]: Virtual Machine has no context
Fri May 17 21:15:44 2019 [Z0][VMM][I]: ExitCode: 0
Fri May 17 21:15:44 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri May 17 21:15:44 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/0/42/deployment.19' 'beryllium.vhost.lan' 42 beryllium.vhost.lan
Fri May 17 21:15:44 2019 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/0/42/deployment.19
Fri May 17 21:15:44 2019 [Z0][VMM][I]: error: operation failed: domain 'one-42' already exists with uuid a6589a7f-a401-4ef1-9f9c-05f6fc2f91e6
Fri May 17 21:15:44 2019 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/0/42/deployment.19
Fri May 17 21:15:44 2019 [Z0][VMM][I]: ExitCode: 255
Fri May 17 21:15:44 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Fri May 17 21:15:44 2019 [Z0][VMM][E]: Error deploying virtual machine: Could not create domain from /var/lib/one//datastores/0/42/deployment.19
Fri May 17 21:15:50 2019 [Z0][VM][I]: New state is POWEROFF

If I SSH into the affected server, and issue a poweroff command as root, the VM does start back up again (at the direction of oned), but then OpenNebula seemingly forgets that it started the VM and proceeds to re-create it. If I tell it to “recover” the VM, it tells me it can’t do that because it’s powered off. I happen to agree with libvirt, the VM is happily running – the news of its demise is grossly exaggerated, but oned won’t have it.

How do I convince oned is wrong and to accept the truth that the VM is alive and well?