How to recover from BOOT_UNDEPLOY_FAILURE?

Hi,

A user undeployed his VM, but it failed because of storage problems on the VM host (at least that's what I think). I deleted another VM on the host to free space, but a retry does not help. The VM stays stuck in BOOT_UNDEPLOY_FAILURE, and a --recover --interactive returns an error saying that BOOT_UNDEPLOY_FAILURE does not support these options.
On the target host I can see the disk file, but it is only a few KB in size. On the previous host, where it was stored during undeployment, I can see the 18 GB disk file. Can somebody help me recover the VM, either back into a running state or by safely undeploying it again (undoing the failed deploy)?

The log says:
Mon Jan 31 10:48:13 2022 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/101/237/deployment.4' 'host1' 237 host1
Mon Jan 31 10:48:13 2022 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/101/237/deployment.4
Mon Jan 31 10:48:13 2022 [Z0][VMM][I]: error: Cannot access storage file '/var/lib/one//datastores/101/237/disk.0' (as uid:9869, gid:9869): No such file or directory
Mon Jan 31 10:48:13 2022 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/101/237/deployment.4

Thanks

Unfortunately, you may have been hit by this one: "VM may lose qcow2 disk after undeploy/resume" (Issue #5702, OpenNebula/one on GitHub).

There is a tentative patch linked to the issue that you can apply.

Can you post the VM directory content from the host?

ls -la /var/lib/one/datastores/101/237/
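If it helps, here is a small sketch that goes one step beyond the plain listing: it reports whether disk.0 is present in a VM directory and prints the size in bytes of every disk.* file it finds. The function name check_vm_disks is made up for this example, and the path in the usage line is simply the one from the log above.

```shell
#!/bin/sh
# Sketch only: check_vm_disks is a hypothetical helper, not an OpenNebula tool.
# It prints "MISSING <path>" if disk.0 is absent, then "<bytes> <path>"
# for every regular disk.* file in the given VM directory.
check_vm_disks() {
    vm_dir=$1

    # -e follows symlinks, so a dangling disk.0 link is also reported as missing.
    if [ ! -e "${vm_dir}/disk.0" ]; then
        echo "MISSING ${vm_dir}/disk.0"
    fi

    for f in "${vm_dir}"/disk.*; do
        # Skip the unexpanded glob (no matches) and anything that is not a file.
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        echo "${size} ${f}"
    done
}

# Usage on the host (path taken from the log above):
# check_vm_disks /var/lib/one/datastores/101/237
```

Running it on both the previous and the target host should make it obvious which side still holds the full image.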

Hi,

In the meantime I resolved the problem myself. I looked at other running VMs and saw that they have a disk.0 image of variable size and a rather small disk.1 image. For the stuck VM, only disk.1 had been copied, not disk.0. I manually copied disk.0 from the previous host to the host the VM was deployed to, placing it in the matching datastore and VM folder. After changing the permissions and ownership of the copied disk to the OpenNebula user on that machine, I was able to press the recover button and the VM started up. So I think the problem was that OpenNebula assumed the files were already on the other system when they weren't, probably because the file system had been full.
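For anyone hitting the same state, the manual steps can be sketched roughly as below. This is a dry run that only prints the commands. Several things in it are assumptions and not taken from my setup: the previous host name "host0", the OpenNebula user and group being "oneadmin", the use of rsync for the copy, and `onevm recover <id> --retry` as the CLI counterpart of the recover button. The function name plan_disk_restore is made up for the example.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# plan_disk_restore is a hypothetical helper; review the output before
# running anything by hand on the target host.
plan_disk_restore() {
    prev_host=$1            # host that still holds the full disk.0 (assumed name)
    ds_id=$2                # system datastore ID, e.g. 101
    vm_id=$3                # VM ID, e.g. 237
    owner=${4:-oneadmin}    # assumption: OpenNebula user and group are "oneadmin"
    vm_dir="/var/lib/one/datastores/${ds_id}/${vm_id}"

    # 1. Copy the full disk image into the VM directory on the target host.
    echo "rsync -a --sparse ${owner}@${prev_host}:${vm_dir}/disk.0 ${vm_dir}/disk.0"
    # 2. Make sure the OpenNebula user owns the copied file.
    echo "chown ${owner}:${owner} ${vm_dir}/disk.0"
    # 3. Retry the failed boot (assumed CLI equivalent of the recover button).
    echo "onevm recover ${vm_id} --retry"
}

# Example with the IDs from the log and a made-up previous host name:
plan_disk_restore host0 101 237
```

Printing the plan first keeps the sketch harmless; the first two commands would be run on the target host and the last one on the front-end.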

Is my fix okay, or could it break something later (e.g., when the VM gets undeployed again)?