Debugging domain creation

Executive Summary: In an OpenNebula v5.2 environment grown over 12 years, upgrading the host OS breaks initial domain creation. Live migration still works!


Versions of the related components and OS (frontend, hypervisors, VMs):

CentOS7 with qemu-kvm on 30 Intel-CPU (Supermicro) hosts. Some hosts have been upgraded to Alma9 and are fully updated. The VMs are a mix of versions of FreeBSD, CentOS, Alma, Ubuntu, and Windows. The CentOS7 hosts have been operating properly for many years. Some Alma9 hosts are new installs, others are upgrades; all show the same behavior.

All datastores are NFSv4 with qcow2 images.

Steps to reproduce:

Instantiate a new VM (any OS) and attempt to deploy it to an Alma9 node.

Current results:

Domain creation fails with this in the log:

Wed Aug 6 13:04:41 2025 [Z0][VM][I]: New state is ACTIVE
Wed Aug 6 13:04:41 2025 [Z0][VM][I]: New LCM state is PROLOG
Wed Aug 6 13:04:43 2025 [Z0][VM][I]: New LCM state is BOOT
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/1756/deployment.0
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: ExitCode: 0
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/0/1756/deployment.0' '003-B01s02n2' 1756 003-B01s02n2
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/0/1756/deployment.0
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: error: Cannot access storage file '/var/lib/one//datastores/0/1756/disk.0' (as uid:9869, gid:9869): No such file or directory
Wed Aug 6 13:04:43 2025 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/0/1756/deployment.0
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: ExitCode: 255
Wed Aug 6 13:04:43 2025 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Wed Aug 6 13:04:43 2025 [Z0][VMM][E]: Error deploying virtual machine: Could not create domain from /var/lib/one//datastores/0/1756/deployment.0
Wed Aug 6 13:04:43 2025 [Z0][VM][I]: New LCM state is BOOT_FAILURE

The VM directory on datastore 0 is created with the proper symlink, but the actual disk image file is not created.

Identical VMs created on CentOS7 hosts can be live-migrated to Alma9 hosts.

Expected results:

The domain should be created. When multiple instances are created from the same template and boot image and deployed across a mix of hosts, those placed on CentOS7 hosts boot and those placed on Alma9 hosts do not.

Background:

This is a very old environment, initially set up in 2012 but not diligently maintained since 2017. I am attempting to rejuvenate it with a current OS and OpenNebula while it continues to host scores of active VMs.

Hi, probably off-topic, but why Alma? Oracle Linux is very good: it offers the UEK kernel, fast repos, fast security updates, and live patching. We have been running OL7 and OL9 for years without issues. You can also buy support, which is not nearly as expensive as Red Hat's.

I am sorry that you felt the need to use my concrete problem for unrelated evangelism.

I don’t use OL because I’ve dealt with Oracle substantially as a software vendor/developer over the past quarter century and will never make that mistake again. I do not consider them a trustworthy business partner; they are a predator. I would retire rather than add a reason to deal with them.

Alma offered in-place upgrading (which works!) all the way from CentOS7 to Alma9. If the path to Rocky9 had been as clear, we would have used that.


Hello @billcole,

The error message "No such file or directory" means the disk from the image datastore didn’t transfer correctly to the system datastore.

Since you are using NFS, you should probably check the mounts. It might be that the system datastore isn’t mounted correctly or is temporarily unavailable.

Cheers,

The error message "No such file or directory" means the disk from the image datastore didn’t transfer correctly to the system datastore.

Right. In this case (non-persistent root) it isn’t even really a transfer, just the creation of a qcow2 backing store file in the VM subdirectory of the System DS (0).

Since you are using NFS, you should probably check the mounts. It might be that the system datastore isn’t mounted correctly or is temporarily unavailable.

I wish either of those had been the proximate cause, but neither was. This is a very mature environment with processes that would catch missing mounts or NFS flakiness. I’ve been trying to debug this in between all my other work for ~8 weeks, so I have covered a lot of ground, such as making sure the globally-trusted oneadmin account was in fact trusted for cross-login on all 30 machines and had exactly the same sudo rights on each.

However, I DID find the fix… [detailed in separate top-level response]

FIXED!

I managed to find the fix myself, and since this basic question seems to have been asked often over the years but never answered well, I’ll write it up in detail…

Troubleshooting Process

The "Failed to create domain" message shown above was the first error logged. The ‘deploy’ file is a shell script (deployed to the host) which calls the ‘virsh’ program (the libvirt shell) with the ‘deployment.0’ file in the VM directory ([DS0]/<VMID>/). For the past week or so I wasted time trying to figure out where the input for ‘deploy’ was coming from, with no luck.

This morning I tried reproducing the error on the host by manually walking through what ‘deploy’ does, including the virsh command, from an interactive shell, both as oneadmin and as root. Both attempts failed with the identical ‘no such file’ error as logged; i.e., virsh has no way to create the backing store file from what it is given, so the real failure came BEFORE the call to ‘deploy’. It came from an earlier failure to create a delta file.
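
For anyone retracing this, the manual reproduction boiled down to roughly the following (a sketch: the real ‘deploy’ script does more than just call virsh, and I’m assuming the default libvirt URI qemu:///system):

# on the Alma9 host, as oneadmin and again as root; VM 1756 from the log above
$ virsh --connect qemu:///system create /var/lib/one//datastores/0/1756/deployment.0
error: Failed to create domain from /var/lib/one//datastores/0/1756/deployment.0
error: Cannot access storage file '/var/lib/one//datastores/0/1756/disk.0' (as uid:9869, gid:9869): No such file or directory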

Next Step: Find the Disk Creation

Q: What creates qcow2 delta files with backing stores?
A: The qemu-img utility.

To replicate the process, I tried to create the delta store the same way OpenNebula must do it somewhere: by running qemu-img via ssh as oneadmin from the frontend. That eventually succeeded, but along the way I found that the Alma9 host is running a MUCH more recent version of qemu-img, one which demands an explicit “-F qcow2” argument or else fails with the message “Backing file specified without backing format”. That led me to the man pages: on the frontend (CentOS7) there is no mention of a ‘backing format’ at all, but on the host (Alma9) “-F BACKING_FMT” is shown as a mandatory argument when creating a delta file. I was able to create the delta by adding that flag to the command run as oneadmin via sudo from the frontend, and once that succeeded I was able to boot the VM via Sunstone with the ‘retry’ command.
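
For anyone hitting the same wall, the behavioral difference boils down to this (the backing-file path and overlay name here are illustrative, not taken from my datastores):

# older qemu-img (CentOS7): the backing format is probed from the backing file itself
$ qemu-img create -b /var/lib/one/datastores/1/IMAGE_HASH -f qcow2 disk.0

# newer qemu-img (Alma9): refuses to probe and aborts with
#   "Backing file specified without backing format"
# unless the backing format is given explicitly:
$ qemu-img create -b /var/lib/one/datastores/1/IMAGE_HASH -f qcow2 -F qcow2 disk.0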

To find where I could possibly add that argument into the command as run by OpenNebula, I searched the whole tree of scripts (/var/lib/one/remotes) and found it in /var/lib/one/remotes/tm/qcow2/clone:

$QEMU_IMG create -b $SRC_PATH -f qcow2 $QCOW2_OPTIONS ${DST_PATH}.snap/0

No ‘-F’, but there is $QCOW2_OPTIONS, which looks promising. As it turns out, that can be set in /etc/one/tmrc. Setting it on the frontend and restarting oned fixed the problem. This is the wrong fix, but it is working for me for the time being because I have updated qemu-img on all the old machines. Ideally, QCOW2_OPTIONS should be settable on a per-host basis, since different qemu-img versions can have different mandatory options.
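
For reference, the change was essentially this (a sketch: I’m assuming the backing-format flag is the only option needed, and the exact service-restart command depends on how the frontend was installed):

# /etc/one/tmrc on the frontend
QCOW2_OPTIONS="-F qcow2"

# with that set, the qcow2 TM clone script effectively runs
#   qemu-img create -b $SRC_PATH -f qcow2 -F qcow2 ${DST_PATH}.snap/0
# then restart oned (e.g. systemctl restart opennebula) and use 'retry' on the failed VM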

The right fix is an academic question, as how this works may well have changed between 5.2 and 6.10, so anything I would hack into my 5.2 world would be a wasted effort.
