Problem: format of backing image not specified in the image metadata

With OpenNebula 5.10.x we could still use backing files without any struggle. We also used Ubuntu 18.04 as the OS running both OpenNebula and the nodes.

Now with 5.12.x and after upgrading to Ubuntu 20.04 I stumbled upon the problem that while launching a VM as non-persistent worked, launching a persistent one failed.

The error message was:

error: Failed to create domain from deployment.0
error: Requested operation is not valid: format of backing image '/var/lib/one/datastores/100/qtci-windows-10-x86_64-51.qcow2' of image '/var/lib/one//datastores/0/116/disk.0' was not specified in the image metadata (See https://libvirt.org/kbase/backing_chains.html for troubleshooting)

I came up with a hack to get around this problem. In ln.ssh I added a qemu-img rebase for the local file:
BACKING_FILE=$(qemu-img info ${SRC_PATH} | grep backing | awk '{print $3}') and later on qemu-img rebase "${DST_SNAP_DIR}/0" -F qcow2 -b "${BACKING_FILE}"
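A side note for anyone copying this hack: once the backing format is recorded, qemu-img info prints a "backing file format:" line in addition to the "backing file:" line, so a plain grep backing can match both. Below is a minimal sketch of a stricter parser. The function name backing_file_from_info and the sample text are made up for illustration and are not part of the OpenNebula drivers.

```shell
#!/bin/sh
# Hypothetical helper: print the backing file path found in `qemu-img info`
# text read from stdin. Only the "backing file:" line is matched, so the
# separate "backing file format:" line is ignored.
backing_file_from_info() {
    sed -n 's/^backing file: //p'
}

# Made-up sample of `qemu-img info` output, used instead of a real image.
sample_info='image: disk.0
file format: qcow2
backing file: /var/lib/one/datastores/100/base.qcow2
backing file format: qcow2'

# Prints only the path from the "backing file:" line.
printf '%s\n' "$sample_info" | backing_file_from_info
```

In a driver script this would presumably be used as BACKING_FILE=$(qemu-img info "${SRC_PATH}" | backing_file_from_info), assuming SRC_PATH is set as in ln.ssh.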

So my question is: what has changed, and where? Has libvirt changed in Ubuntu 20.04 so that it now requires metadata that is not written properly when the file "0" is created?

Oh, one more possibly useful detail: when we create these files using backing files, we generate them in the datastore, so that the next host has them readily available as soon as they have been created. So the file "0" just points back to the datastore, like this:
0 -> /var/lib/one/datastores/100/0335420a0bd4c2b52d2a1428fb595691

Here we have a similar thing going on, where I actually got the idea of testing rebase: https://github.com/code-ready/crc/issues/1596

Looks like the only thing needed is to add -F qcow2 to clone.ssh, but I kept modifying it on the build node all the time and not on the OpenNebula host itself… too many variables in this thing for my brain :smiley:

I couldn’t reproduce this error. Could you sum up once again?

The issue is related to persistent images, correct? Why do you need to change clone.ssh then?

Do you have any other modifications, or just the common qcow2 + TM_MAD_SYSTEM="ssh" setup?

I’ll try to gather up all relevant information here.
Our image_ds is a shared drive with ID 100. It uses DS_MAD qcow2, TM_MAD qcow2 and TM_MAD_SYSTEM ssh.
Our system_ds has ID 0. It uses TM_MAD ssh. So that's basically the local directory of every build host.

We provision our virtual machines from a tier1 image with nothing installed to a tier2 image with all compilers and tools installed. These tier2 images are then used in our builds on different hosts. When our CI launches what we call provisioning, it creates a new persistent VM that uses the tier1 image as a backing file. To achieve this, we modified your qemu driver ever so slightly. We end up with a local drive on the host, still under the same folder ~/datastores/0//disk.0.snap/0, but that one is a link to /var/lib/one/datastores/100/ that is a delta file containing the new data we install, and it has some distro image such as /var/lib/one/datastores/100/ubuntu_20.04.qcow2 as a backing file.

The main change we had to make to achieve this was to modify ln.ssh:

diff qcow2/ln.ssh qcow2_backing/ln.ssh
70c70
< cp ${SRC_SNAP_DIR}/\$F ${DST_SNAP_DIR}
---
> ln -s ${SRC_SNAP_DIR}/\$F ${DST_SNAP_DIR}
76c76
< CP_CMD="cp ${SRC_PATH} ${DST_SNAP_DIR}/0"
---
> CP_CMD="ln -s ${SRC_PATH} ${DST_SNAP_DIR}/0"

As you can see, for diff purposes we created our own version of the qcow2 driver, called qcow2_backing.
Naturally we also modified mvds.ssh to rip out all the parts that move the image back to the datastore.

Now all this works with 5.10 on Ubuntu 18.04, but not with 5.12 on Ubuntu 20.04. Once I added "-F qcow2" to the create command, it works… or at least our CI works. However, if I go into OpenNebula's GUI and instantiate a template as persistent, that fails again with the same issue. I've yet to find out which files it actually calls so that I can fix them. No matter which file I try to debug by adding garbage to it, the system keeps behaving the same…
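One way to narrow down which image still lacks the format would be to check the qemu-img info output of each file in the chain. Here is a minimal sketch, assuming the usual "backing file:" and "backing file format:" lines in that output. The function name backing_format_recorded is hypothetical and not something OpenNebula ships.

```shell
#!/bin/sh
# Hypothetical check: read `qemu-img info` text on stdin and succeed unless
# a backing file is listed without a matching "backing file format:" line.
# Images with no backing file at all also pass.
backing_format_recorded() {
    awk '
        /^backing file: /        { has_backing = 1 }
        /^backing file format: / { has_format = 1 }
        END { exit ((has_backing && !has_format) ? 1 : 0) }
    '
}
```

For example, qemu-img info /var/lib/one//datastores/0/116/disk.0 | backing_format_recorded || echo "backing format missing" would flag the disk from the error message above, if it is the one without the format.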

Dammit, I forgot one thing entirely.

We also modify remotes/datastore/fs (or actually create a copy again, called qcow2_backing):

diff fs/clone qcow2_backing/clone
77d76
< log "Copying local image $SRC to the image repository"
79c78,80
< exec_and_log "cp -f $SRC $DST" "Error copying $SRC to $DST"
---
> log "Creating qcow2 backing file (linked clone) $DST from $SRC"
> # It's a read-only and thus immutable qcow2 image. Then we can use a backing file.
> exec_and_log "$QEMU_IMG create -b $SRC -f qcow2 -F qcow2 $DST" "Error creating qcow2 backing file (linked clone) from $SRC to $DST"
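For reference, the command string passed to exec_and_log above could be assembled by a small helper so that the backing format is never forgotten. This is only a sketch: overlay_create_cmd is a made-up name, it only prints the command rather than running it, and it does not quote paths containing spaces.

```shell
#!/bin/sh
# Fall back to plain qemu-img when the driver has not set QEMU_IMG.
QEMU_IMG=${QEMU_IMG:-qemu-img}

# Hypothetical helper: print (do not run) the command that creates a qcow2
# overlay on top of a backing image, always recording the backing format
# with -F.
#   $1 = backing image, $2 = overlay to create,
#   $3 = backing format (optional, defaults to qcow2)
overlay_create_cmd() {
    printf '%s create -f qcow2 -b %s -F %s %s\n' \
        "$QEMU_IMG" "$1" "${3:-qcow2}" "$2"
}
```

In the clone script this might be used as exec_and_log "$(overlay_create_cmd "$SRC" "$DST")" "Error creating qcow2 backing file (linked clone) from $SRC to $DST".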

This was the key thing I forgot to update in the version switch. I have to add the "-F qcow2" there as well. I think I've got it now, sorry for bothering you. But hopefully someone finds this approach useful as well :slight_smile: At least I've gone through and explained it all now :smiley:
