Launching ephemeral VMs fails with permission denied on 5.12.0.3

Hello,

I’ve been investigating a weird issue for the last couple of days and can’t really figure it out: I’m on Debian 10, and ONE 5.12.0.3.

When trying to spin up an ephemeral VM, launch fails with this error in ONE:

Sun Jan 17 11:34:30 2021 : Error deploying virtual machine: Could not create domain from /var/lib/one/datastores/100/89/deployment.0

oned.log has some more details:

Sun Jan 17 11:34:30 2021 [Z0][VMM][D]: Message received: LOG I 89 Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one/datastores/100/89/deployment.0' 'MYHOSTNAME' 89 MYHOSTNAME
Sun Jan 17 11:34:30 2021 [Z0][VMM][D]: Message received: LOG I 89 error: Failed to create domain from /var/lib/one/datastores/100/89/deployment.0
Sun Jan 17 11:34:30 2021 [Z0][VMM][D]: Message received: LOG I 89 error: internal error: process exited while connecting to monitor: 2021-01-17T11:34:30.046811Z qemu-system-x86_64: -drive file=/var/lib/one/datastores/100/89/disk.0,format=qcow2,if=none,id=drive-virtio-disk0,cache=writeback: Could not open backing file: Could not open '/var/lib/one/datastores/103/65ad126e4ee862a2f85c00fe466b515b': Permission denied
Sun Jan 17 11:34:30 2021 [Z0][VMM][D]: Message received: LOG E 89 Could not create domain from /var/lib/one/datastores/100/89/deployment.0

I can create a new persistent VM from the same template without issues. Also, this happens with existing templates and new ones freshly downloaded from the Marketplace.

Permissions don’t seem to have anything wrong: qemu runs as oneadmin, and this user is able to access /var/lib/one/datastores/103/65ad126e4ee862a2f85c00fe466b515b.

Topics on this forum about similar issues suggested to:

  • disable AppArmor -> I don’t have it
  • check /etc/libvirt/qemu.conf is correct -> it is

Any help or hint would be appreciated. Happy to share more details in case I’ve missed something.

Thanks
Giorgio

Hi, it can be some issue with system datastore. When you are running vm with persistent image, then image datastore is used. Can you share your datastores templates?

Not sure I know what a “datastore template” is?

Here are my datastores anyway:

ID,USER,GROUP,NAME,SIZE,AVAIL,CLUSTERS,IMAGES,TYPE,DS,TM,STAT
103,oneadmin,oneadmin,templates,25G,74%,100,8,img,fs,qcow2,on
102,oneadmin,oneadmin,tier2-images,591.5G,24%,100,5,img,fs,qcow2,on
101,oneadmin,oneadmin,tier1-images,97.9G,68%,100,7,img,fs,qcow2,on
100,oneadmin,oneadmin,system,25G,100%,100,0,sys,-,qcow2,on

Note the error seems to be on “103”. Permissions seem to be the same across all of them:

drwxr-x---  7 oneadmin oneadmin 4.0K Dec  2  2019 .
drwxr-x--- 12 oneadmin oneadmin 4.0K Jan 17 16:14 ..
drwxr-xr-x 10 oneadmin oneadmin   10 Jan 17 11:53 100
drwxr-xr-x 10 oneadmin oneadmin   17 Jan 16 15:28 101
drwxr-xr-x  8 oneadmin oneadmin   13 Jan 16 15:32 102
drwxr-xr-x  3 oneadmin oneadmin   11 Jan 17 11:54 103
drwxr-xr-x  2 oneadmin oneadmin 4.0K Jan 17 16:24 .isofiles

And mounts:

tank-tier1/images on /var/lib/one/datastores/101 type zfs (rw,xattr,noacl)
tank-tier1/templates on /var/lib/one/datastores/103 type zfs (rw,xattr,noacl)
tank-tier1/system on /var/lib/one/datastores/100 type zfs (rw,xattr,noacl)
tank-tier2/images on /var/lib/one/datastores/102 type zfs (rw,xattr,noacl)

Thanks!
Giorgio

What is output of:

  • ls -la /var/lib/one/datastores/103/65ad126e4ee862a2f85c00fe466b515b
  • qemu-img info /var/lib/one/datastores/100/89/disk.0

Apologies for the delay, I missed the notification here. Unfortunately I long-deleted VM 89, so will try again and paste a fresh set of outputs (which will obviously not be consistent with the above).

Steps:
1 - Use a clean template downloaded from marketplace
2 - Instantiate a persistent VM from it, confirm it works
3 - Instantiate an ephemeral VM, confirm it fails

Error output:
Sun Mar 14 11:14:00 2021 : Error deploying virtual machine: Could not create domain from /var/lib/one/datastores/100/94/deployment.0

oned.log:

Sun Mar 14 11:13:59 2021 [Z0][VMM][D]: Message received: LOG I 94 Successfully execute transfer manager driver operation: tm_context.
Sun Mar 14 11:13:59 2021 [Z0][VMM][D]: Message received: LOG I 94 Successfully execute network driver operation: pre.
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG I 94 Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one/datastores/100/94/deployment.0' 'my-za' 94 my-za
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG I 94 error: Failed to create domain from /var/lib/one/datastores/100/94/deployment.0
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG I 94 error: internal error: process exited while connecting to monitor: 2021-03-14T11:14:00.264699Z qemu-system-x86_64: -drive file=/var/lib/one/datastores/100/94/disk.0,format=qcow2,if=none,id=drive-virtio-disk0,cache=writeback: Could not open backing file: Could not open '/var/lib/one/datastores/103/2653eac24c15f35c043e7bcde31c7a51': Permission denied
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG E 94 Could not create domain from /var/lib/one/datastores/100/94/deployment.0
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG I 94 ExitCode: 255
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG I 94 Successfully execute network driver operation: clean.
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: LOG I 94 Failed to execute virtualization driver operation: deploy.
Sun Mar 14 11:14:00 2021 [Z0][VMM][D]: Message received: DEPLOY FAILURE 94 Could not create domain from /var/lib/one/datastores/100/94/deployment.0

Permissions for /var/lib/one/datastores/103/2653eac24c15f35c043e7bcde31c7a51:

File: /var/lib/one/datastores/103/2653eac24c15f35c043e7bcde31c7a51
Size: 367853568 	Blocks: 710913     IO Block: 131072 regular file
Device: 2ch/44d	Inode: 260         Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 9869/oneadmin)   Gid: ( 9869/oneadmin)
Access: 2021-03-14 11:14:00.213693350 +0000
Modify: 2021-01-17 11:54:24.316023082 +0000
Change: 2021-01-17 11:54:24.316023082 +0000

And qemu-img info /var/lib/one/datastores/100/94/disk.0:

image: /var/lib/one/datastores/100/94/disk.0
file format: qcow2
virtual size: 2.2G (2361393152 bytes)
disk size: 4.5K
cluster_size: 65536
backing file: /var/lib/one/datastores/103/2653eac24c15f35c043e7bcde31c7a51
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

I’m a bit lost honestly. I’ve tried various solutions with permissions but the issue doesn’t seem to lie there. In my eyes, the fact persistent VMs from the same template are working means in some way all files are correct, permissions do too, but they are the only error we have.

I have high confidence this was broken with the 5.12 update.

What about SELinux, is it enabled? If yes, try to disable it.

I think this is not related to OpenNebula itself, but to qemu version installed on server.

You can try to manually create domain with ‘virsh create /path/to/deployment.xml’

I don’t have selinux at all, and my install is a vanilla, up to date Debian 10, fully managed by OpenNebula. There isn’t any weird config or custom compiled software, and the only “alien” package on the server is “frr” which I use to manage routing and obviously has nothing to do with creating domains.

Manually creating random domains works (even if, admittedly, I couldn’t figure out how to pull the exact command ONE uses so I can’t replicate the exact thing), as does creating non-ephemeral VMs from OpenNebula, which is the confusing bit to me.

OpenNebula generates xml deployment to this file /var/lib/one/datastores/100/94/deployment.0

so you can try to run virsh create /var/lib/one/datastores/100/94/deployment.0

you get same error, but you can start debuging

I just wanna to sit down in a corner and cry. My afternoon went like this:

0 - I kept working on the same two instances started from the same template, 93, persistent, working and 94, ephemeral, not starting.

1 - Started by running diff /var/lib/one/datastores/100/94/deployment.0 /var/lib/one/datastores/100/93/deployment.0, and confirmed the only relevant differences were:

< 			<source file='/var/lib/one/datastores/100/94/disk.0'/>
---
> 			<source file='/var/lib/one/datastores/100/93/disk.0'/>
22c22
< 			<source file='/var/lib/one/datastores/100/94/disk.1'/>
---
> 			<source file='/var/lib/one/datastores/100/93/disk.1'/>

2 - As next step I looked at the full path from every disk.0 to the two actual qcow2 files. This is 94:

root@za:~# ls -lha /var/lib/one/datastores/100/94/
total 11K
drwxr-xr-x  3 oneadmin oneadmin    6 Mar 14 11:13 .
drwxr-xr-x 12 oneadmin oneadmin   12 Mar 14 11:13 ..
-rw-r--r--  1 oneadmin oneadmin 1.4K Mar 14 11:13 deployment.0
lrwxrwxrwx  1 oneadmin oneadmin   13 Mar 14 11:13 disk.0 -> disk.0.snap/0
drwxr-xr-x  2 oneadmin oneadmin    4 Mar 14 11:13 disk.0.snap
-rw-r--r--  1 oneadmin oneadmin 364K Mar 14 11:13 disk.1
root@za:~# ls -lha /var/lib/one/datastores/100/94/disk.0.snap/
total 7.0K
drwxr-xr-x 2 oneadmin oneadmin    4 Mar 14 11:13 .
drwxr-xr-x 3 oneadmin oneadmin    6 Mar 14 11:13 ..
-rw-r--r-- 1 oneadmin oneadmin 193K Mar 14 11:13 0
lrwxrwxrwx 1 oneadmin oneadmin    1 Mar 14 11:13 disk.0.snap -> .

And this is 93:

root@za:~# ls -lha /var/lib/one/datastores/100/93/
total 11K
drwxr-xr-x  2 oneadmin oneadmin    6 Mar 14 11:13 .
drwxr-xr-x 12 oneadmin oneadmin   12 Mar 14 11:13 ..
-rw-r--r--  1 oneadmin oneadmin 1.4K Mar 14 11:13 deployment.0
lrwxrwxrwx  1 oneadmin oneadmin   67 Mar 14 11:13 disk.0 -> /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073.snap/0
lrwxrwxrwx  1 oneadmin oneadmin   65 Mar 14 11:13 disk.0.snap -> /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073.snap
-rw-r--r--  1 oneadmin oneadmin 364K Mar 14 11:13 disk.1
root@za:~# ls -lha /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073.snap/0
lrwxrwxrwx 1 oneadmin oneadmin 60 Mar 14 11:13 /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073.snap/0 -> /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073
root@za:~# ls -lha /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073
-rw-r--r-- 1 oneadmin oneadmin 404M Mar 19 16:25 /var/lib/one/datastores/103/0bc61cd241e410f1adbad737d908f073

3 - Permissions seemed to be ok top to bottom, so I double checked the img status:

root@za:~# qemu-img info /var/lib/one/datastores/100/93/disk.0
image: /var/lib/one/datastores/100/93/disk.0
file format: qcow2
virtual size: 2.2G (2361393152 bytes)
disk size: 353M
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

root@za:~# qemu-img info /var/lib/one/datastores/100/94/disk.0
image: /var/lib/one/datastores/100/94/disk.0
file format: qcow2
virtual size: 2.2G (2361393152 bytes)
disk size: 4.5K
cluster_size: 65536
backing file: /var/lib/one/datastores/103/2653eac24c15f35c043e7bcde31c7a51
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

4 - At this point I spent hours fiddling with any kind of permission. Out of hope, I changed ownership for the entire chain of files to a temporary user, updated qemu.conf and rebooted.

5 - Things were at this point way more broken than they used to be.

6 - Restored the original qemu.conf (effectively fully rolling back the initial status of my machine), rebooted again.


Everything started working. I have no explanation whatsoever for this. I checked all the permissions from my terminal log, and I’ve restored them exactly as they were previously. The only .conf file I’ve touched has been restored from local backup, so it’s the same.

The only thing which happened is a reboot.

Well, I guess all is well what ends well. Thanks a lot for your support.