Hi guys,
While working on OpenNebula 5.0.2 I noticed strange behavior when attaching/detaching disks on my virtual machines.
Here are the tests I ran on two VMs:
TEST 1
I attached a non-persistent image to a VM.
This is how my VM directory /var/lib/one/datastores/0/893 looked:
-rw-rw-r--. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r--r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 12:16 disk.2 -> disk.2.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 12:16 disk.2.snap
When I detached the image, the symbolic link was destroyed but the snapshot directory still existed:
-rw-rw-r--. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r--r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 12:16 disk.2.snap
I then got the following errors trying to attach another disk to the same VM:
- attaching a non-persistent image: Mon Oct 10 12:17:37 2016 : Error attaching new VM Disk: Error copying myserver:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/893/disk.2
- attaching a persistent image: Mon Oct 10 12:18:38 2016 : Error attaching new VM Disk: Error linking myserver:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/893/disk.2
TEST 2
I attached a persistent image to a VM.
This is how my VM directory /var/lib/one/datastores/0/894 looked:
-rw-rw-r--. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r--r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
lrwxrwxrwx. 1 oneadmin oneadmin 67 Oct 10 12:24 disk.2 -> /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap/0
lrwxrwxrwx. 1 oneadmin oneadmin 65 Oct 10 12:24 disk.2.snap -> /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap
When I detached the image, both links created (disk.2 and disk.2.snap) were still present in the VM directory.
If I attached a new persistent disk, the links were overwritten and the disk was attached correctly.
If I attached a non-persistent image, I got the following error:
Mon Oct 10 12:28:26 2016 : Error attaching new VM Disk: Error copying myserver:/var/lib/one//datastores/1/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/894/disk.2
and this is how the VM directory /var/lib/one/datastores/0/894 looked:
-rw-rw-r--. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r--r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 12:26 disk.2 -> disk.2.snap/0
lrwxrwxrwx. 1 oneadmin oneadmin 65 Oct 10 12:24 disk.2.snap -> /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap
My datastores' TM_MAD attribute is qcow2.
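For reference, the TM driver of each datastore involved can be double-checked with the standard CLI, using the datastore IDs from the paths above:

onedatastore show 0 | grep TM_MAD      # system datastore holding the VM dirs
onedatastore show 101 | grep TM_MAD    # image datastore holding the sources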
It seems that something goes wrong during disk detach: the files are not cleaned up correctly.
There should be some more lines in the logs before or after the quoted error messages describing the exact reason for the failure.
I am only reading the scripts, but I think the following changes should fix the reported behavior. Can you try them?
Edit /var/lib/one/remotes/tm/qcow2/clone and replace
mkdir -p ${DST_PATH}.snap
rm -f $DST_PATH
with
rm -rf $DST_PATH{,.snap}
mkdir -p ${DST_PATH}.snap
I think this will fix non-persistent disk attachment after a persistent disk detach. Reading the code, I think it is also possible to get data corruption on the persistent image, as it is probably overwritten with the non-persistent image source by qemu-img. I am not sure without knowing which command of the script failed.
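To illustrate why the order matters, here is a minimal sketch (the DST_PATH value is an example only; the real script receives it as an argument). If disk.2.snap is a leftover symlink pointing into a persistent image's .snap directory, mkdir -p silently follows it, and whatever the clone then writes into the "new" snap directory lands in the old persistent image instead:

DST_PATH=/var/lib/one/datastores/0/893/disk.2   # example value only
rm -rf $DST_PATH{,.snap}    # remove stale disk.2 and disk.2.snap first, so...
mkdir -p ${DST_PATH}.snap   # ...this creates a fresh real directory instead of
                            # following a leftover symlink from a detached disk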
To properly clean up the artifacts after detaching a non-persistent disk, please try editing /var/lib/one/remotes/tm/qcow2/delete, replacing:
Hi Anton,
I applied the changes you suggested. I no longer get the copy error quoted above, but instead one regarding the ln script.
When I retry the attach, I get the following and the VM goes into POWEROFF state:
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/attach_disk 'one-913' '/var/lib/one//datastores/0/913/disk.2' 'vdb' '1' 'DATA' 913 myhost
Tue Oct 11 08:21:09 2016 [Z0][VMM][E]: attach_disk: Command "virsh --connect qemu:///system attach-device one-913 /var/lib/one//datastores/0/913/disk.2.attach" failed: error: Failed to attach device from /var/lib/one//datastores/0/913/disk.2.attach
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: error: Unable to read from monitor: Connection reset by peer
Tue Oct 11 08:21:09 2016 [Z0][VMM][E]: Could not attach /var/lib/one//datastores/0/913/disk.2 (vdb) to one-913
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: ExitCode: 1
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_detach.
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: Failed to execute virtualization driver operation: attach_disk.
Tue Oct 11 08:21:09 2016 [Z0][VMM][E]: Error attaching new VM Disk: Could not attach /var/lib/one//datastores/0/913/disk.2 (vdb) to one-913
Tue Oct 11 08:21:09 2016 [Z0][VM][I]: New LCM state is RUNNING
Tue Oct 11 08:21:51 2016 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Tue Oct 11 08:21:51 2016 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Tue Oct 11 08:21:51 2016 [Z0][VM][I]: New state is POWEROFF
Tue Oct 11 08:21:51 2016 [Z0][VM][I]: New LCM state is LCM_INIT
Hi Anton,
I was doing the same tests with the "shared" TM driver and I noticed a very bad thing.
The clone script in /var/lib/one/remotes/tm/shared/clone does not have
rm -f $DST_PATH
like the qcow2 TM one does.
This means that if you attach a persistent disk to a VM, detach it, and then attach a non-persistent one, you destroy the previously detached persistent disk, because its symbolic link still exists in the VM directory /var/lib/one/datastores/0/VMID.
I don't get any error in the logs because the disk is mounted "correctly".
I added the remove command to the clone script and it resolved the bug.
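For reference, the change I made was along these lines (the exact position inside shared/clone may vary between versions; the point is to drop the stale link just before the copy that creates $DST_PATH):

# in /var/lib/one/remotes/tm/shared/clone, before the copy that creates
# $DST_PATH: remove a possible leftover symlink from a previously detached
# persistent disk, so the copy cannot follow it and overwrite the original
rm -f $DST_PATH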
Can you take a look please?
Best Regards,
Valentina
I would like to note that this patch will work, but the root cause is that there is no cleanup after a persistent disk detach. The proper fix is to develop the mvds script, which needs more time and testing.
Just an update: we are going to keep the original patches from Anton (in fact, some rm operations were already in place). We prefer not to perform additional ssh calls in mvds, and to protect the disks from potential failures while executing mvds. So the idea is to remove any target link before performing any destructive operation on it (see the sketch after the list below).
The affected scripts are:
qcow2/clone
qcow2/ln
shared/clone
shared/ln
(Note that some of them were already working as described)
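The common idea in all of them can be sketched as follows (illustrative only: the variable values are examples taken from the listings above, and the create step differs per script):

DST_PATH=/var/lib/one/datastores/0/894/disk.2   # example target only
SRC_SNAP=/var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap
# 1. clear any stale target first, so no later step can follow an old link
rm -rf $DST_PATH ${DST_PATH}.snap
# 2. only then (re)create the target; e.g. the ln drivers produce the
#    persistent-image layout seen in the TEST 2 listing:
ln -s $SRC_SNAP ${DST_PATH}.snap
ln -s ${SRC_SNAP}/0 $DST_PATH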