OpenNebula 5.0.2 Attach/Detach images - datastore folder not cleaned up

Hi guys,
working on OpenNebula 5.0.2 I noticed strange behaviors attaching/detaching disks on my Virtual Machines.
Here tests I ran on two vms:

TEST 1
I attached a non-persistent image to a VM.
This is how appeared my vm directory /var/lib/one/datastores/0/893:
-rw-rw-r–. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r–r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 12:16 disk.2 -> disk.2.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 12:16 disk.2.snap

When I detached the image, the symbolic link is destroyed but snapshot still exists:
-rw-rw-r–. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r–r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 12:16 disk.2.snap

I got following errors trying to attach on the same vm a disk:
-attaching non-persistent image:
Mon Oct 10 12:17:37 2016 : Error attaching new VM Disk: Error copying myserver:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/893/disk.2
-attaching persistent image:
Mon Oct 10 12:18:38 2016 : Error attaching new VM Disk: Error linking myserver:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/893/disk.2

TEST 2
I attached a persistent image to a VM.
This is how appeared my vm directory /var/lib/one/datastores/0/894:
-rw-rw-r–. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r–r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
lrwxrwxrwx. 1 oneadmin oneadmin 67 Oct 10 12:24 disk.2 -> /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap/0
lrwxrwxrwx. 1 oneadmin oneadmin 65 Oct 10 12:24 disk.2.snap -> /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap

When I detached the image, both files created (disk.2 and disk.2.snap) were still present in the vm directory.

If I attached a new persistent disk link were overwritten and disk was attach correctly.
If I attached non persistent image I got the following error:
Mon Oct 10 12:28:26 2016 : Error attaching new VM Disk: Error copying myserver:/var/lib/one//datastores/1/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/894/disk.2
and this is how appeared the vm directory /var/lib/one/datastores/0/894:
-rw-rw-r–. 1 oneadmin oneadmin 1079 Oct 10 11:16 deployment.0
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 11:16 disk.0 -> disk.0.snap/0
drwxrwxr-x. 2 oneadmin oneadmin 4 Oct 10 11:16 disk.0.snap
-rw-r–r--. 1 oneadmin oneadmin 372736 Oct 10 11:16 disk.1
lrwxrwxrwx. 1 oneadmin oneadmin 13 Oct 10 12:26 disk.2 -> disk.2.snap/0
lrwxrwxrwx. 1 oneadmin oneadmin 65 Oct 10 12:24 disk.2.snap -> /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20.snap

My datastores TM_MAD attribute is qcow2.
Seems that there is something wrong during disk detaching, files doesn’t clean up correctly.

Can you help me out?

Hi Valentina,

There should be some more lines in the logs before or after the quoted error messages describing the exact reason for the failure.

I am only reading the scripts and think the following changes should fix the reported behavior. Can you try the following changes?

Edit /var/lib/one/remotes/tm/qcow2/clone and replace

mkdir -p ${DST_PATH}.snap
rm -f $DST_PATH

with

rm -rf $DST_PATH{,snap}
mkdir -p ${DST_PATH}.snap

I think this will fix the non-persistent disk attachment after persistent disk detach. Reading the code I think it is possible to have data corruption on the persistent image as it is probably overwritten with the non-persistent image source by qemu-img. Not sure without knowing which command of the script failed.

To proper clean-up the artifacts after detaching non-persistent disk, please try editing /var/lib/one/remotes/tm/qcow2/delete replacing:

function=“rm -rf $DST_PATH”

with

function=“rm -rf $DST_PATH{,snap}”

Kind Regards,
Anton Todorov

Hello Anton,
I cheked out logs and errors come from clone and ln scripts as you can see below.

About the test1, non-persistent disk attachment after non-persistent disk detach:

Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: Command execution fail: /var/lib/one/remotes/tm/qcow2/clone myfrontend:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 myhost:/var/lib/one//datastores/0/910/disk.2 910 101
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: /var/lib/one/remotes/tm/qcow2/clone: line 101: [: : integer expression expected
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: clone: Cloning /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20 in myhost:/var/lib/one//datastores/0/910/disk.2
Tue Oct 11 06:42:51 2016 [Z0][VMM][E]: clone: Command “cd /var/lib/one/datastores/0/910
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: mkdir -p /var/lib/one/datastores/0/910/disk.2.snap
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: rm -f /var/lib/one/datastores/0/910/disk.2
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: qemu-img create -b /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20 -f qcow2 /var/lib/one/datastores/0/910/disk.2.snap/0
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: ln -s disk.2.snap/0 /var/lib/one/datastores/0/910/disk.2
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: cd /var/lib/one/datastores/0/910/disk.2.snap
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: ln -s . /var/lib/one/datastores/0/910/disk.2.snap/disk.2.snap” failed: ln: failed to create symbolic link ‘/var/lib/one/datastores/0/910/disk.2.snap/disk.2.snap/.’: File exists
Tue Oct 11 06:42:51 2016 [Z0][VMM][E]: Error copying myfrontend:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/910/disk.2
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: ExitCode: 1
Tue Oct 11 06:42:51 2016 [Z0][VMM][I]: Failed to execute transfer manager driver operation: tm_attach.

About the test1, persistent disk attachment after non-persistent disk detach:

Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: Command execution fail: /var/lib/one/remotes/tm/qcow2/ln myfrontend:/var/lib/one//datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd myhost:/var/lib/one//datastores/0/910/disk.2 910 1
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ln: Linking /var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd in myhost:/var/lib/one//datastores/0/910/disk.2
Tue Oct 11 06:44:57 2016 [Z0][VMM][E]: ln: Command "set -ex
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: mkdir -p "/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap"
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: if [ ! -L “/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap/6f1f6f41c86843157b6a9ba30c1e10cd.snap” ]; then
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ln -s “/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap” “/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap/6f1f6f41c86843157b6a9ba30c1e10cd.snap"
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: fi
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: SNAP=”$(ls /var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap | grep '^[[:digit:]]$’ | sort -n | tail -n 1)"
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: if [ -z “${SNAP}” ]; then
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: SNAP=0
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ln -sf “/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd” "/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap/0"
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: fi
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ln -sf “/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap/$SNAP” "/var/lib/one/datastores/0/910/disk.2"
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: rm -f “/var/lib/one/datastores/0/910/disk.2.snap"
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ln -sf “/var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap” “/var/lib/one/datastores/0/910/disk.2.snap”” failed: + mkdir -p /var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: + ‘[’ ‘!’ -L /var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap/6f1f6f41c86843157b6a9ba30c1e10cd.snap ']'
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ++ ls /var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ++ grep '^[[:digit:]]
$'
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ++ sort -n
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ++ tail -n 1
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: + SNAP=0
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: + ‘[’ -z 0 ']'
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: + ln -sf /var/lib/one/datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd.snap/0 /var/lib/one/datastores/0/910/disk.2
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: + rm -f /var/lib/one/datastores/0/910/disk.2.snap
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: rm: cannot remove ‘/var/lib/one/datastores/0/910/disk.2.snap’: Is a directory
Tue Oct 11 06:44:57 2016 [Z0][VMM][E]: Error linking myhost:/var/lib/one//datastores/1/6f1f6f41c86843157b6a9ba30c1e10cd to myhost:/var/lib/one//datastores/0/910/disk.2
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: ExitCode: 1
Tue Oct 11 06:44:57 2016 [Z0][VMM][I]: Failed to execute transfer manager driver operation: tm_attach.

About test2, disk attachment after persistent disk detach:

Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: Command execution fail: /var/lib/one/remotes/tm/qcow2/clone myfrontend:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 myhost:/var/lib/one//datastores/0/911/disk.2 911 101
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: /var/lib/one/remotes/tm/qcow2/clone: line 101: [: : integer expression expected
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: clone: Cloning /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20 in myhost:/var/lib/one//datastores/0/911/disk.2
Tue Oct 11 06:56:19 2016 [Z0][VMM][E]: clone: Command “cd /var/lib/one/datastores/0/911
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: mkdir -p /var/lib/one/datastores/0/911/disk.2.snap
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: rm -f /var/lib/one/datastores/0/911/disk.2
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: qemu-img create -b /var/lib/one/datastores/101/85f115b6bff8f8c97600d68817743c20 -f qcow2 /var/lib/one/datastores/0/911/disk.2.snap/0
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: ln -s disk.2.snap/0 /var/lib/one/datastores/0/911/disk.2
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: cd /var/lib/one/datastores/0/911/disk.2.snap
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: ln -s . /var/lib/one/datastores/0/911/disk.2.snap/disk.2.snap” failed: ln: failed to create symbolic link ‘/var/lib/one/datastores/0/911/disk.2.snap/disk.2.snap/.’: File exists
Tue Oct 11 06:56:19 2016 [Z0][VMM][E]: Error copying myhost:/var/lib/one//datastores/101/85f115b6bff8f8c97600d68817743c20 to myhost:/var/lib/one//datastores/0/911/disk.2
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: ExitCode: 1
Tue Oct 11 06:56:19 2016 [Z0][VMM][I]: Failed to execute transfer manager driver operation: tm_attach.

I will try to change clone script as you suggested and let you know.
Best Regards,
Valentina

Hi Valentina,

While checking the logs I’ve figured out that I was looking at the drivers of v5.2beta.

I think that to fix this you should patch the /var/lib/one/remotes/tm/qcow2/ln too:

$RM -f “${DST_PATH}.snap”

to

$RM -rf “${DST_PATH}.snap”

Please let me know is the proposed changes working so I can generate push requests for upstream merge.

Kind Regards,
Anton Todorov

Hi Anton,
I applied the changes you suggested but I still get the same error below:

but instead getting this one regarding ln script :

I get the following and the vm goes in poweroff state:

Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/attach_disk ‘one-913’ ‘/var/lib/one//datastores/0/913/disk.2’ ‘vdb’ ‘1’ ‘DATA’ 913 myhost
Tue Oct 11 08:21:09 2016 [Z0][VMM][E]: attach_disk: Command “virsh --connect qemu:///system attach-device one-913 /var/lib/one//datastores/0/913/disk.2.attach” failed: error: Failed to attach device from /var/lib/one//datastores/0/913/disk.2.attach
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: error: Unable to read from monitor: Connection reset by peer
Tue Oct 11 08:21:09 2016 [Z0][VMM][E]: Could not attach /var/lib/one//datastores/0/913/disk.2 (vdb) to one-913
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: ExitCode: 1
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_detach.
Tue Oct 11 08:21:09 2016 [Z0][VMM][I]: Failed to execute virtualization driver operation: attach_disk.
Tue Oct 11 08:21:09 2016 [Z0][VMM][E]: Error attaching new VM Disk: Could not attach /var/lib/one//datastores/0/913/disk.2 (vdb) to one-913
Tue Oct 11 08:21:09 2016 [Z0][VM][I]: New LCM state is RUNNING
Tue Oct 11 08:21:51 2016 [Z0][LCM][I]: VM running but monitor state is POWEROFF
Tue Oct 11 08:21:51 2016 [Z0][VM][I]: New LCM state is SHUTDOWN_POWEROFF
Tue Oct 11 08:21:51 2016 [Z0][VM][I]: New state is POWEROFF
Tue Oct 11 08:21:51 2016 [Z0][VM][I]: New LCM state is LCM_INIT

Hi Valentina,

My bad, missed a dort(.) before the snap please try this one dor /var/lib/one/remotes/tm/qcow2/clone

and for /var/lib/one/remotes/tm/qcow2/delete

Regarding the ln failure, it is strange because the rm -rf is always returning true

Please try witht he updated fixes and paste the new logs if there is failure.

Kind Regards,
Anton Todorov

Hi Anton,
all works as expected!
Let me know if I need to open a bug or a push request.

Best Regards,
Valentina

Hi Valentina,

I am glad it works. I’ll add pull requests linking to this forum topic.

Best Regards,
Anton Todorov

Hi Anton,
I was doing same tests with “shared” transport mode and I noticed a very bad thing.
The clone script in /var/lib/one/remotes/tm/shared/clone doesn’t have

rm -f $DST_PATH

as the one for qcow2 tm.
This means that, if you attach to a persistent disk to a vm, detach it and attach another non-persistent one, you destroy the persistent disk (previosly detached) because the symbolic link still exist in my VM directory /var/lib/one/datastore/0/VMID.
I don’t get any error in the logs because the disk is mounted “correctly”.
I tried to add the remove command in the clone script and resolve the bug.
Can you take a look please?
Best Regards,
Valentina

Hi Valentina,

Please try patching /var/lib/one/remotes/tm/shared/clone to add rm -rf ${DST_PATH}

CLONE_CMD=“cd ${DST_DIR};
rm -rf ${DST_PATH};
cp ${SRC_PATH} ${DST_PATH}
${RESIZE_CMD}”

Best,
Anton

Hi Valentina,

I would like to note that this patch will work but the root cause is that there is no cleanup after persistent disk detach. The proper fix is to develop mvds script that needs more time and testing.

Kinf Regards,
Anton Todorov

Hi Anton,
the patch works.
Thanks again for the help.

King Regards,
Valentina

Hi,

Just an update, we are going to keep the original patches from Anton (in fact some rm operations were already in place). We prefer not to perform additional calls to ssh in mvds, and protect the disks from potential failures executing mvds. So the idea is remove any target link before performing any destructive operation on the target links.

The affected scripts are:
qcow2/clone
qcow2/ln
shared/clone
shared/ln

(Note that some of them were already working as described)

Cheers