Weird error when stopping VM: /var/lib/one/remotes/tm/ssh/mv: line 66: [: missing `]'

Hello,

When stopping VMs, OpenNebula (one) fails to move the image from the system datastore to the image datastore. Here is the log:

Fri Nov 13 13:55:07 2020 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: Command execution failed (exit code: 2): /var/lib/one/remotes/tm/ssh/mv hoppara-test:/var/lib/one//datastores/0/62 s629680.dedi.leaseweb.net:/var/lib/one//datastores/0/62 62 0
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: /var/lib/one/remotes/tm/ssh/mv: line 66: [: missing `]'
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: mv: Moving hoppara-test:/var/lib/one/datastores/0/62 to s629680.dedi.leaseweb.net:/var/lib/one/datastores/0/62
Fri Nov 13 13:55:09 2020 [Z0][TM][E]: mv: Command "set -e -o pipefail
Fri Nov 13 13:55:09 2020 [Z0][TM][I]:
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: tar -C /var/lib/one/datastores/0 --sparse -cf - 62 | ssh s629680.dedi.leaseweb.net 'tar -C /var/lib/one/datastores/0 --sparse -xf -'
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: rm -rf /var/lib/one/datastores/0/62" failed: tar: 62: Cannot stat: No such file or directory
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: tar: Exiting with failure status due to previous errors
Fri Nov 13 13:55:09 2020 [Z0][TM][E]: Error copying disk directory to target host
Fri Nov 13 13:55:09 2020 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Fri Nov 13 13:55:09 2020 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE

The weird thing is I don’t see anything wrong with the bash script /var/lib/one/remotes/tm/ssh/mv

Another oddity that may be related: these two datastores are on the same host. hoppara-test is the current hostname, and s629680.dedi.leaseweb.net is the hostname the machine had when it first booted. Both names resolve to localhost via the /etc/hosts file. But where does OpenNebula get s629680.dedi.leaseweb.net from?


Versions of the related components and OS (frontend, hypervisors, VMs):
OpenNebula 5.12.0.1
CentOS Linux release 8.2.2004 (Core)

The error is

Fri Nov 13 13:55:09 2020 [Z0][TM][E]: mv: Command "set -e -o pipefail
Fri Nov 13 13:55:09 2020 [Z0][TM][I]:
Fri Nov 13 13:55:09 2020 [Z0][TM][I]: tar -C /var/lib/one/datastores/0 --sparse -cf - 62 | ssh s629680.dedi.leaseweb.net 'tar -C /var/lib/one/datastores/0 --sparse -xf -'

and the line that reports the failure is

Fri Nov 13 13:55:09 2020 [Z0][TM][I]: rm -rf /var/lib/one/datastores/0/62" failed: tar: 62: Cannot stat: No such file or directory

which suggests that

tar -C /var/lib/one/datastores/0 --sparse -cf - 62 

fails because there is no directory 62, which is the directory in the datastore where the VM files (deployment and disk files) are supposed to be stored.

Figure out why the directory is missing in your setup, and try running the command manually to check that it works.
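
For a quick local check, here is a rough sketch of the same tar pipeline with the ssh hop replaced by a second local tar. The temporary directories and the disk.0 file name are made up for illustration and are not your real datastore. It shows the pipeline working when the directory exists and failing once it is gone.

```shell
#!/usr/bin/env bash
# Scratch area standing in for /var/lib/one/datastores/0 (illustrative only).
work=$(mktemp -d)
mkdir -p "$work/src/62" "$work/dst"
echo "data" > "$work/src/62/disk.0"

# Same tar options as in the log; the ssh hop is replaced by a local tar.
tar -C "$work/src" --sparse -cf - 62 | tar -C "$work/dst" --sparse -xf -
copied=$(cat "$work/dst/62/disk.0")

# Remove the source directory and run the archiving side again:
# tar can no longer stat "62" and exits with a non-zero status.
rm -rf "$work/src/62"
missing_status=0
tar -C "$work/src" --sparse -cf - 62 >/dev/null 2>&1 || missing_status=$?

echo "copied: $copied, status with missing dir: $missing_status"
rm -rf "$work"
```

If the second tar run fails in the same way on the real host, the directory was already missing before the copy started.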

Here is the bug in /var/lib/one/remotes/tm/ssh/mv line 66:
if [ -n "$SRC_INODE"] && [ -n "$DST_INODE" ] && [ "$SRC_INODE" = "$DST_INODE" ]; then

A space is missing between "$SRC_INODE" and ].
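
To see why the missing space produces exactly this message, here is a minimal sketch. The inode values and the broken_status and same_dir variables are illustrative and not taken from the real script. Without the space, the shell passes "$SRC_INODE"] to [ as a single word, so [ never receives its closing ] and reports an error instead of evaluating the test.

```shell
#!/usr/bin/env bash
# Illustrative values only; the real script obtains these elsewhere.
SRC_INODE=12345
DST_INODE=12345

# Broken form (as on line 66): "$SRC_INODE"] is one word, so the `[`
# command never sees a closing `]` and returns a non-zero error status.
broken_status=0
[ -n "$SRC_INODE"] 2>/dev/null || broken_status=$?

# Corrected form: `]` is a separate argument.
same_dir=no
if [ -n "$SRC_INODE" ] && [ -n "$DST_INODE" ] && [ "$SRC_INODE" = "$DST_INODE" ]; then
    same_dir=yes
fi

echo "broken test status: $broken_status, same_dir: $same_dir"
```

Because the broken test sits inside an if condition, the error only makes the condition false; the script carries on as if the inodes were different.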

Best regards,
AR

It is fixed in 5.12.0.3: M #-: fixed incorrect bash syntax (#4991), commit OpenNebula/one@1cf1de9 on GitHub.

Thanks, good to know.

This bug actually deletes the VM from disk: because the check on line 66 fails, the script wrongly assumes that the host has changed and tries to transfer the VM to the same host. The transfer then fails, because the script deletes the VM directory before the copy starts.
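
As a rough illustration of the kind of guard line 66 is meant to provide, here is a sketch that compares the inode numbers of two paths on the local machine. The same_dir_guard function, the temporary paths and the use of stat -c %i (GNU coreutils) are assumptions for illustration; this thread does not show how the real script obtains SRC_INODE and DST_INODE.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "skip" when both paths are the same directory,
# "transfer" otherwise.
same_dir_guard() {
    local src_inode dst_inode
    src_inode=$(stat -c %i "$1" 2>/dev/null)
    dst_inode=$(stat -c %i "$2" 2>/dev/null)
    if [ -n "$src_inode" ] && [ -n "$dst_inode" ] && [ "$src_inode" = "$dst_inode" ]; then
        echo skip
    else
        echo transfer
    fi
}

# Scratch tree with two names for the same directory (illustrative only).
work=$(mktemp -d)
mkdir -p "$work/datastores/0/62" "$work/datastores/0/63"
ln -s "$work/datastores" "$work/alias"

guard_same=$(same_dir_guard "$work/datastores/0/62" "$work/alias/0/62")
guard_diff=$(same_dir_guard "$work/datastores/0/62" "$work/datastores/0/63")

echo "same directory: $guard_same, different directories: $guard_diff"
rm -rf "$work"
```

With a working guard, a move whose source and destination are really the same directory is skipped instead of being treated as a transfer between hosts.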

This actually happened in production. Thankfully, I had the datastore on ZFS with regular snapshots.

Best regards,
AR