Onevm migrate and cloned images

timm · April 17, 2015, 7:51pm

In my OpenNebula 4.8 installation I have one Image datastore on NFS and one System datastore which
is local to the disk.

[root@fclheadgpvm01 init]# onedatastore show 102
DATASTORE 102 INFORMATION
ID : 102
NAME : cloud_images
USER : oneadmin
GROUP : oneadmin
CLUSTER : -
TYPE : IMAGE
DS_MAD : fs
TM_MAD : shared
BASE PATH : /var/lib/one/datastores/102
DISK_TYPE : FILE

DATASTORE CAPACITY
TOTAL: : 20T
FREE: : 11.8T
USED: : 1.1T
LIMIT: : -

PERMISSIONS
OWNER : um-
GROUP : u--
OTHER : ---

DATASTORE TEMPLATE
BASE_PATH="/var/lib/one/datastores/"
CLONE_TARGET="SYSTEM"
DATASTORE_CAPACITY_CHECK="NO"
DISK_TYPE="FILE"
DS_MAD="fs"
LN_TARGET="NONE"
TM_MAD="shared"
TYPE="IMAGE_DS"

[root@fclheadgpvm01 init]# onedatastore show 104
DATASTORE 104 INFORMATION
ID : 104
NAME : localnode_fcl
USER : oneadmin
GROUP : oneadmin
CLUSTER : fcl
TYPE : SYSTEM
DS_MAD : -
TM_MAD : ssh
BASE PATH : /var/lib/one/datastores/104
DISK_TYPE : FILE

DATASTORE CAPACITY
TOTAL: : -
FREE: : -
USED: : -
LIMIT: : -

PERMISSIONS
OWNER : um-
GROUP : u--
OTHER : ---

DATASTORE TEMPLATE
BASE_PATH="/var/lib/one/datastores/"
DATASTORE_CAPACITY_CHECK="NO"
SHARED="NO"
TM_MAD="ssh"
TYPE="SYSTEM_DS"

I have two types of images. Persistent images use the LINK primitive to get started
and run directly off the NFS store. They migrate just fine.

Non-persistent images use the CLONE primitive to make a copy on local
disk, under /var/lib/one/datastores/104 on each VM host; this directory is
not shared between the hosts. onevm migrate does not work for this type
of VM.

Error is:
ERROR="Tue Apr 14 16:12:34 2015 : Error restoring VM: Could not restore from /var/lib/one//datastores/104/1669/checkpoint"

At this point the VM goes to the FAILED state and is unrecoverable.
[root@fclheadgpvm01 init]# cat /var/log/one/1669.log-20150415
Tue Apr 14 16:10:56 2015 [Z0][LCM][I]: New VM state is SAVE_MIGRATE
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Tue Apr 14 16:11:06 2015 [Z0][LCM][I]: New VM state is PROLOG_MIGRATE
Tue Apr 14 16:12:33 2015 [Z0][LCM][I]: New VM state is BOOT
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore '/var/lib/one//datastores/104/1669/checkpoint' 'fcl321' 'one-1669' 1669 fcl321
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/104/1669/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: error: Unable to read from monitor: Connection reset by peer
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: ExitCode: 1
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][DiM][I]: New VM state is FAILED
Tue Apr 14 16:35:10 2015 [Z0][DiM][I]: New VM state is DONE.
Tue Apr 14 16:35:11 2015 [Z0][TM][W]: Ignored: TRANSFER SUCCESS 1669 -

Any suggestions on how to make this work?
I know I could create a second Image store that is ssh-based but would like to do it
with just one image store if possible.

Steve Timm

Under OpenNebula 3 we were able to do migrations of this kind, but we had to
hack the transfer manager scripts to make them work.

jfontan · April 22, 2015, 10:14am

The non-live migrate should work with the ssh drivers as is. Can you check the file /var/log/libvirt/qemu/one-1669.log to get more information on what failed? The virsh command is not giving enough information.

timm · April 22, 2015, 1:21pm

Javi, the contents of 1669.log were appended to the original forum post.
The ssh drivers should work OK as you say, but in an NFS clone situation like this I believe the
ssh drivers are not being used.

I think what you need are the contents of transfer.1.migrate, which I include
below.

[root@fclheadgpvm01 1669]# cat transfer.1.migrate

MV shared fcl413:/var/lib/one//datastores/104/1669/disk.0 fcl321:/var/lib/one//datastores/104/1669/disk.0 1669 102

MV ssh fcl413:/var/lib/one//datastores/104/1669 fcl321:/var/lib/one//datastores/104/1669 1669 104
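
Each line of the transfer file names the TM driver that handles it, so one quick way to see which driver moves what is to print the second field alongside the paths. This is only a sketch: the helper name `list_mv_drivers` is made up for illustration, and the field layout (action, driver, source, destination) is assumed from the two lines shown above.

```shell
#!/bin/bash
# Hypothetical helper: summarize the MV actions in a transfer file.
# Assumed field layout, taken from the two lines in this post:
#   MV <tm_driver> <source> <destination> <vm_id> <datastore_id>
list_mv_drivers() {
    awk '$1 == "MV" { printf "%s: %s -> %s\n", $2, $3, $4 }' "$1"
}
```

Called as `list_mv_drivers transfer.1.migrate`, it should print one `shared:` line for disk.0 and one `ssh:` line for the VM directory.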

In any case, the reason that virsh failed to restore from the checkpoint file is that
the checkpoint file is not there; it did not get copied from node A to node B.

Steve Timm

timm · April 22, 2015, 2:23pm

I tried again and I think I see the problem, but not how to fix it.

All the files were indeed copied from the local /var/lib/one/datastores/104/
on one node to the same local directory on the other node, but they have the wrong permissions.

Before:

[root@fcl002 1853]# ls -lrt
total 2614824
-rw-r--r-- 1 oneadmin oneadmin 382976 Apr 22 08:55 disk.1
lrwxrwxrwx 1 oneadmin oneadmin 39 Apr 22 08:55 disk.1.iso -> /var/lib/one/datastores/104/1853/disk.1
-rw-r--r-- 1 oneadmin oneadmin 1208 Apr 22 08:55 deployment.0
-rw-rw-r-- 1 oneadmin oneadmin 2469986304 Apr 22 09:09 disk.0
-rw-rw-rw- 1 oneadmin oneadmin 207200951 Apr 22 09:09 checkpoint

After:

bash-4.1$ pwd
/var/lib/one/datastores/104/1853
bash-4.1$ ls -l
total 2614824
-rw-r--r-- 1 oneadmin oneadmin 207200951 Apr 22 09:09 checkpoint
-rw-r--r-- 1 oneadmin oneadmin 1208 Apr 22 08:55 deployment.0
-rw-r--r-- 1 oneadmin oneadmin 2469986304 Apr 22 09:09 disk.0
-rw-r--r-- 1 oneadmin oneadmin 382976 Apr 22 08:55 disk.1
lrwxrwxrwx 1 oneadmin oneadmin 39 Apr 22 09:10 disk.1.iso -> /var/lib/one/datastores/104/1853/disk.1

And thus the error we are getting is “permission denied”:

bash-4.1$ virsh --connect qemu:///system restore /var/lib/one/datastores/104/1853/checkpoint
error: Failed to restore domain from /var/lib/one/datastores/104/1853/checkpoint
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/6
2015-04-22T14:16:41.443058Z qemu-kvm: -drive file=/var/lib/one//datastores/104/1853/disk.0,if=none,id=drive-virtio-disk0,format=qcow2: could not open disk image /var/lib/one//datastores/104/1853/disk.0: Permission denied

Basically, the files lose the permissions needed to launch the VM when they get scp'ed from
one node to the other. Even modifying the ssh driver to do an scp -p
would work. If I manually change the permissions of the disk.0 and checkpoint files back,
I can manually restart the VM.

bash-4.1$ chmod 666 disk.0
bash-4.1$ chmod 666 checkpoint
bash-4.1$ virsh --connect qemu:///system restore /var/lib/one/datastores/104/1853/checkpoint
Domain restored from /var/lib/one/datastores/104/1853/checkpoint

What to do? Why are the permissions being lost?

I remind you that my qemu.conf has
dynamic_ownership = 0
so all my VMs run as the qemu user, which is part of the
oneadmin group. That is why the 666 and 664 permissions are necessary at the moment.
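
With dynamic_ownership = 0, qemu reaches the files through the oneadmin group, so the group bits decide whether disk.0 can be opened. A minimal sketch of a check for that, assuming bash and GNU stat (`stat -c`); the function name `group_can_write` is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: succeed if FILE has the group write bit set.
# Assumes GNU coreutils stat and bash arithmetic.
group_can_write() {
    local mode
    mode=$(stat -c '%a' "$1") || return 2   # octal mode, e.g. 664
    # Shift away the "other" digit, then test the write bit (value 2)
    # of the group digit.
    (( (8#$mode >> 3) & 2 ))
}
```

Run from the VM directory, something like `for f in disk.0 checkpoint; do group_can_write "$f" || echo "$f: not group-writable"; done` would flag the two files that had to be fixed by hand.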

atodorov_storpool · April 22, 2015, 2:29pm

It looks like the oneadmin user has umask 0022 on the destination host.
Changing the umask to 0002 should fix the permissions.

atodorov_storpool · April 22, 2015, 2:32pm

Forgot to paste an example:

[oneadmin@s04 datastore]$ umask
0002
[oneadmin@s04 datastore]$ umask 0022
[oneadmin@s04 datastore]$ umask
0022
[oneadmin@s04 datastore]$ :>test
[oneadmin@s04 datastore]$ ls -la test
-rw-r--r-- 1 oneadmin oneadmin 0 Apr 22 17:27 test
[oneadmin@s04 datastore]$ umask 0002
[oneadmin@s04 datastore]$ :>test2
[oneadmin@s04 datastore]$ ls -la test*
-rw-r--r-- 1 oneadmin oneadmin 0 Apr 22 17:27 test
-rw-rw-r-- 1 oneadmin oneadmin 0 Apr 22 17:28 test2
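
The result above follows from simple arithmetic: a new regular file is requested with mode 0666, and the bits set in the umask are cleared. A small sketch of that calculation in bash (the function name `new_file_mode` is made up for illustration):

```shell
#!/bin/bash
# Hypothetical helper: print the mode a newly created regular file gets
# under the given umask, i.e. 0666 with the umask bits cleared.
new_file_mode() {
    printf '%03o\n' $(( 8#666 & ~8#$1 ))
}

new_file_mode 0022   # 644: group cannot write
new_file_mode 0002   # 664: group can write
```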

timm · April 22, 2015, 2:56pm

Hi Anton, I checked, and on all hosts in question the umask is already 0002
for the oneadmin user.
-bash-4.1$ echo "this is test" >> foo
-bash-4.1$ ls -al foo
-rw-rw-r-- 1 oneadmin oneadmin 14 Apr 22 09:53 foo
-bash-4.1$ hostname
fcl013.fnal.gov
-bash-4.1$ scp foo oneadmin@fcl002:~/foo002
foo 100% 14 0.0KB/s 00:00
-bash-4.1$ umask
0002

atodorov_storpool · April 22, 2015, 3:13pm

I checked out the one-4.8 sources to clarify. This is the exact command used to move the system datastore:

TAR_COPY="$SSH $SRC_HOST '$TAR -C $SRC_DS_DIR --sparse -cf - $SRC_VM_DIR'"
TAR_COPY="$TAR_COPY | $SSH $DST_HOST '$TAR -C $DST_DIR --sparse -xf -'"

exec_and_log "eval $TAR_COPY" "Error copying disk directory to target host"

All files, with their attributes, are packed in a tar container that is passed to the destination host.
I am almost sure that umask 0022 is enforced somewhere along the way. I cannot see another explanation for the change in attributes.

What OSes are the front-end and the nodes running?

BR,
Anton Todorov

timm · April 22, 2015, 3:38pm

I looked a bit more carefully. In general the umask differs between a login shell
and a non-login shell. Witness this from one of my VM hosts, fcl013:

[timm@snowball imaging]$ ssh oneadmin@fcl013 umask
0022

[timm@snowball imaging]$ ssh oneadmin@fcl013
Last login: Wed Apr 22 10:21:08 2015 from 131.225.80.124

-bash-4.1$ umask
0002

So now all I have to figure out is why the system is setting the umask for the non-interactive
shell to 0022. In the case of an scp, does either bashrc or profile get executed?
/etc/profile is clearly for the login shell.
This code in /etc/bashrc seems to indicate that for user oneadmin (uid 44897, gid 10040,
oneadmin:oneadmin) the umask should be set to 0002, but it does not appear to be
doing so.

if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
   umask 002
else
   umask 022
fi

timm · April 22, 2015, 3:40pm

After further tests it is clear that scp does not even source /etc/bashrc.
So what are my options to get a umask that scp will respect?

Per the earlier question: my VM hosts are running Scientific Linux 6, which is an open-source
rebuild of RHEL 6.

Steve Timm

timm · April 22, 2015, 3:44pm

PS: I see from the above that OpenNebula is not doing an scp, but the same
issue applies whether it is doing a file-by-file scp copy or a non-interactive
ssh session. The umask is not getting set to the value we want because neither /etc/profile
nor /etc/bashrc is being executed. Suggestions?

atodorov_storpool · April 22, 2015, 4:05pm

A temporary solution is to add the 'p' option to the tar extract command in /var/lib/one/remotes/tm/ssh/mv:

TAR_COPY="$SSH $SRC_HOST '$TAR -C $SRC_DS_DIR --sparse -cf - $SRC_VM_DIR'"
TAR_COPY="$TAR_COPY | $SSH $DST_HOST '$TAR -C $DST_DIR --sparse -xpf -'"

from tar man page:

  -p, --preserve-permissions
         extract  information  about  file permissions (default for superuser)

example test scenario:

[oneadmin@s04 ~]$ umask 0022
[oneadmin@s04 ~]$ tar -xf tarfile
[oneadmin@s04 ~]$ ls -la test
total 8
drwxr-xr-x 2 oneadmin oneadmin 17 Apr 22 18:55 .
drwxr-x--- 12 oneadmin oneadmin 4096 Apr 22 18:57 ..
-rw-r--r-- 1 oneadmin oneadmin 30 Apr 22 18:55 file
[oneadmin@s04 ~]$ tar -xpf tarfile
[oneadmin@s04 ~]$ ls -la test
total 8
drwxrwxr-x 2 oneadmin oneadmin 17 Apr 22 18:55 .
drwxr-x--- 12 oneadmin oneadmin 4096 Apr 22 18:57 ..
-rw-rw-r-- 1 oneadmin oneadmin 30 Apr 22 18:55 file
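
For anyone who wants to reproduce this without an existing tarfile, the scenario can be sketched as a self-contained script. It assumes GNU tar and GNU stat; the function name `extract_mode` and the file name used inside the archive are only for illustration. The non-preserving case passes `--no-same-permissions` explicitly, which is tar's default for ordinary users, so the script behaves the same when run as root.

```shell
#!/bin/bash
# Sketch: pack a world-writable file, then extract it under umask 0022
# with and without permission preservation. Assumes GNU tar and GNU stat.
extract_mode() {
    # $1 = "preserve" extracts with --preserve-permissions (same as -p);
    # anything else extracts with --no-same-permissions, the default
    # for ordinary users such as oneadmin.
    local opt=--no-same-permissions work
    if [ "$1" = preserve ]; then
        opt=--preserve-permissions
    fi
    work=$(mktemp -d) || return 1
    (
        mkdir "$work/src" "$work/dst"
        : > "$work/src/checkpoint"
        chmod 666 "$work/src/checkpoint"
        tar -C "$work/src" -cf "$work/archive.tar" checkpoint
        umask 0022    # the value the non-interactive ssh shell reported
        tar -C "$work/dst" "$opt" -xf "$work/archive.tar"
        stat -c '%a' "$work/dst/checkpoint"   # print the resulting octal mode
    )
    rm -rf "$work"
}

extract_mode default    # umask applied to the archived mode
extract_mode preserve   # archived mode kept
```

On a GNU system the two calls should print 644 and 666 respectively, which matches the change seen on the checkpoint file after migration.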

BR,
Anton Todorov

timm · June 3, 2015, 5:04pm

One followup question:
Is the right file to patch
/var/lib/one/remotes/tm/ssh/mv
?

Or is there one file outside of remotes, somewhere, that we have to patch as well?

atodorov_storpool · June 3, 2015, 7:07pm

Hi,

I do not see other places where $TAR is used.

Another way is to patch the $TAR definition, which is located in scripts_common.sh, but I can find that file in two places:

/usr/lib/one/sh/scripts_common.sh
/var/lib/one/remotes/scripts_common.sh

I would bet on the second one, but 'tm/ssh/mv' looks fine anyway. Remember to reapply the fix after each OpenNebula update/upgrade, though.
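
Since an upgrade can silently overwrite the script, a small check may help catch a lost patch. This is a sketch only: the function name `mv_is_patched` is hypothetical, and it simply looks for the `-xpf -` string from the patched line.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given TM script still extracts with -p.
# The default path is the one named in this thread.
mv_is_patched() {
    local script=${1:-/var/lib/one/remotes/tm/ssh/mv}
    # -F: fixed string; --: end of options, since the pattern starts with "-"
    grep -qF -- '-xpf -' "$script"
}
```

After an upgrade, `mv_is_patched || echo "tm/ssh/mv lost the -p patch"` would print a warning if the stock script is back in place.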

Best regards,
Anton

timm · June 3, 2015, 7:15pm

Thanks Anton…
Modifying tm/ssh/mv worked. I am able to migrate machines again now.
I don’t anticipate migrating out of OpenNebula 4.8 or upgrading
within the next year or two.

Steve Timm
