Onevm migrate and cloned images

In my OpenNebula 4.8 installation I have one Image datastore on NFS and one System datastore which
is local to the disk.

[root@fclheadgpvm01 init]# onedatastore show 102
DATASTORE 102 INFORMATION
ID : 102
NAME : cloud_images
USER : oneadmin
GROUP : oneadmin
CLUSTER : -
TYPE : IMAGE
DS_MAD : fs
TM_MAD : shared
BASE PATH : /var/lib/one/datastores/102
DISK_TYPE : FILE

DATASTORE CAPACITY
TOTAL: : 20T
FREE: : 11.8T
USED: : 1.1T
LIMIT: : -

PERMISSIONS
OWNER : um-
GROUP : u--
OTHER : ---

DATASTORE TEMPLATE
BASE_PATH="/var/lib/one/datastores/"
CLONE_TARGET="SYSTEM"
DATASTORE_CAPACITY_CHECK="NO"
DISK_TYPE="FILE"
DS_MAD="fs"
LN_TARGET="NONE"
TM_MAD="shared"
TYPE="IMAGE_DS"

[root@fclheadgpvm01 init]# onedatastore show 104
DATASTORE 104 INFORMATION
ID : 104
NAME : localnode_fcl
USER : oneadmin
GROUP : oneadmin
CLUSTER : fcl
TYPE : SYSTEM
DS_MAD : -
TM_MAD : ssh
BASE PATH : /var/lib/one/datastores/104
DISK_TYPE : FILE

DATASTORE CAPACITY
TOTAL: : -
FREE: : -
USED: : -
LIMIT: : -

PERMISSIONS
OWNER : um-
GROUP : u--
OTHER : ---

DATASTORE TEMPLATE
BASE_PATH="/var/lib/one/datastores/"
DATASTORE_CAPACITY_CHECK="NO"
SHARED="NO"
TM_MAD="ssh"
TYPE="SYSTEM_DS"

I have two types of image. Persistent images use the LINK primitive to get started,
and they run directly off the NFS store. They can migrate just fine.

Non-persistent images use the CLONE primitive to make a copy on local
disk, under /var/lib/one/datastores/104 on each VM host; that directory is
not shared between the various VM hosts. onevm migrate does not work on
this type of VM.

Error is:
ERROR="Tue Apr 14 16:12:34 2015 : Error restoring VM: Could not restore from /var/lib/one//datastores/104/1669/checkpoint"

At this point the VM goes to the FAILED state and is unrecoverable.
[root@fclheadgpvm01 init]# cat /var/log/one/1669.log-20150415
Tue Apr 14 16:10:56 2015 [Z0][LCM][I]: New VM state is SAVE_MIGRATE
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Tue Apr 14 16:11:06 2015 [Z0][LCM][I]: New VM state is PROLOG_MIGRATE
Tue Apr 14 16:12:33 2015 [Z0][LCM][I]: New VM state is BOOT
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore '/var/lib/one//datastores/104/1669/checkpoint' 'fcl321' 'one-1669' 1669 fcl321
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/104/1669/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: error: Unable to read from monitor: Connection reset by peer
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: ExitCode: 1
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][DiM][I]: New VM state is FAILED
Tue Apr 14 16:35:10 2015 [Z0][DiM][I]: New VM state is DONE.
Tue Apr 14 16:35:11 2015 [Z0][TM][W]: Ignored: TRANSFER SUCCESS 1669 -


Any suggestions on how to make this work?
I know I could create a second Image store that is ssh-based but would like to do it
with just one image store if possible.

Steve Timm

Under OpenNebula 3 we were able to do migrations of this kind, but we had to
hack the transfer manager scripts to make it work.

The non-live migrate should work with the ssh drivers as is. Can you check the file /var/log/libvirt/qemu/one-1669.log to get more information on what failed? The virsh command is not giving enough information.

Javi, the contents of 1669.log were appended to the original forum post.
The ssh drivers should work as you say, but in an NFS clone situation like this I believe
the ssh drivers are not being used.

I think what you need are the contents of transfer.1.migrate, which I include
below.

[root@fclheadgpvm01 1669]# cat transfer.1.migrate

MV shared fcl413:/var/lib/one//datastores/104/1669/disk.0 fcl321:/var/lib/one//datastores/104/1669/disk.0 1669 102

MV ssh fcl413:/var/lib/one//datastores/104/1669 fcl321:/var/lib/one//datastores/104/1669 1669 104

In any case, the reason virsh failed to restore from the checkpoint file is that
the checkpoint file is not there; it did not get copied from node A to node B.

Steve Timm

I tried again and I think I see the problem, but not how to fix it.

All the files were indeed copied from the local /var/lib/one/datastores/104/
on one node to the same local directory on the other node, but they have the wrong permissions.

Before:

[root@fcl002 1853]# ls -lrt
total 2614824
-rw-r--r-- 1 oneadmin oneadmin 382976 Apr 22 08:55 disk.1
lrwxrwxrwx 1 oneadmin oneadmin 39 Apr 22 08:55 disk.1.iso -> /var/lib/one/datastores/104/1853/disk.1
-rw-r--r-- 1 oneadmin oneadmin 1208 Apr 22 08:55 deployment.0
-rw-rw-r-- 1 oneadmin oneadmin 2469986304 Apr 22 09:09 disk.0
-rw-rw-rw- 1 oneadmin oneadmin 207200951 Apr 22 09:09 checkpoint

After:

bash-4.1$ pwd
/var/lib/one/datastores/104/1853
bash-4.1$ ls -l
total 2614824
-rw-r--r-- 1 oneadmin oneadmin 207200951 Apr 22 09:09 checkpoint
-rw-r--r-- 1 oneadmin oneadmin 1208 Apr 22 08:55 deployment.0
-rw-r--r-- 1 oneadmin oneadmin 2469986304 Apr 22 09:09 disk.0
-rw-r--r-- 1 oneadmin oneadmin 382976 Apr 22 08:55 disk.1
lrwxrwxrwx 1 oneadmin oneadmin 39 Apr 22 09:10 disk.1.iso -> /var/lib/one/datastores/104/1853/disk.1

And thus the error we are getting is "Permission denied":

bash-4.1$ virsh --connect qemu:///system restore /var/lib/one/datastores/104/1853/checkpoint
error: Failed to restore domain from /var/lib/one/datastores/104/1853/checkpoint
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/6
2015-04-22T14:16:41.443058Z qemu-kvm: -drive file=/var/lib/one//datastores/104/1853/disk.0,if=none,id=drive-virtio-disk0,format=qcow2: could not open disk image /var/lib/one//datastores/104/1853/disk.0: Permission denied


Basically the files are losing the permissions they need when they get scp'ed from
one node to the other. Even modifying the ssh driver to do an scp -p
would work. If I manually change the permissions of the disk.0 and checkpoint files
back to what they were, I can manually restart the VM.

bash-4.1$ chmod 666 disk.0
bash-4.1$ chmod 666 checkpoint
bash-4.1$ virsh --connect qemu:///system restore /var/lib/one/datastores/104/1853/checkpoint
Domain restored from /var/lib/one/datastores/104/1853/checkpoint

What to do? Why are the permissions being lost?

I remind you that my qemu.conf has
dynamic_ownership = 0
so all my VMs are running as the qemu user, which is part of the
oneadmin group. That's why the 666 and 664 permissions are necessary at the moment.
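
The permission requirement described above can be checked mechanically. The following is a minimal sketch, assuming that the qemu process reaches these oneadmin-owned files through the group bits and needs both read and write on them; the function name group_can_rw is hypothetical and not part of OpenNebula or libvirt.

```shell
#!/bin/bash
# Hypothetical helper: succeed when an octal mode grants the group both
# read and write. Assumption: a qemu process in the oneadmin group needs
# exactly these bits on files owned by oneadmin.
group_can_rw() {
    local mode=$1
    # 060 is the group read bit (040) plus the group write bit (020).
    (( (8#$mode & 8#060) == 8#060 ))
}

# Modes seen in the listings above: 644 after the copy, 664 and 666 before.
for mode in 644 664 666; do
    if group_can_rw "$mode"; then
        echo "$mode: group read/write OK"
    else
        echo "$mode: group cannot write"
    fi
done
```

On a real host the mode could be taken from the file itself, for example with group_can_rw "$(stat -c %a disk.0)", assuming GNU stat is available.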

It looks like the oneadmin user has umask 0022 on the destination host.
Changing the umask to 0002 should fix the permissions.

Forgot to paste an example:

[oneadmin@s04 datastore]$ umask
0002
[oneadmin@s04 datastore]$ umask 0022
[oneadmin@s04 datastore]$ umask
0022
[oneadmin@s04 datastore]$ :>test
[oneadmin@s04 datastore]$ ls -la test
-rw-r--r-- 1 oneadmin oneadmin 0 Apr 22 17:27 test
[oneadmin@s04 datastore]$ umask 0002
[oneadmin@s04 datastore]$ :>test2
[oneadmin@s04 datastore]$ ls -la test*
-rw-r--r-- 1 oneadmin oneadmin 0 Apr 22 17:27 test
-rw-rw-r-- 1 oneadmin oneadmin 0 Apr 22 17:28 test2

Hi Anton, I checked, and on all hosts in question the umask is already 0002
for the oneadmin user.
-bash-4.1$ echo "this is test" >> foo
-bash-4.1$ ls -al foo
-rw-rw-r-- 1 oneadmin oneadmin 14 Apr 22 09:53 foo
-bash-4.1$ hostname
fcl013.fnal.gov
-bash-4.1$ scp foo oneadmin@fcl002:~/foo002
foo 100% 14 0.0KB/s 00:00
-bash-4.1$ umask
0002


I checked the one-4.8 sources to clarify. This is the exact command used to move the system datastore:

TAR_COPY="$SSH $SRC_HOST '$TAR -C $SRC_DS_DIR --sparse -cf - $SRC_VM_DIR'"
TAR_COPY="$TAR_COPY | $SSH $DST_HOST '$TAR -C $DST_DIR --sparse -xf -'"

exec_and_log "eval $TAR_COPY" "Error copying disk directory to target host"

All files, with their attributes, are packed into a tar stream which is passed to the destination host.
I am almost sure that umask 0022 is being enforced somewhere along the way. I cannot see another explanation for the change in attributes.

What OSes are running on the front-end and the nodes?

BR,
Anton Todorov

I looked a bit more carefully. In general the umask can differ between a login shell
and a non-login shell. Witness this from one of my VM hosts, fcl013:

[timm@snowball imaging]$ ssh oneadmin@fcl013 umask
0022

[timm@snowball imaging]$ ssh oneadmin@fcl013
Last login: Wed Apr 22 10:21:08 2015 from 131.225.80.124

-bash-4.1$ umask
0002

So now all I have to figure out is why the system is setting the umask for the non-interactive
shell to 0022. Do we know whether, in the case of an scp, either bashrc or profile gets executed?
/etc/profile is clearly for the login shell.
This code in /etc/bashrc seems to indicate that for user oneadmin (uid 44897, gid 10040,
oneadmin:oneadmin) the umask should be set to 0002, but it does not appear to be
doing so.

if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
   umask 002
else
   umask 022
fi
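
To see what that /etc/bashrc test would choose if it were actually run, the condition can be restated as a function that takes its inputs as arguments. This is only a sketch: pick_umask is a hypothetical name, and the user alice with primary group users is an invented example for comparison.

```shell
#!/bin/bash
# Hypothetical restatement of the /etc/bashrc test as a function, so the
# inputs can be supplied by hand instead of read from the running shell.
pick_umask() {
    local uid=$1 group_name=$2 user_name=$3
    if [ "$uid" -gt 199 ] && [ "$group_name" = "$user_name" ]; then
        echo 002
    else
        echo 022
    fi
}

# Values quoted in the post: uid 44897, oneadmin:oneadmin.
pick_umask 44897 oneadmin oneadmin
# A low uid fails the first test even though group and user names match.
pick_umask 0 root root
# An invented user whose primary group has a different name.
pick_umask 1000 users alice
```

For the oneadmin values this selects 002, so the 0022 seen over ssh would be consistent with the snippet never being executed in that session, rather than with the test choosing the other branch.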

After further tests it is clear that scp does not even source /etc/bashrc.
So what are my options to get a umask that scp will respect?

Per the earlier question: my VM hosts are running Scientific Linux 6, which is an open-source
rebuild of RHEL 6.

Steve Timm

PS: I see from the above that OpenNebula is not doing an scp, but the same
issue still applies whether it is doing a file-by-file scp copy or a non-interactive
ssh session. The umask is not getting set to the value we want because neither /etc/profile
nor /etc/bashrc is being executed. Suggestions?
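
One option that does not depend on any startup file is to set the umask inside the command string itself. The sketch below only illustrates the idea locally: bash -c stands in for the command that ssh would run on the remote host, the scratch directory is hypothetical, and nothing here has been tested against the OpenNebula drivers.

```shell
#!/bin/bash
# Sketch only: "bash -c" stands in for the command string that ssh would
# run on the remote host.
set -eu
scratch=$(mktemp -d)   # hypothetical scratch directory

# Start from the restrictive mask that the non-interactive session showed.
umask 0022

# Without an explicit umask the child inherits 0022.
bash -c "touch '$scratch/inherited'"

# With "umask 0002;" prepended, the result no longer depends on
# /etc/profile or /etc/bashrc having been sourced.
bash -c "umask 0002; touch '$scratch/explicit'"

mode_inherited=$(stat -c %a "$scratch/inherited")   # GNU stat assumed
mode_explicit=$(stat -c %a "$scratch/explicit")
echo "inherited: $mode_inherited, explicit: $mode_explicit"
rm -rf "$scratch"
```

Applied to this thread, that would mean a remote command along the lines of ssh host 'umask 0002; tar ...'. Note that a file archived as 666 would then be extracted as 664 rather than 666.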

A temporary solution is to add the 'p' option to the tar extract command in /var/lib/one/remotes/tm/ssh/mv:

TAR_COPY="$SSH $SRC_HOST '$TAR -C $SRC_DS_DIR --sparse -cf - $SRC_VM_DIR'"
TAR_COPY="$TAR_COPY | $SSH $DST_HOST '$TAR -C $DST_DIR --sparse -xpf -'"

from tar man page:

  -p, --preserve-permissions
         extract  information  about  file permissions (default for superuser)

example test scenario:

[oneadmin@s04 ~]$ umask 0022
[oneadmin@s04 ~]$ tar -xf tarfile
[oneadmin@s04 ~]$ ls -la test
total 8
drwxr-xr-x 2 oneadmin oneadmin 17 Apr 22 18:55 .
drwxr-x--- 12 oneadmin oneadmin 4096 Apr 22 18:57 ..
-rw-r--r-- 1 oneadmin oneadmin 30 Apr 22 18:55 file
[oneadmin@s04 ~]$ tar -xpf tarfile
[oneadmin@s04 ~]$ ls -la test
total 8
drwxrwxr-x 2 oneadmin oneadmin 17 Apr 22 18:55 .
drwxr-x--- 12 oneadmin oneadmin 4096 Apr 22 18:57 ..
-rw-rw-r-- 1 oneadmin oneadmin 30 Apr 22 18:55 file

BR,
Anton Todorov
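
The test scenario above can be reproduced end to end in a scratch directory. This is a minimal sketch assuming GNU tar and GNU stat; the file name checkpoint is only a stand-in, and --no-same-permissions is passed explicitly in the first extraction so that the result is the same when the script is run as root, where -p is the default.

```shell
#!/bin/bash
# Reproduce the permission loss with a stand-in file (hypothetical paths).
set -eu
work=$(mktemp -d)
mkdir "$work/src" "$work/plain" "$work/preserve"

# A stand-in for the checkpoint file, mode 666 like the original.
echo "dummy" > "$work/src/checkpoint"
chmod 666 "$work/src/checkpoint"
tar -C "$work/src" -cf "$work/vm.tar" checkpoint

# Extract under umask 0022 without preserving permissions. The command
# substitution runs in a subshell, so the umask change stays local to it.
mode_plain=$(umask 0022
    tar -C "$work/plain" --no-same-permissions -xf "$work/vm.tar"
    stat -c %a "$work/plain/checkpoint")

# Same umask, but with -p so the archived mode is restored as is.
mode_preserve=$(umask 0022
    tar -C "$work/preserve" -xpf "$work/vm.tar"
    stat -c %a "$work/preserve/checkpoint")

echo "without -p: $mode_plain"
echo "with -p:    $mode_preserve"
rm -rf "$work"
```

Under these assumptions the first extraction yields 644 and the second 666, which matches the before and after listings earlier in the thread.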

One followup question:
Is the right file to patch
/var/lib/one/remotes/tm/ssh/mv
?

Or is there one file outside of remotes, somewhere, that we have to patch as well?

Hi,

I do not see other places where $TAR is used.

Another way is to patch the $TAR definition in scripts_common.sh, but I can find that file in two places:

/usr/lib/one/sh/scripts_common.sh
/var/lib/one/remotes/scripts_common.sh

I would bet on the second one, but patching tm/ssh/mv looks fine anyway. Remember to reapply the fix after each OpenNebula update or upgrade, though.

Best Regards,
Anton
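
Since the change has to be reapplied after every update, it may help to keep it as a small script that can be run repeatedly. The following is a sketch under stated assumptions: patch_mv is a hypothetical helper, it matches the literal text of the extract line quoted in this thread, and it is demonstrated here on a scratch file rather than on the real /var/lib/one/remotes/tm/ssh/mv.

```shell
#!/bin/bash
# Hypothetical helper: add -p to the tar extract options in the given file.
# Running it again on an already patched file changes nothing.
patch_mv() {
    local file=$1
    if grep -q -- '--sparse -xpf -' "$file"; then
        echo "already patched: $file"
    elif grep -q -- '--sparse -xf -' "$file"; then
        # Keeps a .orig backup next to the file.
        sed -i.orig 's/--sparse -xf -/--sparse -xpf -/' "$file"
        echo "patched: $file"
    else
        echo "expected tar extract line not found in $file" >&2
        return 1
    fi
}

# Demonstrate on a scratch file holding the line quoted in this thread.
demo=$(mktemp)
cat > "$demo" <<'EOF'
TAR_COPY="$TAR_COPY | $SSH $DST_HOST '$TAR -C $DST_DIR --sparse -xf -'"
EOF
patch_mv "$demo"
patch_mv "$demo"
cat "$demo"
```

If a later release changes the wording of that line, the helper reports that the pattern was not found instead of silently doing nothing.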

Thanks Anton…
Modifying tm/ssh/mv worked. I am able to migrate machines again now.
I don’t anticipate migrating out of OpenNebula 4.8 or upgrading
within the next year or two.

Steve Timm