In my OpenNebula 4.8 installation I have one Image datastore on NFS and one System datastore that is on local disk on each host.
[root@fclheadgpvm01 init]# onedatastore show 102
DATASTORE 102 INFORMATION
ID : 102
NAME : cloud_images
USER : oneadmin
GROUP : oneadmin
CLUSTER : -
TYPE : IMAGE
DS_MAD : fs
TM_MAD : shared
BASE PATH : /var/lib/one/datastores/102
DISK_TYPE : FILE
[root@fclheadgpvm01 init]# onedatastore show 104
DATASTORE 104 INFORMATION
ID : 104
NAME : localnode_fcl
USER : oneadmin
GROUP : oneadmin
CLUSTER : fcl
TYPE : SYSTEM
DS_MAD : -
TM_MAD : ssh
BASE PATH : /var/lib/one/datastores/104
DISK_TYPE : FILE
I have two types of images. Persistent images use the LINK primitive to get started and run directly off the NFS store; they migrate just fine. Non-persistent images use the CLONE primitive to make a copy on local disk under /var/lib/one/datastores/104 on each VM host, and that directory is not shared between the hosts. onevm migrate does not work on this type of VM.
Error is:
ERROR="Tue Apr 14 16:12:34 2015 : Error restoring VM: Could not restore from /var/lib/one//datastores/104/1669/checkpoint"
At this point the VM goes to the FAILED state and is unrecoverable.
[root@fclheadgpvm01 init]# cat /var/log/one/1669.log-20150415
Tue Apr 14 16:10:56 2015 [Z0][LCM][I]: New VM state is SAVE_MIGRATE
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:11:06 2015 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Tue Apr 14 16:11:06 2015 [Z0][LCM][I]: New VM state is PROLOG_MIGRATE
Tue Apr 14 16:12:33 2015 [Z0][LCM][I]: New VM state is BOOT
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: ExitCode: 0
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore '/var/lib/one//datastores/104/1669/checkpoint' 'fcl321' 'one-1669' 1669 fcl321
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/104/1669/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: error: Unable to read from monitor: Connection reset by peer
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: ExitCode: 1
Tue Apr 14 16:12:34 2015 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Tue Apr 14 16:12:34 2015 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/104/1669/checkpoint
Tue Apr 14 16:12:34 2015 [Z0][DiM][I]: New VM state is FAILED
Tue Apr 14 16:35:10 2015 [Z0][DiM][I]: New VM state is DONE.
Tue Apr 14 16:35:11 2015 [Z0][TM][W]: Ignored: TRANSFER SUCCESS 1669 -
Any suggestions on how to make this work?
I know I could create a second, ssh-based Image datastore, but I would like to do it with just one image store if possible.
Steve Timm
Under OpenNebula 3 we were able to do migrations of this kind, but we had to hack the transfer manager scripts to make it work.
The non-live migrate should work with the ssh drivers as-is. Can you check the file /var/log/libvirt/qemu/one-1669.log to get more information on what failed? The virsh command is not giving enough information.
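For example, something like this run against the host where the restore was attempted (fcl321, per the log above; the file is usually only readable by root):

ssh root@fcl321 tail -n 50 /var/log/libvirt/qemu/one-1669.log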
Javi: the contents of the 1669.log were appended to the original forum post.
The ssh drivers should work OK, as you say, but in an NFS clone situation like this I believe the ssh drivers are not being used.
I think what you need are the contents of transfer.1.migrate; I include it below.
In any case, the reason the virsh restore failed to restore from the checkpoint file is that the checkpoint file is not there; it did not get copied from node A to node B.
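A quick way to confirm that is to list the VM directory on the destination node (fcl321 in the failed migration above):

ssh oneadmin@fcl321 ls -l /var/lib/one/datastores/104/1669/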
I tried again and I think I see the problem, but not how to fix it.
All the files were indeed copied from the local /var/lib/one/datastores/104/ on one node to the same local directory on the other node, but they arrive with the wrong permissions, and thus the error we are getting is a "permission denied":
bash-4.1$ virsh --connect qemu:///system restore /var/lib/one/datastores/104/1853/checkpoint
error: Failed to restore domain from /var/lib/one/datastores/104/1853/checkpoint
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/6
2015-04-22T14:16:41.443058Z qemu-kvm: -drive file=/var/lib/one//datastores/104/1853/disk.0,if=none,id=drive-virtio-disk0,format=qcow2: could not open disk image /var/lib/one//datastores/104/1853/disk.0: Permission denied
Basically the files are losing the permissions needed to launch when they get scp'ed from one node to the other. Even modifying the ssh driver to do an scp -p would probably work. If I change the permissions of the disk.0 and checkpoint files manually, I can restart the VM by hand.
I remind you that my qemu.conf has
dynamic_ownership = 0
so all my VMs run as the qemu user, which is part of the oneadmin group; that's why the 666 and 664 permissions are necessary at the moment.
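For completeness, the manual workaround on the destination node looks roughly like this (group read/write is what matters here, per the 664/666 note above; 1853 is the test VM from the virsh output):

chmod 664 /var/lib/one/datastores/104/1853/disk.0       # group rw so the qemu user (oneadmin group) can write the disk
chmod 664 /var/lib/one/datastores/104/1853/checkpoint   # group read is enough for the restore
virsh --connect qemu:///system restore /var/lib/one/datastores/104/1853/checkpoint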
Hi Anton: I checked, and on all the hosts in question the umask is already 0002 for the oneadmin user.
-bash-4.1$ echo "this is test" >> foo
-bash-4.1$ ls -al foo
-rw-rw-r-- 1 oneadmin oneadmin 14 Apr 22 09:53 foo
-bash-4.1$ hostname
fcl013.fnal.gov
-bash-4.1$ scp foo oneadmin@fcl002:~/foo002
foo                                          100%   14     0.0KB/s   00:00
-bash-4.1$ umask
0002
exec_and_log "eval $TAR_COPY" "Error copying disk directory to target host"
All files, with their attributes, are packed into a tar container which is passed to the destination host. I am almost sure that a umask of 0022 is being enforced somewhere along the way; I cannot see another explanation for the attribute change.
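For reference, the copy in the stock ssh driver is essentially a tar stream piped between the two hosts over ssh; a simplified sketch (not the literal script, variable names shortened) is:

# sketch of the tar-over-ssh copy used by the ssh transfer driver
ssh "$SRC_HOST" "tar -C $SRC_DIR --sparse -cf - $VM_DIR_NAME" | \
    ssh "$DST_HOST" "tar -C $DST_DIR --sparse -xf -"

GNU tar extracting as an ordinary user applies that shell's umask to the stored modes by default (--no-same-permissions); only -p/--preserve-permissions keeps them. So a 0022 umask in the non-interactive shell on the destination would strip the group write bit exactly as observed.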
[timm@snowball imaging]$ ssh oneadmin@fcl013
Last login: Wed Apr 22 10:21:08 2015 from 131.225.80.124
-bash-4.1$ umask
0002
So now all I have to figure out is why the system is setting the umask for the non-interactive shell to 0022. Do we know whether, in the case of an scp, either bashrc or profile gets executed?
/etc/profile is clearly for the login shell.
This code in /etc/bashrc seems to indicate that for the user oneadmin (uid 44897, gid 10040, oneadmin:oneadmin) the umask should be set to 0002, but it does not appear to be doing so.
if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
    umask 002
else
    umask 022
fi
PS: I see from the above that OpenNebula is not doing an scp, but the same issue still applies whether it is doing a file-by-file scp copy or a non-interactive ssh session: the umask is not getting set to the value we want because neither /etc/profile nor /etc/bashrc is being executed. Suggestions?
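A quick way to see what a non-interactive command actually gets (if it reports 0022 rather than 0002, that confirms the startup files are not being applied to it):

ssh oneadmin@fcl002 umask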
Thanks Anton…
Modifying tm/ssh/mv worked; I am able to migrate machines again now.
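For anyone hitting the same thing: the change amounts to making sure the files on the destination come out group-readable/writable again after the copy. A sketch of the idea against /var/lib/one/remotes/tm/ssh/mv (not my exact diff; DST_HOST and DST_PATH are the variable names I recall from the stock script):

# after the tar copy over ssh completes, restore group access so the qemu
# user (a member of the oneadmin group) can open the disk and checkpoint files
ssh "$DST_HOST" "chmod -R g+rw $DST_PATH"

The other way to do it would be to add -p to the extracting tar so it keeps the packed modes instead of applying the remote shell's 0022 umask.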
I don't anticipate migrating off of OpenNebula 4.8, or upgrading, within the next year or two.