DRBD datastore live migration fails

I am using Debian 9, OpenNebula 5.4.6 and drbdadm 9.2.0 with the DRBD driver from GitHub. Everything else works correctly: I can download images and start VMs on the DRBD datastore. However, when I try to migrate a VM (live or non-live), it fails. Here is the output:

onedatastore show 107:

DATASTORE 107 INFORMATION                                                       
ID             : 107                 
NAME           : drbdmanage_redundant
USER           : oneadmin            
GROUP          : oneadmin            
CLUSTERS       : 0,100               
TYPE           : IMAGE               
DS_MAD         : drbdmanage          
TM_MAD         : drbdmanage          
BASE PATH      : /var/lib/one//datastores/107
DISK_TYPE      : FILE                
STATE          : READY               

DATASTORE CAPACITY                                                              
TOTAL:         : 3.8T                
FREE:          : 3.7T                
USED:          : 0M                  
LIMIT:         : -                   

PERMISSIONS                                                                     
OWNER          : um-                 
GROUP          : u--                 
OTHER          : ---                 

DATASTORE TEMPLATE                                                              
ALLOW_ORPHANS="NO"
BRIDGE_LIST="virt1 virt2"
CLONE_TARGET="SELF"
DISK_TYPE="FILE"
DRBD_REDUNDANCY="2"
DRBD_SUPPORT_LIVE_MIGRATION="yes"
DS_MAD="drbdmanage"
LN_TARGET="NONE"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
TM_MAD="drbdmanage"

IMAGES         
32  

Log for the live migration:

Tue Jan 30 15:14:39 2018 [Z0][VM][I]: New LCM state is RUNNING
Tue Jan 30 15:19:25 2018 [Z0][VM][I]: New LCM state is MIGRATE
Tue Jan 30 15:19:25 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_premigrate.
Tue Jan 30 15:19:25 2018 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Jan 30 15:19:26 2018 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/migrate 'one-38' 'virt1' 'virt2' 38 virt2
Tue Jan 30 15:19:26 2018 [Z0][VMM][E]: migrate: Command "virsh --connect qemu:///system migrate --live one-38 qemu+ssh://virt1/system" failed: error: Cannot access storage file '/var/lib/one//datastores/0/38/disk.1' (as uid:9869, gid:9869): No such file or directory
Tue Jan 30 15:19:26 2018 [Z0][VMM][E]: Could not migrate one-38 to virt1
Tue Jan 30 15:19:26 2018 [Z0][VMM][I]: ExitCode: 1
Tue Jan 30 15:19:26 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_failmigrate.
Tue Jan 30 15:19:26 2018 [Z0][VMM][I]: Failed to execute virtualization driver operation: migrate.
Tue Jan 30 15:19:26 2018 [Z0][VMM][E]: Error live migrating VM: Could not migrate one-38 to virt1
Tue Jan 30 15:19:26 2018 [Z0][VM][I]: New LCM state is RUNNING
Tue Jan 30 15:19:26 2018 [Z0][LCM][I]: Fail to live migrate VM. Assuming that the VM is still RUNNING (will poll VM).

Log for the non-live migration:

Tue Jan 30 15:21:10 2018 [Z0][VM][I]: New LCM state is SAVE_MIGRATE
Tue Jan 30 15:21:12 2018 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/save: line 58: warning: command substitution: ignored null byte in input
Tue Jan 30 15:21:12 2018 [Z0][VMM][I]: ExitCode: 0
Tue Jan 30 15:21:12 2018 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Tue Jan 30 15:21:12 2018 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Tue Jan 30 15:21:12 2018 [Z0][VM][I]: New LCM state is PROLOG_MIGRATE
Tue Jan 30 15:21:23 2018 [Z0][VM][I]: New LCM state is BOOT_MIGRATE
Tue Jan 30 15:21:23 2018 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Tue Jan 30 15:21:23 2018 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Jan 30 15:21:24 2018 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/restore '/var/lib/one//datastores/0/38/checkpoint' 'virt1' 'one-38' 38 virt1
Tue Jan 30 15:21:24 2018 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/restore: line 43: warning: command substitution: ignored null byte in input
Tue Jan 30 15:21:24 2018 [Z0][VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/0/38/checkpoint --xml /var/lib/one//datastores/0/38/checkpoint.xml" failed: error: Failed to restore domain from /var/lib/one//datastores/0/38/checkpoint
Tue Jan 30 15:21:24 2018 [Z0][VMM][I]: error: Cannot access storage file '/var/lib/one//datastores/0/38/disk.0' (as uid:9869, gid:9869): No such file or directory
Tue Jan 30 15:21:24 2018 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/0/38/checkpoint
Tue Jan 30 15:21:24 2018 [Z0][VMM][I]: ExitCode: 1
Tue Jan 30 15:21:24 2018 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Tue Jan 30 15:21:24 2018 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/38/checkpoint
Tue Jan 30 15:21:24 2018 [Z0][VM][I]: New LCM state is BOOT_MIGRATE_FAILURE

After the live migration attempt, I inspected the destination server: the directory “/var/lib/one/datastore/0/38” does not exist there.

After the non-live migration attempt, I inspected the destination server again. This time the directory “/var/lib/one/datastore/38” was created, and it contains files, including disk.1:

-rw-r--r-- 1 oneadmin oneadmin 185040260 janv. 30 15:21 checkpoint
-rw-r--r-- 1 oneadmin oneadmin      2119 janv. 30 15:21 checkpoint.xml
-rw-r--r-- 1 oneadmin oneadmin       862 janv. 30 15:14 deployment.0
-rw-r--r-- 1 oneadmin oneadmin    372736 janv. 30 15:21 disk.1

However, the “disk.0” link to the DRBD device was not created.

Can you provide any insight?

Thank you in advance.

I have solved this issue myself. The problem was that the system datastore was not shared. To fix it, I followed these steps:

  • export /var/lib/one as an NFS share on the controller
  • mount the /var/lib/one share from the controller on all nodes
  • delete the system datastore and recreate it with the “Filesystem shared” backend mode
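
For anyone who wants to script it, the steps above can be sketched roughly as follows. This is only an outline, not the exact commands I ran: the host name "controller", the datastore name "nfs_system" and the NFS export options are assumptions, and virt1/virt2 are taken from the BRIDGE_LIST shown earlier. By default the script only prints the commands (DRY_RUN=1); the real commands need root privileges on the respective hosts, and deleting the old system datastore is left out because the details depend on your setup.

```shell
#!/bin/sh
# Sketch only: prints commands unless DRY_RUN=0 is set.
# "controller" and "nfs_system" are hypothetical names; adjust to your setup.
DRY_RUN="${DRY_RUN:-1}"
CONTROLLER="${CONTROLLER:-controller}"
NODES="${NODES:-virt1 virt2}"
SHARE="/var/lib/one"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build the /etc/exports entry: one client spec per node.
# The export options are an assumption, not taken from my configuration.
exports_line() {
    line="$SHARE"
    for node in $NODES; do
        line="$line $node(rw,sync,no_subtree_check,root_squash)"
    done
    echo "$line"
}

# Template for a system datastore that uses the shared transfer driver.
system_ds_template() {
    cat <<EOF
NAME   = nfs_system
TYPE   = SYSTEM_DS
TM_MAD = shared
EOF
}

# 1. On the controller: add this line to /etc/exports, then reload the exports.
echo "# add to /etc/exports on $CONTROLLER:"
exports_line
run exportfs -ra

# 2. On every node: mount the share at the same path.
for node in $NODES; do
    run ssh "$node" mount -t nfs "$CONTROLLER:$SHARE" "$SHARE"
done

# 3. On the controller: create the shared system datastore from the template.
tpl="$(mktemp)"
system_ds_template > "$tpl"
run onedatastore create "$tpl"
```

The oneadmin user presumably needs the same uid/gid on the controller and on all nodes for the shared files to be accessible, and the new system datastore has to be in the same cluster as the hosts.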

Migration and live migration now work, because all the system datastore disk links are present on all nodes.

I hope this helps others.