Ceph Issues with Migration Between Hypervisors

I’m having trouble getting Ceph datastores to work properly on my installation: OpenNebula 5.8.0-1 on CentOS 7, with a shared file system and three new Ceph datastores. Migrations between hosts fail when TM_MAD is set to “ceph”.

Here are my configurations:

Ceph SSD Datastore (101):

DATASTORE TEMPLATE                                                              
ALLOW_ORPHANS="mixed"
BRIDGE_LIST="hypervisor01 hypervisor02"
CEPH_HOST="ceph01 ceph02 ceph03 ceph04 ceph05 ceph06"
CEPH_SECRET="SECRET_KEY"
CEPH_USER="SECRET_USER"
DISK_TYPE="RBD"
DS_MIGRATE="NO"
POOL_NAME="one_ssd"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="ceph"
TM_MAD_SYSTEM="shared"
TYPE="SYSTEM_DS"

Ceph HDD Datastore (102):

DATASTORE TEMPLATE                                                              
ALLOW_ORPHANS="mixed"
BRIDGE_LIST="hypervisor01 hypervisor02"
CEPH_HOST="ceph01 ceph02 ceph03 ceph04 ceph05 ceph06"
CEPH_SECRET="SECRET_KEY"
CEPH_USER="SECRET_USER"
DISK_TYPE="RBD"
DS_MIGRATE="NO"
POOL_NAME="one_hdd"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="ceph"
TM_MAD_SYSTEM="shared"
TYPE="SYSTEM_DS"

Ceph Image Datastore (103):

DATASTORE TEMPLATE                                                              
ALLOW_ORPHANS="mixed"
BRIDGE_LIST="hypervisor01 hypervisor02"
CEPH_HOST="ceph01 ceph02 ceph03 ceph04 ceph05 ceph06"
CEPH_SECRET="SECRET_KEY"
CEPH_USER="SECRET_USER"
CLONE_TARGET="SYSTEM"
CLONE_TARGET_SHARED="SYSTEM"
CLONE_TARGET_SSH="SYSTEM"
DISK_TYPE="RBD"
DISK_TYPE_SHARED="RBD"
DISK_TYPE_SSH="FILE"
DRIVER="raw"
DS_MAD="ceph"
LN_TARGET="SYSTEM"
LN_TARGET_SHARED="SYSTEM"
LN_TARGET_SSH="SYSTEM"
POOL_NAME="one_images"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
TM_MAD="ceph"
TM_MAD_SYSTEM="shared"
TYPE="IMAGE_DS"

ONED LOG:

Tue Apr 9 14:55:12 2019 [Z0][VMM][I]: premigrate: Moving hypervisor01:/var/lib/one//datastores/102/87 to hypervisor02:/var/lib/one//datastores/102/87
Tue Apr 9 14:55:12 2019 [Z0][VMM][E]: premigrate: Command "set -e -o pipefail
Tue Apr 9 14:55:12 2019 [Z0][VMM][I]: tar -C /var/lib/one//datastores/102 --sparse -cf - 87 | ssh hypervisor02 'tar -C /var/lib/one//datastores/102 --sparse -xf -'" failed: tar: 87: Cannot stat: No such file or directory
Tue Apr 9 14:55:12 2019 [Z0][VMM][I]: tar: Exiting with failure status due to previous errors
Tue Apr 9 14:55:12 2019 [Z0][VMM][E]: Error copying disk directory to target host
Tue Apr 9 14:55:12 2019 [Z0][VMM][I]: Failed to execute transfer manager driver operation: tm_premigrate.

When I set TM_MAD to “shared” on the HDD (102) and SSD (101) datastores, migration appears to work. What am I doing wrong? Is this how it should be set up?

The other issue is that even with CLONE_TARGET_* and LN_TARGET_* set to “SYSTEM”, disks don’t appear to be copied to the system datastores:

$ rbd ls -p one_images --id libvirt
one-11
one-11-87-0
$ rbd ls -p one_ssd --id libvirt
<empty>
$ rbd ls -p one_hdd --id libvirt
<empty>
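
For repeated checks, the three listings above could be wrapped in a small loop. This is only a sketch: the function name list_rbd_pools and the RBD_CMD override are made up for illustration (the override just allows a dry run without a cluster), and the only real command used is the same "rbd ls -p POOL --id USER" shown above.

```shell
# Sketch: list the RBD images in several pools in one go.
# list_rbd_pools and RBD_CMD are hypothetical names, not part of Ceph or OpenNebula.
list_rbd_pools() {
    rbd_id="$1"
    shift
    for pool in "$@"; do
        echo "== $pool =="
        # Same command as the manual checks; RBD_CMD defaults to the real rbd binary.
        "${RBD_CMD:-rbd}" ls -p "$pool" --id "$rbd_id" \
            || echo "rbd ls failed for pool $pool" >&2
    done
}
```

It would be called as: list_rbd_pools libvirt one_images one_ssd one_hdd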

I would appreciate any help. I’ve pored over the documentation and must be missing something. It’s worth noting that we also have a shared file system mounted at /var/lib/one/datastores/.

Hi,

I think I understand what you’re trying to do with NFS mounted at the datastores directory, but I think that is the cause of your issue:
"tar -C /var/lib/one//datastores/102 --sparse -cf - 87 | ssh hypervisor02 'tar -C /var/lib/one//datastores/102 --sparse -xf -'" failed: tar: 87: Cannot stat: No such file or directory

FWIW here is my functional ceph (1 large cache tiered pool) system datastore:
ALLOW_ORPHANS="YES"
BRIDGE_LIST="cloud1.test.com"
CEPH_HOST="cephmon1.test.com cephmon2.test.com cephmon3.test.com"
CEPH_SECRET="secret"
CEPH_USER="user"
DISK_TYPE="RBD"
DS_MIGRATE="YES"
POOL_NAME="hddpool"
RBD_FORMAT="2"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="ceph"
TYPE="SYSTEM_DS"

Image:
ALLOW_ORPHANS="YES"
BRIDGE_LIST="cloud1.test.com"
CEPH_HOST="cephmon1.test.com cephmon2.test.com cephmon3.test.com"
CEPH_SECRET="secret"
CEPH_USER="user"
CLONE_TARGET="SELF"
DISK_TYPE="RBD"
DRIVER="raw"
DS_MAD="ceph"
LN_TARGET="NONE"
POOL_NAME="hddpool"
RBD_FORMAT="2"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
TM_MAD="ceph"
TYPE="IMAGE_DS"

I do not have any clone or ln_target attributes, and live migration works as expected. Give it a try without the shared directory and with minimal Ceph datastore configs.

The ceph TM_MAD, when used for a SYSTEM datastore, does not support a shared filesystem underneath.
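
To see which datastore directories actually sit on a network filesystem, something like the sketch below could be run on each hypervisor. It assumes GNU stat (as shipped with CentOS 7), and both function names and the list of filesystem names treated as "network" are assumptions for illustration, so extend or adjust them for your setup.

```shell
# Sketch: report the filesystem type under each datastore directory.
# classify_fstype and report_datastore_fs are hypothetical helper names.

# Classify a filesystem type name as "network" or "local".
# The list of network filesystem names is an assumption, not exhaustive.
classify_fstype() {
    case "$1" in
        nfs*|cifs|smb*|glusterfs|lustre|ceph) echo network ;;
        *) echo local ;;
    esac
}

# Print "<path> <fstype> <class>" for every directory given as an argument.
report_datastore_fs() {
    for dir in "$@"; do
        # GNU stat: -f queries the filesystem, %T prints its type name.
        if fstype=$(stat -f -c %T "$dir" 2>/dev/null); then
            echo "$dir $fstype $(classify_fstype "$fstype")"
        else
            echo "$dir unknown unknown"
        fi
    done
}
```

For example: report_datastore_fs /var/lib/one/datastores/101 /var/lib/one/datastores/102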

IMO the best practice is to mount the shared filesystems outside of /var/lib/one/datastores and create symlinks only for the datastore IDs that are backed by the shared TM_MAD.

Something like:

ln -s /sharedfilesystem/xxx /var/lib/one/datastores/xxx
ln -s /sharedfilesystem/yyy /var/lib/one/datastores/yyy

Then leave the Ceph datastore directory on the local disks of the hypervisors (not shared), or mount it on a different folder for each host.
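
The symlink step above could be scripted roughly as follows. This is a sketch under assumptions: link_shared_datastores is a made-up helper name, and the shared mount point and datastore IDs are placeholders you would pass in yourself. It skips IDs whose source directory is missing or whose target already exists, so nothing is overwritten.

```shell
# Sketch: symlink only the shared-TM_MAD datastore IDs into the datastores directory.
# link_shared_datastores is a hypothetical helper; all paths and IDs are supplied by the caller.
link_shared_datastores() {
    shared_root="$1"   # where the shared filesystem is mounted
    ds_root="$2"       # e.g. /var/lib/one/datastores
    shift 2
    for id in "$@"; do
        src="$shared_root/$id"
        dst="$ds_root/$id"
        if [ ! -d "$src" ]; then
            echo "skip $id: $src is not a directory" >&2
            continue
        fi
        if [ -e "$dst" ] || [ -L "$dst" ]; then
            echo "skip $id: $dst already exists" >&2
            continue
        fi
        ln -s "$src" "$dst"
    done
}
```

For example: link_shared_datastores /sharedfilesystem /var/lib/one/datastores xxx yyy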

Hope this helps.

Best Regards,
Anton Todorov

Hello @Barry
I am experiencing the same problem with a Ceph datastore when migrating a VM. Please share your experience if you resolved the issue.