LizardFS datastore and scheduling error about unsupported qcow2 transfer mode

Hello.

In preparation for the migration from 5.8 to 5.12, we are finishing our new infrastructure based on LizardFS.

We set up our new hypervisors with LizardFS storage, but we are seeing messages like:

Fri Jul 10 12:06:18 2020 [Z0][VM][E]: Error deploying virtual machine 380825 to HID: 15. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2
Fri Jul 10 12:06:18 2020 [Z0][VM][E]: Error deploying virtual machine 380825 to HID: 14. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2
Fri Jul 10 12:06:18 2020 [Z0][VM][E]: Error deploying virtual machine 380825 to HID: 13. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2
Fri Jul 10 12:06:18 2020 [Z0][VM][E]: Error deploying virtual machine 380825 to HID: 16. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2
Fri Jul 10 12:06:18 2020 [Z0][VM][E]: Error deploying virtual machine 380825 to HID: 17. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

We set up the transfer driver and the datastore driver.

The transition from the previous SAN-based hypervisors to the new ones was done in several steps:

Images stored on the SAN but usable on the new hypervisors

  • NFS-mount the SAN-backed image datastores (with TM_MAD=qcow2) on the new hypervisors
  • create a SHARED system datastore on the new hypervisors (backed by LizardFS); a sketch of this step follows the list
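
For reference, a minimal sketch of that second step, assuming the system datastore is created from a template file and then attached to cluster 102 (the file name is just an example; the resulting datastore is shown below):

    cat > test-cluster-system.ds <<'EOF'
    NAME   = "test-cluster-system"
    TYPE   = "SYSTEM_DS"
    TM_MAD = "shared"
    EOF
    onedatastore create test-cluster-system.ds
    onecluster adddatastore 102 test-cluster-system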

Here is its information:

DATASTORE 107 INFORMATION
ID             : 107
NAME           : test-cluster-system
USER           : nebula
GROUP          : oneadmin
CLUSTERS       : 102
TYPE           : SYSTEM
DS_MAD         : -
TM_MAD         : shared
BASE PATH      : /var/lib/one//datastores/107
DISK_TYPE      : FILE
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 36.4T
FREE:          : 26.1T
USED:          : 10.3T
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="NO"
DISK_TYPE="FILE"
DS_MIGRATE="YES"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="shared"
TYPE="SYSTEM_DS"

This way, the new hypervisors can run VMs, but the images are copied from the NFS mount.

New LizardFS datastores

As it was not used before, we repurposed the default datastore as the new LizardFS image datastore:

DATASTORE 1 INFORMATION
ID             : 1
NAME           : default
USER           : nebula
GROUP          : oneadmin
CLUSTERS       : 102
TYPE           : IMAGE
DS_MAD         : lizardfs
TM_MAD         : lizardfs
BASE PATH      : /var/lib/one//datastores/1
DISK_TYPE      : FILE
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 36.4T
FREE:          : 26.1T
USED:          : 10.3T
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="YES"
BRIDGE_LIST="nebula80 nebula81 nebula82 nebula83 nebula84"
CLONE_TARGET="SYSTEM"
CLONE_TARGET_SHARED="SYSTEM"
DISK_TYPE="FILE"
DISK_TYPE_SHARED="FILE"
DRIVER="qcow2"
DS_MAD="lizardfs"
LN_TARGET="NONE"
LN_TARGET_SHARED="NONE"
TM_MAD="lizardfs"
TM_MAD_SYSTEM="shared"
TYPE="IMAGE_DS"

And since it was not used either, we repurposed the system datastore as the new LizardFS system datastore:

DATASTORE 0 INFORMATION
ID             : 0
NAME           : system
USER           : nebula
GROUP          : oneadmin
CLUSTERS       : 102
TYPE           : SYSTEM
DS_MAD         : -
TM_MAD         : lizardfs
BASE PATH      : /var/lib/one//datastores/0
DISK_TYPE      : FILE
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 36.4T
FREE:          : 26.1T
USED:          : 10.3T
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="YES"
DS_MIGRATE="YES"
SHARED="YES"
TM_MAD="lizardfs"
TYPE="SYSTEM_DS"

Unable to disable the test-cluster-system system datastore

Now that we are ready to clean up the old setup, I tried to disable test-cluster-system before removing it once all the VMs have been migrated, but this results in the error message: Error deploying virtual machine X to HID: Y. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2
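
For reference, disabling it is just the standard CLI call (107 being the test-cluster-system datastore shown above):

    onedatastore disable 107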

Do you have any suggestion about what I could have missed?

Regards.

oned.conf information

I modified the qcow2 TM_MAD_CONF as described in another post:

TM_MAD_CONF = [
    NAME = "qcow2", LN_TARGET = "NONE", CLONE_TARGET = "SYSTEM", SHARED = "YES",
    DRIVER = "qcow2", TM_MAD_SYSTEM = "ssh,shared",
    LN_TARGET_SSH = "SYSTEM", CLONE_TARGET_SSH = "SYSTEM", DISK_TYPE_SSH = "FILE",
    LN_TARGET_SHARED = "SYSTEM", CLONE_TARGET_SHARED = "SYSTEM", DISK_TYPE_SHARED = "FILE"
]

Here is the configuration for LizardFS:

TM_MAD_CONF = [
    NAME = "lizardfs",
    LN_TARGET = "NONE",
    CLONE_TARGET = "SYSTEM",
    SHARED = "YES",
    DS_MIGRATE = "YES",
    ALLOW_ORPHANS = "YES",

    TM_MAD_SYSTEM = "shared",
    LN_TARGET_SHARED = "NONE",
    CLONE_TARGET_SHARED = "SYSTEM",
    DISK_TYPE_SHARED = "FILE"
]

and

DS_MAD_CONF = [
    NAME = "lizardfs",
    REQUIRED_ATTRS = "",
    PERSISTENT_ONLY = "NO",
    MARKETPLACE_ACTIONS = "export"
]
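
Note that oned has to be restarted for the TM_MAD_CONF/DS_MAD_CONF changes to be taken into account; on our front-end that is (assuming a systemd-based installation):

    systemctl restart opennebula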

I don’t understand, because when I select the system datastore manually during VM creation, it works fine :-/
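
(For illustration only, one way to force a given system datastore from the template side is a scheduler requirement like the one below, using datastore 0 as an example; this is just a sketch of what selecting it manually amounts to:)

    SCHED_DS_REQUIREMENTS = "ID=\"0\""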

Digging into the code, the offending part is in src/vm/VirtualMachineDisk.cc:

    if ( ds_img->get_tm_mad_targets(tm_mad, ln_target, clone_target,
                disk_type) != 0 )
    {
        error = "Image Datastore does not support transfer mode: " + tm_mad;

        ds_img->unlock();
        return -1;
    }

I don’t understand where the tm_mad=qcow2 comes from since the default image datastore has TM_MAD=lizardfs.

I haven’t managed to trace it in the source code.

To me, when a VM is scheduled, there are two possible system datastores where it can be deployed:

  • the system datastore, which has TM_MAD=lizardfs
  • the test-cluster-system datastore, which has TM_MAD=shared, compatible with the default image datastore’s TM_MAD_SYSTEM=shared

I would have supposed that the default image datastore’s TM_MAD attribute would take priority over TM_MAD_SYSTEM.

Any idea what I’m missing?

Regards.

Can you check where the VM is being deployed? In oned.log, on the Error deploying virtual machine X to HID: Y. line, does it output any datastore? You can also look at sched.log to see where the scheduler is trying to deploy the VM, and look for the datastore there. We want to check the TM_MAD of the system DS where the VM is being deployed.

Also could you check if the VM has a TM_MAD_SYSTEM attribute set?

Hello Ruben.

Thanks a lot for your valuable advice, you gave me great hints.

I was so confused that I completely forgot to check oned.log, which shows the attempt on the 103 SYSTEM_DS that I thought was out of the game.

For the record, here are my replies to your questions.

It takes 3 minutes for a VM to be deployed, which is too long for our Jenkins, which terminates the jobs with a timeout:

grep 382185 /var/log/one/sched.log

Mon Jul 13 21:07:42 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 13. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

Mon Jul 13 21:08:12 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 13. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

Mon Jul 13 21:08:43 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 15. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

Mon Jul 13 21:09:14 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 15. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

Mon Jul 13 21:09:45 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 15. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

Mon Jul 13 21:10:15 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 15. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

Mon Jul 13 21:10:46 2020 [Z0][VM][E]: Error deploying virtual machine 382185 to HID: 15. Reason: [one.vm.deploy] Image Datastore does not support transfer mode: qcow2

[4 times previous message]

	VMID	Priority	Host	System DS
	382185	0		15	107

grep 382185 /var/log/one/oned.log

Mon Jul 13 21:07:42 2020 [Z0][ReM][D]: Req:9184 UID:0 IP:127.0.0.1 one.vm.deploy invoked , 382185, 13, false, 103, ""
Mon Jul 13 21:07:42 2020 [Z0][ReM][D]: Req:1328 UID:0 IP:127.0.0.1 one.vm.deploy invoked , 382185, 15, false, 103, ""
Mon Jul 13 21:07:42 2020 [Z0][ReM][D]: Req:8848 UID:0 IP:127.0.0.1 one.vm.deploy invoked , 382185, 17, false, 103, ""
Mon Jul 13 21:07:42 2020 [Z0][ReM][D]: Req:9744 UID:0 IP:127.0.0.1 one.vm.deploy invoked , 382185, 16, false, 103, ""
Mon Jul 13 21:07:42 2020 [Z0][ReM][D]: Req:3904 UID:0 IP:127.0.0.1 one.vm.deploy invoked , 382185, 14, false, 103, ""

Ok, I found that another SYSTEM_DS is meddling with our setup.

We had 3 clusters, and they are now replaced by our new, much more powerful hyperconverged LizardFS one.

What I did not understand was that all the SYSTEM_DS of the new cluster are tried, even datastore 103, which is backed by LizardFS but with TM_MAD=qcow2 (it was set up in a huge rush, before the lizardfs TM_MAD scripts were in place, because the corresponding hypervisor died).

I will define COMPATIBLE_SYS_DS to restrict which SYSTEM_DS should be tried depending on the source IMAGE_DS.
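
Something like this on the default image datastore, so the scheduler only considers the two compatible system datastores (a sketch; the file name is illustrative and I'm assuming --append to merge the attribute):

    cat > compat-sys-ds.txt <<'EOF'
    COMPATIBLE_SYS_DS = "0,107"
    EOF
    onedatastore update 1 compat-sys-ds.txt --append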

This option will finally allow us to empty the test-cluster-system datastore.

Yes, the TM_MAD_SYSTEM attribute was set:

onevm show -x 382185
[...]
    <DISK>
      <ALLOW_ORPHANS><![CDATA[YES]]></ALLOW_ORPHANS>
      <CLONE><![CDATA[YES]]></CLONE>
      <CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET>
      <CLUSTER_ID><![CDATA[102]]></CLUSTER_ID>
      <DATASTORE><![CDATA[default]]></DATASTORE>
      <DATASTORE_ID><![CDATA[1]]></DATASTORE_ID>
      <DEV_PREFIX><![CDATA[sd]]></DEV_PREFIX>
      <DISK_ID><![CDATA[0]]></DISK_ID>
      <DISK_SNAPSHOT_TOTAL_SIZE><![CDATA[0]]></DISK_SNAPSHOT_TOTAL_SIZE>
      <DISK_TYPE><![CDATA[FILE]]></DISK_TYPE>
      <DRIVER><![CDATA[qcow2]]></DRIVER>
      <IMAGE><![CDATA[aca.zephir-2.7.2-instance-default-amd64.vm]]></IMAGE>
      <IMAGE_ID><![CDATA[68604]]></IMAGE_ID>
      <IMAGE_STATE><![CDATA[2]]></IMAGE_STATE>
      <IMAGE_UNAME><![CDATA[jenkins]]></IMAGE_UNAME>
      <LN_TARGET><![CDATA[NONE]]></LN_TARGET>
      <ORDER><![CDATA[1]]></ORDER>
      <ORIGINAL_SIZE><![CDATA[51200]]></ORIGINAL_SIZE>
      <READONLY><![CDATA[NO]]></READONLY>
      <SAVE><![CDATA[NO]]></SAVE>
      <SIZE><![CDATA[51200]]></SIZE>
      <SOURCE><![CDATA[/var/lib/one//datastores/1/0902efe48404ef12018c35a7feb6be19]]></SOURCE>
      <TARGET><![CDATA[sda]]></TARGET>
      <TM_MAD><![CDATA[lizardfs]]></TM_MAD>
      <TM_MAD_SYSTEM><![CDATA[shared]]></TM_MAD_SYSTEM>
      <TYPE><![CDATA[FILE]]></TYPE>
    </DISK>

Maybe the scheduler could just filter out incompatible SYSTEM_DS and only try our datastores 0 and 107, which are compatible (TM_MAD=lizardfs and TM_MAD=shared, respectively)?

This is finally solved, thanks.

Glad to hear you solved the issue :slight_smile: