(BOOT_MIGRATE_FAILURE) during datastore migration with OpenNebula 4.14.0, BUG?

Hi everyone,

I encountered some problems with datastore migration between two system datastores.

1- BOOT_MIGRATE_FAILURE Problem :

This is my current configuration :

  • The “system” (0) and “default” (1) datastores use an iSCSI SAN backend.
  • The “gluster-image” (100) datastore uses a GlusterFS backend, and “gluster-system” (102) is a symlink (ln -s) to it.

When I try to migrate a VM (with a non-persistent disk) I get a BOOT_MIGRATE_FAILURE.
I get the same problem in both directions: “GlusterFS -> iSCSI backend” and “iSCSI -> GlusterFS backend”.

In this example I migrate from the GlusterFS backend to the iSCSI backend. These are the logs:

Wed Dec 23 12:52:19 2015 [Z0][VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore '/var/lib/one//datastores/0/234/checkpoint' 'onehost-test-kvm02' 'one-234' 234 onehost-test-kvm02
Wed Dec 23 12:52:19 2015 [Z0][VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/0/234/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/0/234/checkpoint
Wed Dec 23 12:52:19 2015 [Z0][VMM][I]: error: Cannot access storage file '/var/lib/one//datastores/102/234/disk.1' (as uid:1000, gid:1000): No such file or directory
Wed Dec 23 12:52:19 2015 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/0/234/checkpoint
Wed Dec 23 12:52:19 2015 [Z0][VMM][I]: ExitCode: 1
Wed Dec 23 12:52:19 2015 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Wed Dec 23 12:52:19 2015 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/234/checkpoint
Wed Dec 23 12:52:19 2015 [Z0][VM][I]: New LCM state is BOOT_MIGRATE_FAILURE

This is the content of “/var/lib/one//datastores/0/234/checkpoint” after Migration :

<name>one-234</name>
...
<system_datastore>/var/lib/one//datastores/102/234</system_datastore>
...
<source file='/var/lib/one//datastores/102/234/disk.1'/>

The path should be “/var/lib/one//datastores/0/234”.
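
To make the expected rewrite concrete, here is a minimal shell sketch of the path substitution. It is only an illustration, not the actual restore script: it assumes the checkpoint's XML is available as text (for instance dumped with `virsh save-image-dumpxml`), and the helper name `rewrite_ds_path` is hypothetical.

```shell
#!/bin/sh
# rewrite_ds_path OLD_DS NEW_DS VM_ID  (hypothetical helper, reads XML text on stdin)
# Replaces .../datastores/OLD_DS/VM_ID with .../datastores/NEW_DS/VM_ID.
rewrite_ds_path() {
    old=$1 new=$2 vm=$3
    # First expression: path followed by a non-digit, so VM 234 never matches 2345.
    # Second expression: path at the very end of a line.
    sed -e "s|/datastores/${old}/${vm}\([^0-9]\)|/datastores/${new}/${vm}\1|g" \
        -e "s|/datastores/${old}/${vm}\$|/datastores/${new}/${vm}|"
}

# Example with the disk line from my checkpoint:
printf '%s\n' "<source file='/var/lib/one//datastores/102/234/disk.1'/>" \
    | rewrite_ds_path 102 0 234
```

After such a substitution both the system_datastore element and the disk source would point at datastore 0, which is what I expected to find after the migration.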

2- Migration problem after opening the “Migrate” option multiple times:

When I open the “Migrate” option more than once, there is a bug.

  • At the top of the window there are several entries like the following, depending on how many times I clicked.
    For example, if I open it three times, I get this:
    VM [207] is currently running on Host [onehost-test-kvm02]
    VM [207] is currently running on Host [onehost-test-kvm02]
    VM [234] is currently running on Host [onehost-test-kvm02]

Thanks in advance for your reply,

Sincerely,

Hi,

Could you provide the output of onedatastore list -x so I can get familiar with your datastore setup?

Kind Regards,
Anton Todorov

Here is the output of onedatastore list -x :

<DATASTORE_POOL>
  <DATASTORE>
    <ID>0</ID>
    <UID>0</UID>
    <GID>0</GID>
    <UNAME>oneadmin</UNAME>
    <GNAME>oneadmin</GNAME>
    <NAME>system</NAME>
    <PERMISSIONS>
      <OWNER_U>1</OWNER_U>
      <OWNER_M>1</OWNER_M>
      <OWNER_A>0</OWNER_A>
      <GROUP_U>1</GROUP_U>
      <GROUP_M>0</GROUP_M>
      <GROUP_A>0</GROUP_A>
      <OTHER_U>0</OTHER_U>
      <OTHER_M>0</OTHER_M>
      <OTHER_A>0</OTHER_A>
    </PERMISSIONS>
    <DS_MAD><![CDATA[-]]></DS_MAD>
    <TM_MAD><![CDATA[shared]]></TM_MAD>
    <BASE_PATH><![CDATA[/var/lib/one//datastores/0]]></BASE_PATH>
    <TYPE>1</TYPE>
    <DISK_TYPE>0</DISK_TYPE>
    <STATE>0</STATE>
    <CLUSTER_ID>-1</CLUSTER_ID>
    <CLUSTER/>
    <TOTAL_MB>1025276</TOTAL_MB>
    <FREE_MB>953856</FREE_MB>
    <USED_MB>71421</USED_MB>
    <IMAGES/>
    <TEMPLATE>
      <BASE_PATH><![CDATA[/var/lib/one//datastores/]]></BASE_PATH>
      <SHARED><![CDATA[YES]]></SHARED>
      <TM_MAD><![CDATA[shared]]></TM_MAD>
      <TYPE><![CDATA[SYSTEM_DS]]></TYPE>
    </TEMPLATE>
  </DATASTORE>
  <DATASTORE>
    <ID>1</ID>
    <UID>0</UID>
    <GID>0</GID>
    <UNAME>oneadmin</UNAME>
    <GNAME>oneadmin</GNAME>
    <NAME>default</NAME>
    <PERMISSIONS>
      <OWNER_U>1</OWNER_U>
      <OWNER_M>1</OWNER_M>
      <OWNER_A>0</OWNER_A>
      <GROUP_U>1</GROUP_U>
      <GROUP_M>0</GROUP_M>
      <GROUP_A>0</GROUP_A>
      <OTHER_U>0</OTHER_U>
      <OTHER_M>0</OTHER_M>
      <OTHER_A>0</OTHER_A>
    </PERMISSIONS>
    <DS_MAD><![CDATA[fs]]></DS_MAD>
    <TM_MAD><![CDATA[shared]]></TM_MAD>
    <BASE_PATH><![CDATA[/var/lib/one//datastores/1]]></BASE_PATH>
    <TYPE>0</TYPE>
    <DISK_TYPE>0</DISK_TYPE>
    <STATE>0</STATE>
    <CLUSTER_ID>-1</CLUSTER_ID>
    <CLUSTER/>
    <TOTAL_MB>1025276</TOTAL_MB>
    <FREE_MB>953856</FREE_MB>
    <USED_MB>71421</USED_MB>
    <IMAGES>
      <ID>3</ID>
      <ID>9</ID>
      <ID>10</ID>
      <ID>29</ID>
      <ID>31</ID>
      <ID>38</ID>
      <ID>39</ID>
      <ID>42</ID>
      <ID>49</ID>
      <ID>73</ID>
      <ID>74</ID>
      <ID>77</ID>
      <ID>81</ID>
      <ID>82</ID>
      <ID>84</ID>
      <ID>95</ID>
    </IMAGES>
    <TEMPLATE>
      <BASE_PATH><![CDATA[/var/lib/one//datastores/]]></BASE_PATH>
      <CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET>
      <DISK_TYPE><![CDATA[FILE]]></DISK_TYPE>
      <DS_MAD><![CDATA[fs]]></DS_MAD>
      <LN_TARGET><![CDATA[NONE]]></LN_TARGET>
      <TM_MAD><![CDATA[shared]]></TM_MAD>
      <TYPE><![CDATA[IMAGE_DS]]></TYPE>
    </TEMPLATE>
  </DATASTORE>
  <DATASTORE>
    <ID>2</ID>
    <UID>0</UID>
    <GID>0</GID>
    <UNAME>oneadmin</UNAME>
    <GNAME>oneadmin</GNAME>
    <NAME>files</NAME>
    <PERMISSIONS>
      <OWNER_U>1</OWNER_U>
      <OWNER_M>1</OWNER_M>
      <OWNER_A>0</OWNER_A>
      <GROUP_U>1</GROUP_U>
      <GROUP_M>0</GROUP_M>
      <GROUP_A>0</GROUP_A>
      <OTHER_U>0</OTHER_U>
      <OTHER_M>0</OTHER_M>
      <OTHER_A>0</OTHER_A>
    </PERMISSIONS>
    <DS_MAD><![CDATA[fs]]></DS_MAD>
    <TM_MAD><![CDATA[ssh]]></TM_MAD>
    <BASE_PATH><![CDATA[/var/lib/one//datastores/2]]></BASE_PATH>
    <TYPE>2</TYPE>
    <DISK_TYPE>0</DISK_TYPE>
    <STATE>0</STATE>
    <CLUSTER_ID>-1</CLUSTER_ID>
    <CLUSTER/>
    <TOTAL_MB>1025276</TOTAL_MB>
    <FREE_MB>953856</FREE_MB>
    <USED_MB>71421</USED_MB>
    <IMAGES/>
    <TEMPLATE>
      <BASE_PATH><![CDATA[/var/lib/one//datastores/]]></BASE_PATH>
      <CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET>
      <DS_MAD><![CDATA[fs]]></DS_MAD>
      <LN_TARGET><![CDATA[SYSTEM]]></LN_TARGET>
      <TM_MAD><![CDATA[ssh]]></TM_MAD>
      <TYPE><![CDATA[FILE_DS]]></TYPE>
    </TEMPLATE>
  </DATASTORE>
  <DATASTORE>
    <ID>100</ID>
    <UID>0</UID>
    <GID>0</GID>
    <UNAME>oneadmin</UNAME>
    <GNAME>oneadmin</GNAME>
    <NAME>gluster-image</NAME>
    <PERMISSIONS>
      <OWNER_U>1</OWNER_U>
      <OWNER_M>1</OWNER_M>
      <OWNER_A>0</OWNER_A>
      <GROUP_U>1</GROUP_U>
      <GROUP_M>0</GROUP_M>
      <GROUP_A>0</GROUP_A>
      <OTHER_U>0</OTHER_U>
      <OTHER_M>0</OTHER_M>
      <OTHER_A>0</OTHER_A>
    </PERMISSIONS>
    <DS_MAD><![CDATA[fs]]></DS_MAD>
    <TM_MAD><![CDATA[shared]]></TM_MAD>
    <BASE_PATH><![CDATA[/var/lib/one//datastores/100]]></BASE_PATH>
    <TYPE>0</TYPE>
    <DISK_TYPE>0</DISK_TYPE>
    <STATE>0</STATE>
    <CLUSTER_ID>-1</CLUSTER_ID>
    <CLUSTER/>
    <TOTAL_MB>64380</TOTAL_MB>
    <FREE_MB>46155</FREE_MB>
    <USED_MB>14933</USED_MB>
    <IMAGES/>
    <TEMPLATE>
      <BASE_PATH><![CDATA[/var/lib/one//datastores/]]></BASE_PATH>
      <CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET>
      <DATASTORE_CAPACITY_CHECK><![CDATA[NO]]></DATASTORE_CAPACITY_CHECK>
      <DISK_TYPE><![CDATA[FILE]]></DISK_TYPE>
      <DS_MAD><![CDATA[fs]]></DS_MAD>
      <LN_TARGET><![CDATA[NONE]]></LN_TARGET>
      <TM_MAD><![CDATA[shared]]></TM_MAD>
      <TYPE><![CDATA[IMAGE_DS]]></TYPE>
    </TEMPLATE>
  </DATASTORE>
  <DATASTORE>
    <ID>102</ID>
    <UID>0</UID>
    <GID>0</GID>
    <UNAME>oneadmin</UNAME>
    <GNAME>oneadmin</GNAME>
    <NAME>gluster-system</NAME>
    <PERMISSIONS>
      <OWNER_U>1</OWNER_U>
      <OWNER_M>1</OWNER_M>
      <OWNER_A>0</OWNER_A>
      <GROUP_U>1</GROUP_U>
      <GROUP_M>0</GROUP_M>
      <GROUP_A>0</GROUP_A>
      <OTHER_U>0</OTHER_U>
      <OTHER_M>0</OTHER_M>
      <OTHER_A>0</OTHER_A>
    </PERMISSIONS>
    <DS_MAD><![CDATA[-]]></DS_MAD>
    <TM_MAD><![CDATA[shared]]></TM_MAD>
    <BASE_PATH><![CDATA[/var/lib/one//datastores/102]]></BASE_PATH>
    <TYPE>1</TYPE>
    <DISK_TYPE>0</DISK_TYPE>
    <STATE>0</STATE>
    <CLUSTER_ID>-1</CLUSTER_ID>
    <CLUSTER/>
    <TOTAL_MB>64380</TOTAL_MB>
    <FREE_MB>46155</FREE_MB>
    <USED_MB>14933</USED_MB>
    <IMAGES/>
    <TEMPLATE>
      <BASE_PATH><![CDATA[/var/lib/one//datastores/]]></BASE_PATH>
      <DATASTORE_CAPACITY_CHECK><![CDATA[NO]]></DATASTORE_CAPACITY_CHECK>
      <SHARED><![CDATA[YES]]></SHARED>
      <TM_MAD><![CDATA[shared]]></TM_MAD>
      <TYPE><![CDATA[SYSTEM_DS]]></TYPE>
    </TEMPLATE>
  </DATASTORE>
</DATASTORE_POOL>

Hi Sébastien,

I can’t spot any issue with the configuration of the datastores. I’ve tried to reproduce the problem, but at least on the latest stable release, 4.14.2, I could not reproduce it.

Are you running an older version of OpenNebula?

Kind Regards,
Anton Todorov

Hi Anton,

I’m using OpenNebula 4.14.0.

The OpenNebula datastore configuration looks like this:

  ID NAME                SIZE AVAIL CLUSTER      IMAGES TYPE DS      TM      STAT
   0 system           1001.2G 93%   -                 0 sys  -       shared  on  
   1 default          1001.2G 93%   -                16 img  fs      shared  on  
   2 files            1001.2G 93%   -                 0 fil  fs      ssh     on  
 100 gluster-image      62.9G 72%   -                 0 img  fs      shared  on  
 102 gluster-system      62.9G 72%   -                 0 sys  -       shared  on 

The datastore backends are the following:
0 (System) = ISCSI SAN mounted on /var/lib/one/datastores/0
1 (Default) = ISCSI SAN mounted on /var/lib/one/datastores/1
100 (gluster-image) = GlusterFS Volume mounted on /var/lib/one/datastores/100
102 (gluster-system) = link (ln) /var/lib/one/datastores/102 -> /var/lib/one/datastores/100

Kind Regards,

Hi Sébastien,

The manipulation of the checkpoint XML is done in the /var/lib/one/remotes/vmm/kvm/restore script, which is the same in 4.14.0 and 4.14.2.

It is possible that the symlink is causing the problem, for example if the gluster-image datastore already contains a directory named after the VM_ID. But in that case the error log would say that the context.xml could not be found. (I hit this case myself, but it was a bug in our addon: our driver created the destination directory in advance.)

In that case the mv script (/var/lib/one/remotes/tm/shared/mv) created an additional subdirectory during the move. But as I said, a different error message is definitely logged then.

Can you separate gluster-system from gluster-image? Something like:

mkdir /var/lib/one/datastores/100/gluster-system
ln -s /var/lib/one/datastores/100/gluster-system /var/lib/one/datastores/102

This way we will be sure that there are no such issues.
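
A quick way to confirm whether two datastore paths end up in the same physical directory is sketched below. It is only an illustration; the helper name `same_physical_dir` is hypothetical and not part of OpenNebula.

```shell
#!/bin/sh
# same_physical_dir A B  (hypothetical helper)
# Succeeds when both paths resolve to the same directory after following symlinks.
# Returns 2 when either path cannot be entered.
same_physical_dir() {
    a=$(cd "$1" 2>/dev/null && pwd -P) || return 2
    b=$(cd "$2" 2>/dev/null && pwd -P) || return 2
    [ "$a" = "$b" ]
}

# Usage on the front-end or on a host (paths from this thread):
#   same_physical_dir /var/lib/one/datastores/100 /var/lib/one/datastores/102 \
#       && echo "100 and 102 share one directory: VM_ID subdirectories can collide"
```

With the current layout the check should succeed for 100 and 102; after moving gluster-system into its own subdirectory it should fail, which is the desired state.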

Kind Regards,
Anton Todorov

Hi Sébastien,

Another idea just occurred to me: if you upgraded from an older OpenNebula, please check that the vmm/kvm/restore script is in sync with the 4.14.0 release.
/var/lib/one/remotes/vmm/kvm/restore on the front-end and /var/tmp/one/vmm/kvm/restore on the KVM hosts must be the same. If they are not, sync them from the front-end:

su - oneadmin
onehost sync --force
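
To check whether the two copies differ, before or after the sync, something along these lines can be used. It is only a sketch: the helper names `checksum_stdin` and `same_script` are hypothetical, and the ssh usage in the comment assumes that oneadmin can log in to the host without a password.

```shell
#!/bin/sh
# Print a checksum of whatever arrives on stdin (cksum is POSIX).
checksum_stdin() {
    cksum | awk '{print $1 ":" $2}'
}

# same_script FILE_A FILE_B: succeeds when both files have the same checksum.
same_script() {
    [ "$(checksum_stdin < "$1")" = "$(checksum_stdin < "$2")" ]
}

# On a real deployment the second copy would come over ssh, e.g.:
#   local_sum=$(checksum_stdin < /var/lib/one/remotes/vmm/kvm/restore)
#   remote_sum=$(ssh onehost-test-kvm02 cat /var/tmp/one/vmm/kvm/restore | checksum_stdin)
#   [ "$local_sum" = "$remote_sum" ] || echo "restore script out of sync"
```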

Kind Regards,
Anton Todorov

Good find Anton,

Indeed, my restore files are not the same.
I launched a sync. I will try again and let you know.

Just another question: should I define a cluster when using two different system datastores, or is this not a problem?

Many thanks,

Hi Sébastien,

I should have spotted it sooner. While reproducing the case I hit a bug in our addon and another one in OpenNebula (issue #4269).

As for the use of clusters, it depends on the given setup and on which resources need to be separated.

I mostly do development, and since I have different datastore configurations I use cluster definitions to select only the entities needed for the current test against our addon, so I am not a very good source of advice on whether a cluster definition is needed. IMO, for a relatively small setup without many different services offered, going without a cluster definition sounds OK, but if you want to know exactly which resources are given to the customers, the cluster definition is a must.

Kind Regards,
Anton Todorov

Hi Anton, Happy new year :slightly_smiling:

Indeed, I confirm that the sync solved the problem :slightly_smiling:

I have some more questions regarding images and Datastore :

1 - After a datastore migration of a VM with a non-persistent image, there is no way to see in which datastore the image is stored.
Is there anything on the roadmap about this?

I still have some confusion about persistent and non-persistent images with regard to the different possibilities (snapshots, image types such as “qcow2” and “raw”, …).

If you want to let your end customers define their own VMs with a specific disk size, you must use a non-persistent disk. However, with a non-persistent image, if there is any problem with the hypervisor or the infrastructure, you have no guarantee that you will not lose data.

How would you plan your installation in that case?

Kind Regards,
LEFEUVRE Sébastien

Hi Sébastien, Happy new year too :slightly_smiling:

First of all, I would like to clarify that I primarily work only on our addon, but to achieve full compatibility I need to dig deep into OpenNebula’s core to understand how it works. So the following statements reflect my own understanding of how things work.

Let’s first sync our understanding of the disk images. There are several types of disk images, which differ in their role, where they are defined, and where they reside while the VM is running:

  • Context image - it is an ISO file. AFAIK there is only one such image per VM. This image mainly acts as the courier that passes configuration parameters to the VM - IP addresses, hostname, files from the FILES DS, etc. The image is created in the SYSTEM DS (the VM home directory). With our addon it is possible to place it on the StorPool storage and create a symlink to it in the SYSTEM DS.

  • Volatile images. In almost all storage drivers they are created as files in the SYSTEM DS and destroyed when the VM is destroyed (shutdown and so on - everything under the red button destroys the VM. IMO in 4.14 they all mean the same - I can’t find the deferred snapshot functionality from pre-4.14 versions of OpenNebula). They have no snapshot functionality. Again, with our addon it is possible to place them on the StorPool storage and create symlinks to them in the SYSTEM DS.

  • Persistent images. They are defined in the IMAGES datastore and are “served” to the virtualization process on the hypervisor host via the datastore’s TM_MAD. Again, depending on the driver, they are moved from the IMAGES DS to the VM home in the SYSTEM DS. In our addon they are on StorPool storage (and yes, symlinks to them :slight_smile: )

  • Non-persistent images. They are copied/cloned from the “master” image that lives in the IMAGES DS (well, symlinked the same as all other types).

AFAIK the datastore migration applies only to the SYSTEM DS. So when you migrate a VM, only its “home” directory changes, moving to another SYSTEM DS on the hypervisor host. When you “Undeploy” a VM, all images are “transferred” back to the IMAGES DS, and on “Deploy” they are transferred to the SYSTEM DS on the hypervisor host. Please note that the latter scenario is not working as expected due to a bug (#4271).

You can find the OpenNebula’s roadmap at the dev portal.

There is no big difference between persistent and non-persistent images. The functionality depends on their underlying storage capability and what is implemented in the corresponding TM_MAD. If the underlying storage and TM_MAD support snapshots, then both have snapshots with slight differences - the snapshots of the non-persistent images are deleted when the VM is destroyed. The snapshots on the persistent images are kept. But on both you can “save” a snapshot as new image in the IMAGES DS :wink:

If I understand correctly, you are asking about the case when the hypervisor stops (for example a hardware or power failure)?
This is a good question, and finding a solution for this case took most of my development effort. Again, it depends on the underlying storage and the capabilities of the storage driver.

If the disk images are on local storage (TM_MAD SSH), the VM data is on the hypervisor’s disks, and if you are lucky it may be possible to transfer the disks/data to another hypervisor and run the VM there.

If the disk images are on shared storage - well, it depends on the storage :slight_smile: For files on shared storage (read: NFS and similar) there is no issue. For block devices it depends on the TM_MAD. As our addon supports all disk image types, the only task was how to properly “restore” the VM on another host. For this situation there is a custom fault tolerance script hooked to the HOST_ERROR event that migrates the VM and its disks to another host. The task is impossible in pre-4.14, extremely tricky on 4.14 and hopefully easier and faster on 5.x (feature #3958).

I hope this sheds some light on the disks and images :slightly_smiling:

Kind Regards,
Anton Todorov