VM fails to (HA) reschedule on host error - OpenNebula 5.0.1

heathen · July 13, 2016, 1:37pm

Hello!

I’m using Ceph storage (both the system and image datastores point to the same Ceph pool) and trying to set up VM HA as described in the docs (with HOST_HOOK). But the VM doesn’t start on another host when the original host fails:

/var/log/one/oned.log

Wed Jul 13 09:03:09 2016 [Z0][InM][I]: Command execution fail: ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 5 2 gamma; else exit 42; fi’
Wed Jul 13 09:03:09 2016 [Z0][InM][I]: ssh: connect to host gamma port 22: Connection timed out
Wed Jul 13 09:03:09 2016 [Z0][InM][I]: ExitCode: 255
Wed Jul 13 09:03:09 2016 [Z0][ONE][E]: Error monitoring Host gamma (2): -
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:7520 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:7520 UID:0 VirtualMachinePoolInfo result SUCCESS, “<VM_POOL>12<…”
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:9984 UID:0 VirtualMachineInfo invoked , 12
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:9984 UID:0 VirtualMachineInfo result SUCCESS, “12…”
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:1136 UID:0 VirtualMachineAction invoked , “resched”, 12
Wed Jul 13 09:03:09 2016 [Z0][DiM][D]: Setting rescheduling flag on VM 12
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:1136 UID:0 VirtualMachineAction result SUCCESS, 12
Wed Jul 13 09:03:09 2016 [Z0][HKM][D]: Message received: EXECUTE SUCCESS 2 error:

/var/log/one/12.log

Wed Jul 13 09:03:23 2016 [Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN
Wed Jul 13 09:03:23 2016 [Z0][VM][I]: New state is ACTIVE
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/ceph/mv gamma:/var/lib/one//datastores/101/12 beta:/var/lib/one//datastores/101/12 12 101
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: mv: Moving gamma:/var/lib/one/datastores/101/12 to beta:/var/lib/one/datastores/101/12
Wed Jul 13 09:03:26 2016 [Z0][TM][E]: mv: Command “eval ssh gamma ‘tar -C /var/lib/one/datastores/101 --sparse -cf - 12’ | ssh beta ‘tar -C /var/lib/one/datastores/101 --sparse -xf -’” failed: ssh: connect to host gamma port 22: Connection timed out
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: tar: This does not look like a tar archive
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: tar: Exiting with failure status due to previous errors
Wed Jul 13 09:03:26 2016 [Z0][TM][E]: Error copying disk directory to target host
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: ExitCode: 2
Wed Jul 13 09:03:26 2016 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Wed Jul 13 09:03:26 2016 [Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN_FAILURE

So the VM ends up in the FAILURE state, and it is impossible to recover it until the original host comes back up.
While the hosts are alive, both live and offline migrations work well.

Could anyone please suggest a way to debug this further?
Thanks!

atodorov_storpool · July 13, 2016, 2:28pm

Hi,

It looks like the Ceph TM_MAD is using the ‘mv’ script from the SSH TM_MAD.

The script is trying to move files from the dead host to the other one. Obviously, that is not possible.

Can you try editing /var/lib/one/remotes/tm/ceph/mv
and adding

[ lcm_state -eq 60 ] && exit 0

just before this line (around line 69):

log “Moving $SRC to $DST”

Kind Regards,
Anton Todorov

heathen · July 14, 2016, 4:52am

Anton,

thank you very much for your suggestion!

Now it works! The only thing is that it should look like this (I believe the parser ate the grave accent marks):

[ `lcm_state` -eq 60 ] && exit 0
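
For anyone who wants to see the guard in context, here is a minimal sketch of how it sits in front of the "Moving" log line. It is not the actual tm/ceph/mv script: `move_vm_dir` and `FAKE_LCM_STATE` are hypothetical names, and `lcm_state` and `log` are stubbed here only so the snippet runs outside OpenNebula (the real script gets them from the files it sources). I am also assuming that 60 is the numeric value of PROLOG_MIGRATE_UNKNOWN, the state shown in the VM log above.

```shell
#!/bin/bash
# Hypothetical stand-ins so the sketch runs on its own; the real script
# does not define these itself.
lcm_state() { echo "${FAKE_LCM_STATE:-3}"; }  # assumed: prints the VM's numeric LCM state
log()       { echo "mv: $1"; }

move_vm_dir() {
    local SRC=$1 DST=$2

    # Assumption: 60 == PROLOG_MIGRATE_UNKNOWN. The source host is
    # unreachable, so skip the copy; the disks are in the Ceph pool anyway.
    # The real script uses "exit 0"; "return 0" is used here only because
    # the sketch wraps the logic in a function.
    [ `lcm_state` -eq 60 ] && return 0

    log "Moving $SRC to $DST"
    # ... the tar-over-ssh copy from the original script would follow here ...
}
```

With `FAKE_LCM_STATE=60` set, calling `move_vm_dir` returns success without logging or copying anything; with any other value it falls through to the normal move path.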

Best regards,
Vladimir

atodorov_storpool · July 14, 2016, 6:58am

Hi @heathen,

I’ve made a pull request for the change. It adds a log message, and the check is moved to after the cleanup code for $DST_PATH on the destination host (a few lines below my first suggestion).

https://github.com/OpenNebula/one/pull/106

Kind Regards,
Anton Todorov
