VM fails to (HA) reschedule on host error - OpenNebula 5.0.1

Hello!

I’m using Ceph storage (both the system and image datastores point to the same Ceph pool) and am trying to set up VM HA as described in the docs (with HOST_HOOK). But the VM doesn’t start on another host when the original host fails:

/var/log/one/oned.log

Wed Jul 13 09:03:09 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 5 2 gamma; else exit 42; fi'
Wed Jul 13 09:03:09 2016 [Z0][InM][I]: ssh: connect to host gamma port 22: Connection timed out
Wed Jul 13 09:03:09 2016 [Z0][InM][I]: ExitCode: 255
Wed Jul 13 09:03:09 2016 [Z0][ONE][E]: Error monitoring Host gamma (2): -
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:7520 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:7520 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>12<…"
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:9984 UID:0 VirtualMachineInfo invoked , 12
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:9984 UID:0 VirtualMachineInfo result SUCCESS, "12…"
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:1136 UID:0 VirtualMachineAction invoked , "resched", 12
Wed Jul 13 09:03:09 2016 [Z0][DiM][D]: Setting rescheduling flag on VM 12
Wed Jul 13 09:03:09 2016 [Z0][ReM][D]: Req:1136 UID:0 VirtualMachineAction result SUCCESS, 12
Wed Jul 13 09:03:09 2016 [Z0][HKM][D]: Message received: EXECUTE SUCCESS 2 error:

/var/log/one/12.log

Wed Jul 13 09:03:23 2016 [Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN
Wed Jul 13 09:03:23 2016 [Z0][VM][I]: New state is ACTIVE
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/ceph/mv gamma:/var/lib/one//datastores/101/12 beta:/var/lib/one//datastores/101/12 12 101
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: mv: Moving gamma:/var/lib/one/datastores/101/12 to beta:/var/lib/one/datastores/101/12
Wed Jul 13 09:03:26 2016 [Z0][TM][E]: mv: Command "eval ssh gamma 'tar -C /var/lib/one/datastores/101 --sparse -cf - 12' | ssh beta 'tar -C /var/lib/one/datastores/101 --sparse -xf -'" failed: ssh: connect to host gamma port 22: Connection timed out
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: tar: This does not look like a tar archive
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: tar: Exiting with failure status due to previous errors
Wed Jul 13 09:03:26 2016 [Z0][TM][E]: Error copying disk directory to target host
Wed Jul 13 09:03:26 2016 [Z0][TM][I]: ExitCode: 2
Wed Jul 13 09:03:26 2016 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Wed Jul 13 09:03:26 2016 [Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN_FAILURE

So the VM ends up in the FAILURE state, and it cannot be recovered until the original host comes back up.
While the hosts are alive, both live and offline migrations work fine.

Could anyone please suggest a way to debug this further?
Thanks!

Hi,

It looks like the Ceph TM_MAD is using the ‘mv’ script from the SSH TM_MAD.

The script is trying to move files from the dead host to the other one, which is obviously not possible :)

Can you try editing /var/lib/one/remotes/tm/ceph/mv
and adding

[ lcm_state -eq 60 ] && exit 0

just before (around line 69)

log “Moving $SRC to $DST”

Kind Regards,
Anton Todorov

Anton,

thank you very much for your suggestion!

Now it works! The only thing is that it should look like this (I believe the parser ate the grave accent marks):

[ `lcm_state` -eq 60 ] && exit 0
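
For anyone who wants to see how this guard behaves before editing the driver, here is a minimal standalone sketch. It makes several assumptions: `lcm_state` is stubbed (in the real mv script it is presumably provided by the helper scripts the driver sources), `FAKE_LCM_STATE`, `skip_move_for_dead_host` and `move_vm_dir` are hypothetical names used only for illustration, and state 60 is taken to be the PROLOG_MIGRATE_UNKNOWN state seen in the VM log above. `$(lcm_state)` is the same command substitution as the backtick form.

```shell
#!/bin/bash
# Stub for illustration only: in the real tm/ceph/mv script, lcm_state is
# assumed to come from the sourced OpenNebula helper scripts.
lcm_state() {
    echo "${FAKE_LCM_STATE:-60}"
}

# Hypothetical wrapper around the guard from this thread.
# Succeeds (returns 0) when the VM is in LCM state 60.
skip_move_for_dead_host() {
    [ "$(lcm_state)" -eq 60 ]
}

# Hypothetical stand-in for the part of the mv script that copies the VM directory.
move_vm_dir() {
    local src="$1" dst="$2"
    if skip_move_for_dead_host; then
        # The source host is unreachable, so there is nothing to copy from it.
        echo "Not moving $src to $dst: source host is unreachable"
        return 0
    fi
    echo "Moving $src to $dst"
}

move_vm_dir "gamma:/var/lib/one/datastores/101/12" "beta:/var/lib/one/datastores/101/12"
```

In the actual script the guard is just the one-liner above with `exit 0`. That is safe here only because the disks live in Ceph, so the target host does not need anything copied from the failed host's datastore directory.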

Best regards,
Vladimir

Hi @heathen,

I’ve made a pull request for the change. It adds a log message, and the check is moved after the cleanup code for $DST_PATH on the destination host (a few lines below my first suggestion).

https://github.com/OpenNebula/one/pull/106

Kind Regards,
Anton Todorov
