OpenNebula 5.0 Beta - Host Hook

Hi,

My setup includes three physical servers: one FE + two nodes. LizardFS runs on these two nodes and is mounted as the datastores under /var/lib/one (the .ssh files are physically on each node, which is why I chose not to mount /var/lib/one itself).

New VM instantiation, live VM migration, etc. all look fine so far, but I'm stuck on host high availability.

I've enabled it in oned.conf as documented:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]

Then I shut down a host which runs a VM. The host went to the ERROR state as expected, but the VM on it stays on the same host in the UNKNOWN state; it is not migrated to the active host.

Here are the logs which may be related:
Wed Jun 8 18:12:14 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Wed Jun 8 18:12:14 2016 [Z0][InM][D]: Monitoring datastore nfs_images (102)
Wed Jun 8 18:12:14 2016 [Z0][InM][D]: Monitoring datastore nfs_system (103)
Wed Jun 8 18:12:15 2016 [Z0][ImM][D]: Datastore nfs_images (102) successfully monitored.
Wed Jun 8 18:12:15 2016 [Z0][VMM][D]: VM 22 successfully monitored: DISK_SIZE=[ID=0,SIZE=1417] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=1417] DISK_SIZE=[ID=1,SIZE=395] DISK_SIZE=[ID=2,SIZE=1]
Wed Jun 8 18:12:15 2016 [Z0][ImM][D]: Datastore nfs_system (103) successfully monitored.
Wed Jun 8 18:12:15 2016 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else exit 42; fi'
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: ssh: connect to host comp2 port 22: No route to host
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: ExitCode: 255
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else exit 42; fi'
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: ssh: connect to host comp2 port 22: No route to host
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: ExitCode: 255
Wed Jun 8 18:12:19 2016 [Z0][ONE][E]: Error monitoring Host comp2 (4): -
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7376 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7376 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>22<…"
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7872 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7872 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>22<…"
Wed Jun 8 18:12:29 2016 [Z0][InM][D]: Host comp1 (3) successfully monitored.
Wed Jun 8 18:12:29 2016 [Z0][VMM][D]: VM 22 successfully monitored: STATE=a CPU=0.0 MEMORY=2097152

Any ideas?

Thanks,
Orhan

Hi

It seems that the hook is not being triggered. Could you check whether the VMs in
UNKNOWN state have the RESCHED flag set (onevm show -x)? Are there any messages
about the hook execution in oned.log?
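
For example, for VM 22 from the logs above, the flag can be checked with:

onevm show -x 22 | grep RESCHED

A value of 1 means a reschedule has been requested for the VM.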

Hi Ruben,

Thanks for the prompt reply. Yes, it is "RESCHED : No" as you guessed. So how can I change this for a VM? There is no edit option for this property in the VM details in Sunstone.

And how can I define it globally (or maybe in the template, but I couldn't find it in the template options)?

thanks,
Orhan

onevm resched <VM_ID>

But this should be triggered by the HA hook. You can do it manually for
now, while we check if there is any problem with the hook.

Cheers

BTW, note that you have -p 5. This means that by default the host needs to
be in the error state for 5 minutes before any action is taken. Could you check this?
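
To lower it, you can change the value in the same HOST_HOOK block in oned.conf (and restart oned afterwards), for example:

HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 1",
    remote    = "no" ]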

I had waited almost half an hour.
Anyway, I tried it with 1, but no change…
Only thing I’ve seen in the logs indicating an error is:
Thu Jun 9 11:13:32 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else
I've checked the script on comp1 (the active node): /var/tmp/one/im/run_probes. In its description it says:
#Arguments: hypervisor(0) ds_location(1) collectd_port(2) host_id(3) hostname(4)
In the error line, the command was run as:
/var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2
obviously hypervisor(0)=kvm
ds_location(1)=/var/lib/one//datastores
collectd_port(2)=4124 (?)
host_id(3)=4 (checked, that’s correct)
hostname(4)=comp2

But where does the "20" in the command come from? I'm not a developer, but it seems to me that the argument count doesn't match…

Please advise…

If the host is down, the error you are seeing in the logs is normal.
After 5 retries, it should put the VM in resched… I'll try to reproduce
the behavior of your installation.

Hi,

OpenNebula is repeatedly checking whether the host is back, and during this time the host is in state 5 (MONITORING_ERROR).

Can you try to edit /var/lib/one/remotes/hooks/ft/host_error.rb and replace:

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3

to

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5
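
For context, here is a minimal standalone sketch of that guard, assuming the Ruby OCA bindings shipped with the frontend (the client setup and argument handling are illustrative only, not the actual hook):

#!/usr/bin/env ruby
# Illustrative excerpt: look up the host passed as $ID and bail out
# unless it is still in a failed state.
require 'opennebula'
include OpenNebula

host = Host.new_with_id(ARGV[0].to_i, Client.new)
rc   = host.info
exit 1 if OpenNebula.is_error?(rc)

# Host states: 3 = ERROR, 5 = MONITORING_ERROR
# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5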

Here is how it is solved in our ft hook script:

Kind Regards,
Anton Todorov

I've restarted the FE and the nodes, and it seems to be working now. I don't know why it didn't work before (the service had been restarted).

Anyway, now I'm experiencing another problem. Both hosts are up and running, but the VM that was rescheduled by the hook cannot be migrated anymore, neither normally nor live. I've tried it in both states, RUNNING and POWEROFF, but no luck. There's just one error in the log: VirtualMachineMigrate result FAILURE [VirtualMachineMigrate] Migrate action is not available for state RUNNING … (or POWEROFF)

Then I deleted the VM and created a new one with the same persistent OS image. Migrating now works with this new VM.

Could you try to reproduce the behaviour…

Thanks,

Hi,

I’ve managed to reproduce the “Migrate action is not available for state RUNNING” on v5.0.0:

  1. run VMs on a host
  2. power off the host
  3. wait for the host_error hook to reschedule the VMs
  4. try to migrate (or migrate-live) a VM

Same result: "Migrate action is not available for state RUNNING".

Kind Regards,
Anton Todorov

Hi,

After applying the UNKNOWN-state patch proposed by Anton, I cannot reproduce this: I get the VMs in UNKNOWN and can migrate them after that.

Cheers