My setup includes three physical servers: one FE plus two nodes. LizardFS runs on these two nodes and is mounted as datastores under /var/lib/one (the .ssh files are physically on each node, which is why I chose not to mount /var/lib/one itself).
New VM instantiation, live VM migration, etc. all look fine so far, but I'm stuck on host high availability.
I've enabled it in oned.conf as documented:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]
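For context, my reading of those host_error.rb arguments (from the docs, so corrections welcome):

    $ID   - the ID of the host that entered the ERROR state
    -m    - migrate the VMs away to another host (shared storage only)
    -p 5  - wait 5 monitoring cycles before taking any action

(There are also -r to delete and recreate the VMs and -d to just delete them, which I'm not using.)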
Then I shut down a host that was running a VM. The host went to ERROR state as expected, but the VM stays on the same host in UNKNOWN state; it is not migrated to the active host.
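For the record, this is roughly how it looks in onehost list (columns trimmed, host IDs taken from the logs below):

    ID NAME   ... STAT
     3 comp1  ... on
     4 comp2  ... err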
Here are the logs that may be related:
Wed Jun 8 18:12:14 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Wed Jun 8 18:12:14 2016 [Z0][InM][D]: Monitoring datastore nfs_images (102)
Wed Jun 8 18:12:14 2016 [Z0][InM][D]: Monitoring datastore nfs_system (103)
Wed Jun 8 18:12:15 2016 [Z0][ImM][D]: Datastore nfs_images (102) successfully monitored.
Wed Jun 8 18:12:15 2016 [Z0][VMM][D]: VM 22 successfully monitored: DISK_SIZE=[ID=0,SIZE=1417] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=1417] DISK_SIZE=[ID=1,SIZE=395] DISK_SIZE=[ID=2,SIZE=1]
Wed Jun 8 18:12:15 2016 [Z0][ImM][D]: Datastore nfs_system (103) successfully monitored.
Wed Jun 8 18:12:15 2016 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else exit 42; fi'
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: ssh: connect to host comp2 port 22: No route to host
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: ExitCode: 255
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else exit 42; fi'
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: ssh: connect to host comp2 port 22: No route to host
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: ExitCode: 255
Wed Jun 8 18:12:19 2016 [Z0][ONE][E]: Error monitoring Host comp2 (4): -
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7376 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7376 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>22<…"
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7872 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7872 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>22<…"
Wed Jun 8 18:12:29 2016 [Z0][InM][D]: Host comp1 (3) successfully monitored.
Wed Jun 8 18:12:29 2016 [Z0][VMM][D]: VM 22 successfully monitored: STATE=a CPU=0.0 MEMORY=2097152
It seems that the hook is not triggered? Could you check whether the VMs in
UNKNOWN state have the RESCHED flag set (onevm show -x)? Are there any messages
about the hook execution in oned.log?
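For example, for VM 22 from your logs, something like:

    onevm show -x 22 | grep RESCHED

should print <RESCHED>1</RESCHED> if the flag is set (0 otherwise).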
Thanks for the prompt reply. Yes, it is "RESCHED : No", as you guessed. So how can I change this for a VM? There is no edit option for this property in the VM details in Sunstone.
And how can I define it globally (or maybe in the template, but I couldn't find it in the template options)?
BTW, note that you have -p 5. This means that by default the host needs to be in
ERROR state for 5 minutes before any action is taken. Could you check this?
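As for setting the flag manually: onevm resched <vmid> should set it (and onevm unresched <vmid> clears it again), although the hook is supposed to do this for you. To rule out the delay, you could also lower -p for testing, e.g.:

    arguments = "$ID -m -p 1",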
I had waited almost half an hour.
Anyway, I tried it with -p 1, but no change…
The only thing I've seen in the logs indicating an error is:
Thu Jun 9 11:13:32 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else
I've checked the script on comp1 (the active node): /var/tmp/one/im/run_probes. In its description it says: # Arguments: hypervisor(0) ds_location(1) collectd_port(2) host_id(3) hostname(4)
In the error line, the command was run as:
/var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2
Obviously:
hypervisor(0)    = kvm
ds_location(1)   = /var/lib/one//datastores
collectd_port(2) = 4124 (?)
host_id(3)       = 4 (checked, that's correct)
hostname(4)      = comp2
But where does the "20" in the command come from? I'm not a developer, but it seems to me the argument count doesn't match…
If the host is down, the error you are seeing in the logs is normal.
After 5 retries it should set the VM to resched… I'll try to reproduce
the behavior of your installation.
I've restarted the FE and the nodes, and it seems to be working now. I don't know why it didn't before (the service had been restarted).
Anyway, now I'm experiencing another problem. Both hosts are up and running, but the VM that was rescheduled (by the hook) cannot be migrated anymore, neither normal nor live. I've tried it in both states, RUNNING and POWEROFF, but no luck. There is just one error in the log:
VirtualMachineMigrate result FAILURE [VirtualMachineMigrate] Migrate action is not available for state RUNNING … (or POWER OFF)
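In case it helps, the CLI equivalents of what I tried would be something like this (VM 22 and the target host ID taken from the logs above):

    onevm migrate 22 3
    onevm migrate --live 22 3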
Then I deleted the VM and created a new one from the same persistent OS image; migration now works with this new VM.