OpenNebula 5.0 Beta - Host Hook

Hi,

My setup includes three physical servers: one FE + two nodes. LizardFS runs on these two nodes and is mounted as the datastores under /var/lib/one (the .ssh files are physically on each node, which is why I chose not to mount /var/lib/one itself).

New VM instantiation, live VM migration, etc. all look fine so far, but I'm stuck on host high availability.

I've enabled it in oned.conf as documented:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]

Then I shut down a host which runs a VM. The host went to the ERROR state as expected, but the VM on it stays on the same host in the UNKNOWN state; it is not migrated to the active host.

Here are the logs which may be related:
Wed Jun 8 18:12:14 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Wed Jun 8 18:12:14 2016 [Z0][InM][D]: Monitoring datastore nfs_images (102)
Wed Jun 8 18:12:14 2016 [Z0][InM][D]: Monitoring datastore nfs_system (103)
Wed Jun 8 18:12:15 2016 [Z0][ImM][D]: Datastore nfs_images (102) successfully monitored.
Wed Jun 8 18:12:15 2016 [Z0][VMM][D]: VM 22 successfully monitored: DISK_SIZE=[ID=0,SIZE=1417] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=1417] DISK_SIZE=[ID=1,SIZE=395] DISK_SIZE=[ID=2,SIZE=1]
Wed Jun 8 18:12:15 2016 [Z0][ImM][D]: Datastore nfs_system (103) successfully monitored.
Wed Jun 8 18:12:15 2016 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else exit 42; fi'
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: ssh: connect to host comp2 port 22: No route to host
Wed Jun 8 18:12:16 2016 [Z0][InM][I]: ExitCode: 255
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else exit 42; fi'
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: ssh: connect to host comp2 port 22: No route to host
Wed Jun 8 18:12:19 2016 [Z0][InM][I]: ExitCode: 255
Wed Jun 8 18:12:19 2016 [Z0][ONE][E]: Error monitoring Host comp2 (4): -
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7376 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7376 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>22<…"
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7872 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Wed Jun 8 18:12:26 2016 [Z0][ReM][D]: Req:7872 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>22<…"
Wed Jun 8 18:12:29 2016 [Z0][InM][D]: Host comp1 (3) successfully monitored.
Wed Jun 8 18:12:29 2016 [Z0][VMM][D]: VM 22 successfully monitored: STATE=a CPU=0.0 MEMORY=2097152

Any ideas?

Thanks,
Orhan

Hi

It seems that the hook is not being triggered. Could you check whether the VMs in
UNKNOWN state have the RESCHED flag set (onevm show -x)? Are there any messages
about the hook execution in oned.log?
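
For example, for VM 22 from the logs above, the flag can be checked with:

onevm show -x 22 | grep RESCHED

A value of 1 means a reschedule has been requested for the VM.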

Hi Ruben,

Thanks for the prompt reply. Yes, it is "RESCHED : No" as you guessed. So how can I change this for a VM? There is no edit option for this property in the VM details in Sunstone.

And how can I define it globally (or maybe in the template, but I couldn't find it in the template options)?

thanks,
Orhan

onevm resched <VM_ID>

But this should be triggered by the HA hook. You can do it manually for
now, while we check if there is any problem with the hook.

Cheers

BTW, note that you have -p 5. This means that by default the host needs to
be in the error state for 5 minutes before any action is taken. Could you check this?
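
To lower it, you can change the value in the same HOST_HOOK block in oned.conf (and restart oned afterwards), for example:

HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 1",
    remote    = "no" ]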

I had waited almost half an hour.
Anyway, I tried it with 1, but no change…
Only thing I’ve seen in the logs indicating an error is:
Thu Jun 9 11:13:32 2016 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2; else
I've checked the script on comp1 (the active node): /var/tmp/one/im/run_probes. In its description it says:
#Arguments: hypervisor(0) ds_location(1) collectd_port(2) host_id(3) hostname(4)
In the error line, the command was run as:
/var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 4 comp2
obviously hypervisor(0)=kvm
ds_location(1)=/var/lib/one//datastores
collectd_port(2)=4124 (?)
host_id(3)=4 (checked, that’s correct)
hostname(4)=comp2

But where does the "20" in the command come from? I'm not a developer, but it seems to me that the argument count doesn't match…

Please advise…

If the host is down, the error you are seeing in the logs is normal.
After 5 retries, it should put the VM in resched… I'll try to reproduce
the behavior of your installation.

Hi,

OpenNebula is repeatedly checking whether the host is back, and during this time the host is in state 5 (MONITORING_ERROR).

Can you try to edit /var/lib/one/remotes/hooks/ft/host_error.rb and replace:

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3

to

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5
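
For context, here is a minimal standalone sketch of that guard, assuming the Ruby OCA bindings shipped with the frontend (the client setup and argument handling are illustrative only, not the actual hook):

#!/usr/bin/env ruby
# Illustrative excerpt: look up the host passed as $ID and bail out
# unless it is still in a failed state.
require 'opennebula'
include OpenNebula

host = Host.new_with_id(ARGV[0].to_i, Client.new)
rc   = host.info
exit 1 if OpenNebula.is_error?(rc)

# Host states: 3 = ERROR, 5 = MONITORING_ERROR
# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5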

Here is how it is solved in our ft hook script:

Kind Regards,
Anton Todorov

I've restarted the FE and the nodes, and it seems to be working now. I don't know why it didn't work before (the service had been restarted).

Anyway, now I'm experiencing another problem. Both hosts are up and running, but the VM that was rescheduled by the hook cannot be migrated anymore, neither normally nor live. I've tried it in both states, RUNNING and POWEROFF, but no luck. There's just one error in the log: VirtualMachineMigrate result FAILURE [VirtualMachineMigrate] Migrate action is not available for state RUNNING … (or POWEROFF)

Then I deleted the VM and created a new one with the same persistent OS image. Migrating now works with this new VM.

Could you try to reproduce the behaviour…

Thanks,

Hi,

I’ve managed to reproduce the “Migrate action is not available for state RUNNING” on v5.0.0:

  1. run VMs on a host
  2. power off the host
  3. wait for the host_error hook to reschedule the VMs
  4. try to migrate (or migrate-live) a VM

Same result: "Migrate action is not available for state RUNNING".

Kind Regards,
Anton Todorov

Hi,

After applying the UNKNOWN-state patch proposed by Anton, I cannot reproduce this: I get the VMs in UNKNOWN and can migrate them after that.

Cheers