Host Failure hooks for High Availability

The host failure hooks in oned.conf seem to only support a “delete/re-create” of VMs on a failed hypervisor. Is this intentional? We had a hypervisor fall over and the HA hooks kicked in, deploying the VMs on alternate hypervisors, BUT those VMs were reverted to their original deployment state, with no data on the disks from the VMs’ previous runtime.

Are there other hooks to use, which will just boot the VMs on an alternate hypervisor during host failure? We’re using shared storage for all hypervisors. I need an HA solution that does not delete the VM, so that data can be recovered.

Thanks,

-Robert

Hi,

What version are you using? The fault tolerance hook included in oned.conf has a -m option to do exactly that: migrate VMs to a new host when a host reaches the failure state. This migration keeps the current disks if the storage is shared.

#*******************************************************************************
# Fault Tolerance Hooks
#*******************************************************************************
# This hook is used to perform recovery actions when a host fails.
# Script to implement host failure tolerance
#   It can be set to
#           -m migrate VMs to another host. Only for images in shared storage
#           -r recreate VMs running in the host. State will be lost.
#           -d delete VMs running in the host
#   Additional flags
#           -f force resubmission of suspended VMs
#           -p <n> avoid resubmission if host comes
#                  back after n monitoring cycles
#*******************************************************************************

HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]
#-------------------------------------------------------------------------------

What are the steps to achieve host failure HA in OpenNebula 5.2? I have tried shutting a host down so it goes into the ERROR state, but it’s not working. The VM is still in the UNKNOWN state and is not migrating to the second host in the cluster. How do I achieve this?

And don’t forget to add proper fencing to that host_error hook!!!
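For illustration only, here is a minimal fence_host.sh sketch. It assumes IPMI-capable nodes, that the hook passes the failed host’s template to the script on stdin, a “<host>-ipmi” BMC naming scheme and placeholder credentials; all of these are assumptions, so check the host_error.rb shipped with your installation before adapting anything like this:

#!/bin/bash
# Illustrative fencing sketch only -- verify the input that host_error.rb
# actually feeds this script on your version before using it.

HOST_INFO=$(cat)                                  # assumed: host template arrives on stdin
HOSTNAME=$(echo "$HOST_INFO" | grep -oP '(?<=<NAME>)[^<]+' | head -1)

IPMI_USER="admin"                                 # hypothetical BMC credentials
IPMI_PASS="secret"
BMC="${HOSTNAME}-ipmi"                            # hypothetical BMC naming scheme

# Power the node off so it cannot keep writing to shared storage after its
# VMs have been restarted elsewhere. A non-zero exit code makes the hook abort.
ipmitool -I lanplus -H "$BMC" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power off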

Hi,

Try using the -u flag:

HOST_HOOK = [
    NAME      = "error",
    ON        = "ERROR",
    COMMAND   = "ft/host_error.rb",
    ARGUMENTS = "$ID -m -p 1 -u",
    REMOTE    = "no" ]

Hi,

I tried all the steps shown in the video above, but it’s not working.
These are the steps I have followed to achieve host failover HA:

  1. Uncommented the hooks configuration in the oned.conf file
  2. Restarted the Sunstone service on the front end
  3. Host1 with vm1
  4. Host2 with vm2, vm3, vm4
  5. Manually connected to Host1 over SSH and issued the reboot command
  6. Host1 changed to the ERROR state
  7. vm1 changed to the UNKNOWN state
  8. But vm1 is not migrating to the second available host, i.e. Host2

This is the problem I have. How do I solve it? Where do I look for the hook log files? I have checked the oned.log file, but there is no line in it saying a hook was triggered or launched. The hooks are not working properly.

Please reply fast

Thanks,
ranjith

Do you have passwordless SSH authentication between the hosts for the oneadmin user?
What does the log say?
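For reference, a typical way to set this up as the oneadmin user on the front-end (node1 and node2 are placeholder hostnames):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # only if oneadmin has no key yet
ssh-copy-id oneadmin@node1
ssh-copy-id oneadmin@node2

ssh oneadmin@node1 hostname                 # must work without a password prompt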

Hi ranjith,

When changing /etc/one/oned.conf you must restart the oned daemon with systemctl restart opennebula (or the equivalent for your OS). Sunstone is nothing more than a fancy GUI that communicates with oned via RPC, in almost the same way as the shell tools do.

Then you should become familiar with the arguments that you have passed to the HOST_ERROR hook:

  • If you are confident that the host will not come back before the VM migration is complete, you can add the ‘-u’ argument to disable the call to the fencing script.
  • You must take into account the fault tolerance window, i.e. the ‘-p’ argument in the HOST_HOOK section. If the argument is not 0, the hook waits that many MONITORING_INTERVAL periods before starting the VM migration. The default when not set is claimed to be 2, but in the example it is 5. Most probably you have not tweaked MONITORING_INTERVAL in /etc/one/oned.conf (the default is 60 seconds), so if you are using the example HOST_HOOK without changes you have a fault tolerance window of 5 * 60 = 300 seconds (5 minutes!). If your host reboots faster, you must halt it instead of rebooting, or change the configuration depending on your needs, as in the sketch below.
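For example, an illustrative oned.conf combination (the values are just an example, not a recommendation) that shrinks the fault tolerance window to 2 * 30 = 60 seconds and also disables fencing with ‘-u’ as mentioned above:

MONITORING_INTERVAL = 30

HOST_HOOK = [
    NAME      = "error",
    ON        = "ERROR",
    COMMAND   = "ft/host_error.rb",
    ARGUMENTS = "$ID -m -p 2 -u",
    REMOTE    = "no" ]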

Hope this helps.

Kind Regards,
Anton Todorov


Yes, I have set up passwordless authentication between the hosts for the oneadmin user. Nothing in the logs; no lines matching “hook”.

Which cluster does OpenNebula use for host failure HA? How can I select the cluster that does the HA? Is it necessary for the image and system datastores to use a shared-mode storage backend for host failure HA? If so, how can I make the default cluster use such a shared-mode datastore to achieve host failure HA?

Have you used any distributed FS to achieve this host failover HA? If yes, which one did you use? If not, then I have followed all the steps from your video and it’s still not working. In which log can I look for the “Hook launched” and “Hook finished” messages? I looked at the oned.log file and there is nothing in it about hooks. Is it necessary for the image and system datastores in the default cluster to use a shared-mode storage backend to achieve host failure HA? If yes, how do I configure that?

This is the error I got in the oned.log file:
Mon Feb 20 11:05:36 2017 [Z0][HKM][D]: Message received: LOG I 3 Command execution fail: /var/lib/one/remotes//hooks/ft/host_error.rb 3 -m -p 1

Mon Feb 20 11:07:17 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 3 OpennebulaND; else exit 42; fi'
Mon Feb 20 11:07:17 2017 [Z0][InM][I]: ssh: connect to host opennebuland port 22: No route to host
Mon Feb 20 11:07:17 2017 [Z0][InM][I]: ExitCode: 255
Mon Feb 20 11:07:20 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 3 OpennebulaND; else exit 42; fi'
Mon Feb 20 11:07:20 2017 [Z0][InM][I]: ssh: connect to host opennebuland port 22: No route to host
Mon Feb 20 11:07:20 2017 [Z0][InM][I]: ExitCode: 255

Attributes error:

Mon Feb 20 09:50:29 2017 : Error deploying virtual machine: Error creating directory /var/lib/one/datastores/0/61 at OpennebulaND: ssh: connect to host opennebuland port 22: No route to host

Yes, I use Ceph RBD for the image and system datastores.
You should see host_error.log when the hook is working.
So yes, you can use shared mode:
http://docs.opennebula.org/5.2/deployment/open_cloud_storage_setup/fs_ds.html#shared-qcow2-transfer-modes
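If you are not sure whether your datastores really use a shared transfer driver, you can check the TM_MAD attribute from the front-end (datastore IDs 0 and 1 are the defaults):

onedatastore list
onedatastore show 0 | grep TM_MAD    # system datastore, expect something like TM_MAD="shared"
onedatastore show 1 | grep TM_MAD    # default image datastore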

Check your network settings. If you have changed the SSH port, set it in ~/.ssh/config:

Host example.com
Port 1234

host_error.log:

[2017-02-20 12:05:06 +0530][HOST 3][I] Hook launched
[2017-02-20 12:05:06 +0530][HOST 3][I] hostname: OpennebulaND
[2017-02-20 12:05:06 +0530][HOST 3][I] Wait 1 cycles.
[2017-02-20 12:05:06 +0530][HOST 3][I] Sleeping 60 seconds.
[2017-02-20 12:06:06 +0530][HOST 3][I] Fencing enabled
[2017-02-20 12:06:06 +0530][HOST 3][E]
[2017-02-20 12:06:06 +0530][HOST 3][E] Fencing error
[2017-02-20 12:06:06 +0530][HOST 3][E] Exiting due to previous error.

There is no file like ~/.ssh/config at that location?


You need to create it.

The host hook worked. Try using the -u argument:
http://docs.opennebula.org/5.2/advanced_components/ha/ftguide.html#host-failures

Even after changing the host hook arguments to -u, the same fencing error occurs and the VM goes into the UNKNOWN state without migrating. I didn’t use any distributed FS like Ceph RBD. I haven’t changed the SSH port; it is the installation default.

You must have shared storage (Ceph, NFS, etc.) between your hosts.
The UNKNOWN state is correct, because the host is in the ERROR state; when the hook executes, the VM changes state as it is rescheduled to another host.
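A rough way to follow this from the front-end while the hook runs (the log path below is an assumption; use wherever your host_error.log lives, as shown earlier in the thread):

onehost list                          # the failed host should show ERROR
onevm list                            # the VM shows UNKNOWN until the hook reschedules it
tail -f /var/log/one/host_error.log   # hook progress, assumed default log location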


OK, thanks. I will try again after installing shared storage. Which one is the easiest to use, or which one is best? Where is the fence_host.sh file located?
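One way to look for it on the front-end, assuming the remotes directory shown in the oned.log output above (/var/lib/one/remotes):

find /var/lib/one/remotes -name 'fence_host.sh' 2>/dev/null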

Hello guys!
Let’s assume the following situation:
node2, which is running a VM, just “goes away” because of network problems. The fencing mechanism is disabled. The VM is successfully migrated to node1. Then node2 suddenly reappears, still running the VM. In my configuration with Ceph storage this situation led to file system corruption in the migrated VM now running on node1.
What options do I have to deal with this situation?

Enable fencing, probably.