Host Failure hooks for High Availability

The host failure hooks in oned.conf seem to only support a “delete/re-create” of VMs on a failed hypervisor. Is this intentional? We had a hypervisor fall over and the HA hooks kicked in, deploying the VMs on alternate hypervisors, BUT those VMs were reverted to their original deployment state, with no data on the disks from the VMs’ previous runtime.

Are there other hooks to use, which will just boot the VMs on an alternate hypervisor during host failure? We’re using shared storage for all hypervisors. I need an HA solution that does not delete the VM, so that data can be recovered.

Thanks,

-Robert

Hi,

What version are you using? The fault tolerance hook included in oned.conf has a -m option to do exactly that: migrate VMs to a new host when a host reaches the failure state. This migration keeps the current disks if the storage is shared.

#*******************************************************************************
# Fault Tolerance Hooks
#*******************************************************************************
# This hook is used to perform recovery actions when a host fails.
# Script to implement host failure tolerance
#   It can be set to
#           -m migrate VMs to another host. Only for images in shared storage
#           -r recreate VMs running in the host. State will be lost.
#           -d delete VMs running in the host
#   Additional flags
#           -f force resubmission of suspended VMs
#           -p <n> avoid resubmission if host comes
#                  back after n monitoring cycles
#*******************************************************************************

HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]
#-------------------------------------------------------------------------------

What are the steps to achieve host failure HA in OpenNebula 5.2? I have tried shutting a host down so it goes into the ERROR state, but it’s not working. The VM is still in the UNKNOWN state and is not migrating to the second host in the cluster. How do I achieve this?

And don’t forget to add proper fencing to that host_error hook!!!
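For illustration only, here is a minimal fence_host.sh sketch. It assumes IPMI-capable nodes, that the hook passes the failed host’s template to the script on stdin, a “<host>-ipmi” BMC naming scheme and placeholder credentials; all of these are assumptions, so check the host_error.rb shipped with your installation before adapting anything like this:

#!/bin/bash
# Illustrative fencing sketch only -- verify the input that host_error.rb
# actually feeds this script on your version before using it.

HOST_INFO=$(cat)                                  # assumed: host template arrives on stdin
HOSTNAME=$(echo "$HOST_INFO" | grep -oP '(?<=<NAME>)[^<]+' | head -1)

IPMI_USER="admin"                                 # hypothetical BMC credentials
IPMI_PASS="secret"
BMC="${HOSTNAME}-ipmi"                            # hypothetical BMC naming scheme

# Power the node off so it cannot keep writing to shared storage after its
# VMs have been restarted elsewhere. A non-zero exit code makes the hook abort.
ipmitool -I lanplus -H "$BMC" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power off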

Hi,

Try using the -u flag:

HOST_HOOK = [
    NAME      = "error",
    ON        = "ERROR",
    COMMAND   = "ft/host_error.rb",
    ARGUMENTS = "$ID -m -p 1 -u",
    REMOTE    = "no" ]

Hi,

I tried all the steps shown in the video above, but it’s not working.
These are the steps I have followed to achieve host failover HA:

  1. Uncommented the hooks configuration in the oned.conf file
  2. Restarted the Sunstone service on the front end
  3. Host1 with vm1
  4. Host2 with vm2, vm3, vm4
  5. Manually connected to Host1 over SSH and issued the reboot command
  6. Host1 changed to the ERROR state
  7. vm1 changed to the UNKNOWN state
  8. But vm1 is not migrating to the second available host, i.e. Host2

This is the problem I have. How do I solve it? Where do I look for the hook log files? I have checked the oned.log file, but there is no line in it saying a hook was triggered or launched. The hooks are not working properly.

Please reply fast

Thanks,
ranjith

Do you have passwordless SSH authentication between the hosts for the oneadmin user?
What does the log say?
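For reference, a typical way to set this up as the oneadmin user on the front-end (node1 and node2 are placeholder hostnames):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # only if oneadmin has no key yet
ssh-copy-id oneadmin@node1
ssh-copy-id oneadmin@node2

ssh oneadmin@node1 hostname                 # must work without a password prompt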

Hi ranjith,

When changing /etc/one/oned.conf you must restart the oned daemon with systemctl restart opennebula (or the equivalent for your OS). Sunstone is nothing more than a fancy GUI that communicates with oned via RPC, in almost the same way as the shell tools do.

Then you should become familiar with the arguments that you have passed to the HOST_ERROR hook:

  • If you are confident that the host will not come back before the VM migration is complete, you can add the ‘-u’ argument to disable the call to the fencing script.
  • You must take into account the fault tolerance window, i.e. the ‘-p’ argument in the HOST_HOOK section. If the argument is not 0, the hook waits that many MONITORING_INTERVAL periods before starting the VM migration. The default when not set is claimed to be 2, but in the example it is 5. Most probably you have not tweaked MONITORING_INTERVAL in /etc/one/oned.conf (the default is 60 seconds), so if you are using the example HOST_HOOK without changes you have a fault tolerance window of 5 * 60 = 300 seconds (5 minutes!). If your host reboots faster, you must halt it instead of rebooting, or change the configuration depending on your needs, as in the sketch below.
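For example, an illustrative oned.conf combination (the values are just an example, not a recommendation) that shrinks the fault tolerance window to 2 * 30 = 60 seconds and also disables fencing with ‘-u’ as mentioned above:

MONITORING_INTERVAL = 30

HOST_HOOK = [
    NAME      = "error",
    ON        = "ERROR",
    COMMAND   = "ft/host_error.rb",
    ARGUMENTS = "$ID -m -p 2 -u",
    REMOTE    = "no" ]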

Hope this helps.

Kind Regards,
Anton Todorov


Yes, I have set up passwordless authentication between the hosts for the oneadmin user. Nothing in the logs; no lines matching “hook”.

Which cluster does OpenNebula use for host failure HA? How can I select the cluster that does the HA? Is it necessary for the image and system datastores to use a shared-mode storage backend for host failure HA? If so, how can I make the default cluster use such a shared-mode datastore to achieve host failure HA?

Have you used any distributed FS to achieve this host failover HA? If yes, which one did you use? If not, then I have followed all the steps from your video and it’s still not working. In which log can I look for the “Hook launched” and “Hook finished” messages? I looked at the oned.log file and there is nothing in it about hooks. Is it necessary for the image and system datastores in the default cluster to use a shared-mode storage backend to achieve host failure HA? If yes, how do I configure that?

This is the error I got in the oned.log file:
Mon Feb 20 11:05:36 2017 [Z0][HKM][D]: Message received: LOG I 3 Command execution fail: /var/lib/one/remotes//hooks/ft/host_error.rb 3 -m -p 1

Mon Feb 20 11:07:17 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 3 OpennebulaND; else exit 42; fi'
Mon Feb 20 11:07:17 2017 [Z0][InM][I]: ssh: connect to host opennebuland port 22: No route to host
Mon Feb 20 11:07:17 2017 [Z0][InM][I]: ExitCode: 255
Mon Feb 20 11:07:20 2017 [Z0][InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 3 OpennebulaND; else exit 42; fi'
Mon Feb 20 11:07:20 2017 [Z0][InM][I]: ssh: connect to host opennebuland port 22: No route to host
Mon Feb 20 11:07:20 2017 [Z0][InM][I]: ExitCode: 255

Attributes error:

Mon Feb 20 09:50:29 2017 : Error deploying virtual machine: Error creating directory /var/lib/one/datastores/0/61 at OpennebulaND: ssh: connect to host opennebuland port 22: No route to host

Yes, I use Ceph RBD for the image and system datastores.
You should see host_error.log when the hook is working.
So yes, you can use shared mode:
http://docs.opennebula.org/5.2/deployment/open_cloud_storage_setup/fs_ds.html#shared-qcow2-transfer-modes
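If you are not sure whether your datastores really use a shared transfer driver, you can check the TM_MAD attribute from the front-end (datastore IDs 0 and 1 are the defaults):

onedatastore list
onedatastore show 0 | grep TM_MAD    # system datastore, expect something like TM_MAD="shared"
onedatastore show 1 | grep TM_MAD    # default image datastore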

Check your network settings. If you have changed the SSH port, set it in ~/.ssh/config:

Host example.com
Port 1234

host_error.log:

[2017-02-20 12:05:06 +0530][HOST 3][I] Hook launched
[2017-02-20 12:05:06 +0530][HOST 3][I] hostname: OpennebulaND
[2017-02-20 12:05:06 +0530][HOST 3][I] Wait 1 cycles.
[2017-02-20 12:05:06 +0530][HOST 3][I] Sleeping 60 seconds.
[2017-02-20 12:06:06 +0530][HOST 3][I] Fencing enabled
[2017-02-20 12:06:06 +0530][HOST 3][E]
[2017-02-20 12:06:06 +0530][HOST 3][E] Fencing error
[2017-02-20 12:06:06 +0530][HOST 3][E] Exiting due to previous error.

There is no file like ~/.ssh/config at that location?


You need to create it.

The host hook worked. Try using the -u argument:
http://docs.opennebula.org/5.2/advanced_components/ha/ftguide.html#host-failures

Even after changing the host hook arguments to -u, the same fencing error occurs and the VM goes into the UNKNOWN state without migrating. I didn’t use any distributed FS like Ceph RBD. I haven’t changed the SSH port; it is the installation default.

You must have shared storage (Ceph, NFS, etc.) between your hosts.
The UNKNOWN state is correct, because the host is in the ERROR state; when the hook executes, the VM changes state as it is rescheduled to another host.
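A rough way to follow this from the front-end while the hook runs (the log path below is an assumption; use wherever your host_error.log lives, as shown earlier in the thread):

onehost list                          # the failed host should show ERROR
onevm list                            # the VM shows UNKNOWN until the hook reschedules it
tail -f /var/log/one/host_error.log   # hook progress, assumed default log location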


OK, thanks. I will try again after installing shared storage. Which one is the easiest to use, or which one is best? Where is the fence_host.sh file located?
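One way to look for it on the front-end, assuming the remotes directory shown in the oned.log output above (/var/lib/one/remotes):

find /var/lib/one/remotes -name 'fence_host.sh' 2>/dev/null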

Hello guys!
Let’s assume the following situation:
node2, which is running a VM, just “goes away” because of network problems. The fencing mechanism is disabled. The VM is successfully migrated to node1. Then node2 suddenly reappears, still running the VM. In my configuration with Ceph storage this situation led to file system corruption in the migrated VM now running on node1.
What options do I have to deal with this situation?

Enable fencing, probably.