Host Failure hooks for High Availability

Thanks for your reply )

I’ll enable fencing; in fact, I’ve just enabled it. I have Dell servers with iDRAC.
But what if, for example, node2 is in a different DC and the connection between node2 and oned is lost, including the fencing network in its separate VLAN?
Is that simply “bad architecture”, or is there some other “magic” to deal with a duplicate VM?
Sorry for the newbie questions.


I can’t imagine any magic here in the case of ordinary software. :slight_smile:

Ordinary OSes with ordinary filesystems are built to work with dedicated sets of data (dedicated drives). What would happen if you tried to write to the same HDD from two computers? Your data would be corrupted. If you recall the CAP theorem, ordinary software requires C, consistency. If you use special software built with such situations in mind, then it’s a different story. But I assume you are talking about ordinary OSes.

So an HA setup with an odd number of ONE controllers plus fencing is the only option that gives you a hope (not a guarantee) of preventing split-brain data corruption.
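As a side note, here is a tiny sketch of the majority arithmetic behind the “odd number” advice (plain Python, nothing OpenNebula-specific, just an illustration I’m adding):

```python
# Majority quorum: a network partition can leave at most one side with a
# strict majority, so at most one side keeps accepting writes.
def has_quorum(alive: int, total: int) -> bool:
    """True if `alive` members form a strict majority of `total`."""
    return alive > total // 2

# 3 controllers, 2/1 split: exactly one side keeps quorum.
assert has_quorum(2, 3) and not has_quorum(1, 3)

# 4 controllers, 2/2 split: neither side has quorum and the cluster stalls,
# which is why adding an even member buys no extra failure tolerance.
assert not has_quorum(2, 4)
```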


I’m not taking exotic software into account ), just CentOS, Ceph and ONE. I thought maybe there could be a mechanism to fence a zombie VM when it appears. :question:

Good to know, by the way. :handshake:

The problem is that if a zombie VM has appeared, it’s probably too late to do anything about it, because it will have connected to the Ceph backend long before you find it.

AFAIK ONE doesn’t have any mechanism for terminating unwanted zombie VMs. Of course, you can create a script that monitors the connection to the ONE front-end and kills all VMs (and/or shuts down the host), and put it on each host, for example (a rough sketch follows below). Or there may be other options as well. But every solution depends on the particular task and the particular hardware/software architecture. :slight_smile:

P.S. In my previous post I meant an odd number of hosts, of course!
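Something along these lines, as a minimal sketch of such a per-host watchdog. It assumes KVM hosts with the libvirt Python bindings installed, a front-end answering on oned’s default XML-RPC port 2633, and that hard-stopping every local VM on loss of the front-end is acceptable for your workloads; the hostname and thresholds are placeholders:

```python
#!/usr/bin/env python3
"""Rough per-host watchdog sketch: run it from cron or a systemd timer.

If the OpenNebula front-end stays unreachable for several consecutive
checks, hard-stop every local VM so a revived host cannot keep zombie
VMs writing to shared storage.
"""

import socket
import time

import libvirt  # provided by the libvirt-python package

FRONTEND = ("one-frontend.example.com", 2633)  # placeholder front-end address
CHECKS = 3      # consecutive failed checks before acting
INTERVAL = 10   # seconds between checks


def frontend_reachable() -> bool:
    """Return True if a TCP connection to oned's XML-RPC port succeeds."""
    try:
        with socket.create_connection(FRONTEND, timeout=5):
            return True
    except OSError:
        return False


def kill_local_vms() -> None:
    """Hard-stop (destroy) every running domain on this hypervisor."""
    conn = libvirt.open("qemu:///system")
    try:
        for dom in conn.listAllDomains():
            if dom.isActive():
                print(f"destroying {dom.name()}")
                dom.destroy()  # the libvirt equivalent of pulling the plug
    finally:
        conn.close()


def main() -> None:
    failures = 0
    while failures < CHECKS:
        if frontend_reachable():
            return  # front-end is fine, nothing to do
        failures += 1
        time.sleep(INTERVAL)
    kill_local_vms()  # one could also power the host off here


if __name__ == "__main__":
    main()
```

Destroying the domains only removes the zombies from the hypervisor; whether that is enough, or whether the host should also power itself off, depends on your storage and fencing setup.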

OK, understood. I was thinking roughly the same thing.
I’ve read about the Ceph exclusive-lock feature and thought it could help me protect an image with ONE host hooks.
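For what it’s worth, you can at least inspect who holds a lock on an image from Python with the Ceph rados/rbd bindings; with the exclusive-lock feature enabled, the current owner should show up among the lockers. This is only a read-only sketch, and the pool and image names are made-up placeholders, not necessarily what ONE uses in your datastore:

```python
#!/usr/bin/env python3
"""Read-only sketch: list the lock holders of an RBD image."""

import rados
import rbd

POOL = "one"            # assumed datastore pool name
IMAGE = "one-123-45-0"  # hypothetical image name

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        with rbd.Image(ioctx, IMAGE, read_only=True) as image:
            # list_lockers() describes the clients holding a lock on the
            # image (tag, exclusive flag, and client/cookie/address entries).
            print(image.list_lockers())
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```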

ONE uses it, but it unlocks images on VM start.

Imagine your host failed and you use hooks to restart VMs automatically on another host. If ONE is not able to unlock the VM’s image, it will not be able to start the VM again.

I’m assuming the following scenario: the host fails, ONE runs the host hook and unlocks the image, and the VM migrates to another host. ONE then locks the image again to protect it from being corrupted by a revived zombie )
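For reference, the host failure hook itself is declared in oned.conf. The fragment below follows the commented example shipped with the 4.x/5.x series (newer releases moved this into a separate hook subsystem), and the flags are from memory, so verify them against your installed file before enabling:

```
# Fragment of /etc/one/oned.conf (4.x/5.x hook syntax).
# ft/host_error.rb flags as I remember them: -m reschedules the host's
# VMs elsewhere, -p N skips the action if the host comes back within
# N monitoring cycles. Double-check against your version.
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]
```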

I think this is because the host/node is down or rebooting, so the host/node is not reachable over SSH.

Hi,

do I have to set up Corosync and Pacemaker to get HA on host failures?

FYI, at this time I have only set up OpenNebula with 5 nodes and GlusterFS shared storage.