Question regarding fault tolerance on VMs

Hi all,

I’ve set up an OpenNebula 5.10.1 environment consisting of one controller and two satellites. The datastore is backed by LINSTOR/DRBD 9. Live migration works fine, and I’m quite happy with it.

As a next step, I’d like to set up fault tolerance that migrates VMs from a failed satellite to a running one.
As it’s only a test environment, I don’t use fencing at all.
This is the hook config I added to OpenNebula using ‘onehook create’:

ARGUMENTS = "$TEMPLATE -m -p 2 -u"
COMMAND   = "ft/host_error.rb"
NAME      = "host_error"
REMOTE    = "no"
TYPE      = state

When I shut down one of the satellite nodes, the test VM gets migrated to a running node after a while. Yay.

Still I’ve got three open questions regarding the fault-tolerance subject:

(1) Is there an easy/built-in way to tag a VM to enable fault tolerance? I only want specific VMs to respawn on the other host in case of an error.
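
To make the idea concrete, here is a rough sketch of the kind of check I have in mind. FT_ENABLED is a made-up user template attribute, not something OpenNebula defines, and ft_enabled is a hypothetical helper. I’m assuming that user attributes show up in the VM XML, possibly wrapped in CDATA.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads a VM's XML on stdin and succeeds (exit 0) only
# if the made-up user attribute FT_ENABLED is set to "yes" (case-insensitive).
# Assumption: the value may or may not be wrapped in a CDATA section.
ft_enabled() {
    grep -Eqi '<FT_ENABLED>(<!\[CDATA\[)?yes(\]\]>)?</FT_ENABLED>'
}
```

Something like `onevm show 34 --xml | ft_enabled` could then gate the migration, but I don’t know whether host_error.rb offers a supported way to do this.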

(2) This one is fairly critical for me: how do I handle a connection error between the controller and the satellites? In my (later) setup the controller node will not be in the same datacenter as the satellites. In the worst case, the controller might lose its connection to both satellites simultaneously.

That’s my test so far:

$ onehost list
  ID NAME         CLUSTER   TVM ALLOCATED_CPU     ALLOCATED_MEM       STAT
   1 satelliteB   default     0   0 / 3200 (0%)     0K / 188.7G (0%)  on
   0 satelliteA   default     1 200 / 3200 (6%)   1.5G / 188.7G (0%)  on

$ onevm list
  34 oneadmin oneadmin testvm  runn  2.0  240.6M satelliteA  0d 03h43

$ ip route add blackhole $satelliteA_IP ; ip route add blackhole $satelliteB_IP

$ sleep 300

$ onehook log --hook-id 0
    0     1     02/13 13:55     0 SUCCESS
    0     2     02/13 13:57     0 SUCCESS

The hook log shows that OpenNebula tries to migrate the VM from satelliteA to satelliteB at 13:55. I don’t know why it returns “SUCCESS”, since no satellite can be reached and the VM is still running on satelliteA. Two minutes later it tries to migrate the VM back from satelliteB to satelliteA, which obviously has no effect other than another “SUCCESS” message. That’s kinda weird.
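
What I would expect instead is some kind of guard: if the controller cannot reach any satellite at all, it is probably the controller that is cut off, so nothing should be migrated. A minimal sketch of that logic, assuming plain ICMP reachability is a good enough signal (count_reachable and should_failover are hypothetical names, not part of OpenNebula):

```bash
#!/usr/bin/env bash
# Hypothetical guard, not an OpenNebula feature.

# count_reachable: pings each given host once (2 s timeout, Linux iputils
# syntax) and prints how many of them answered.
count_reachable() {
    local n=0 host
    for host in "$@"; do
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# should_failover: succeeds only if at least one satellite is still
# reachable; zero reachable satellites suggests the controller is isolated.
should_failover() {
    local reachable=$1
    [ "$reachable" -ge 1 ]
}
```

In my test above, `should_failover "$(count_reachable "$satelliteA_IP" "$satelliteB_IP")"` would fail while both blackhole routes are in place, so the hook would do nothing.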

(3) How can I prevent failover if a satellite is (for whatever reason) unreachable, but the VM on that satellite is still seen as “running” by OpenNebula? For example, the management interface of that satellite could be down while the VM uses another NIC that is still up.

Thanks :)