VM HA only works intermittently

Hi guys, I’m trying to set up VM HA, but I can’t get it to work reliably.

I have the following hyperconverged setup:
Node1: Bare-metal Ubuntu 20.04 (HP Proliant DL350 g10, 1 CPU and 16GB RAM), which runs the following services:

  • Shared datastore with the other nodes (a distributed and replicated volume).
  • KVM host (installed via opennebula-node package).
  • LXC host (the package which comes with Ubuntu). An Ubuntu 20.04 container serves as orchestrator1; this container runs the OpenNebula FrontEnd (Sunstone) and also acts as the orchestrator for the hypervisor running on the bare-metal server.
  • All OpenNebula 6.2 packages.
  • IP: 192.168.10.101
  • Hostname: ON-N1-C1

Node2: Bare-metal Ubuntu 20.04 (HP Proliant DL350 g10, 1 CPU and 16GB RAM), which runs the following services:

  • Shared datastore (a distributed and replicated volume).
  • KVM host (installed via opennebula-node package).
  • LXC host (the package which comes with Ubuntu). An Ubuntu 20.04 container serves as orchestrator2; this container runs the OpenNebula FrontEnd (Sunstone) and also acts as the orchestrator for the hypervisor running on the bare-metal server.
  • All OpenNebula 6.2 packages.
  • IP: 192.168.10.102
  • Hostname: ON-N1-C2

Node3: Bare-metal Ubuntu 20.04 (HP Proliant DL350 g10, 1 CPU and 16GB RAM), which runs the following services:

  • Shared datastore (a distributed and replicated volume).
  • KVM host (installed via opennebula-node package).
  • LXC host (the package which comes with Ubuntu). An Ubuntu 20.04 container serves as orchestrator3; this container runs the OpenNebula FrontEnd (Sunstone) and also acts as the orchestrator for the hypervisor running on the bare-metal server.
  • All OpenNebula 6.2 packages.
  • IP: 192.168.10.101
  • Hostname: ON-N1-C1.

I have configured the FrontEnd HA and it works flawlessly. I’m able to create VMs and run/configure services on them.


So I recently read about VM HA and gave it a try. I followed the guides, configured HA and the hooks (with the ARGUMENTS = "$TEMPLATE -m -p 1 -u" options), and then ran some tests to verify whether VM HA works correctly. I ran the following tests (manual power-off means I turn the node off and wait more than 10 minutes):

Test #1
Leader: orchestrator2

  • VM is on N3
  • Manual power-off of N3
  • Expecting migration to N1 or N2
  • VM entered the UNKNOWN state and was never migrated, even though the hook executed “successfully”

Test #2
Leader: orchestrator2

  • VM is on N1
  • Manual power-off of N1
  • Expecting migration to N2 or N3
  • VM gets migrated to N3

Test #3
Leader: orchestrator2

  • VM is on N2
  • Manual power-off of N2
  • A new leader should be elected from orchestrator1 and orchestrator3
  • New leader is orchestrator3
  • Expecting migration to N1 or N3
  • VM gets migrated to N3

Test #4
Leader: orchestrator3

  • VM is on N1
  • Manual power-off of N1
  • Expecting migration to N2 or N3
  • VM entered the UNKNOWN state and was never migrated, even though the hook executed “successfully”
  • Sunstone shows the VM in the RUNNING state, but the logs say that the VM is in the UNKNOWN state

Test #5
Leader: orchestrator3

  • VM is on N2
  • Manual power-off of N2
  • Expecting migration to N1 or N3
  • VM gets migrated to N3

Test #6
Leader: orchestrator3

  • VM is on N3
  • Manual power-off of N3
  • A new leader should be elected from orchestrator1 and orchestrator2
  • New leader is orchestrator1
  • Expecting migration to N1 or N2
  • VM entered the UNKNOWN state and was never migrated, even though the hook executed “successfully”

Test #7
Leader: orchestrator1

  • VM is on N3
  • Manual power-off of N3
  • Expecting migration to N1 or N2
  • VM entered the UNKNOWN state and was never migrated, even though the hook executed “successfully”

Test #8
Leader: orchestrator1

  • VM is on N2
  • Manual power-off of N2
  • Expecting migration to N1 or N3
  • VM gets migrated to N3

Test #9
Leader: orchestrator1

  • VM is on N1
  • Manual power-off of N1
  • A new leader should be elected from orchestrator1 and orchestrator2
  • New leader is orchestrator1
  • Expecting migration to N1 or N2
  • VM gets migrated to N3

Results are as follows:

  • 6/9
  • 66% success rate
  • The VM always gets migrated to N3 whenever N3 is not the node being powered off (I don’t know how OpenNebula decides this; through the voting (quorum?) system?). For what it’s worth, the VM was originally created on N3.

The official documentation lacks a lot of details, though. Is there another source with more details?
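For reference, the host hook mentioned earlier could be written out roughly as follows. This is only a sketch: the ARGUMENTS line is the one quoted in this post, while every other attribute, the file name host_error.tpl, and the flag descriptions in the comments are assumptions based on the stock example that ships with OpenNebula, so they should be checked against the local installation.

```shell
#!/usr/bin/env bash
# Sketch: write a host-error hook template to a local file.
# Only the ARGUMENTS line comes from the post; the remaining attributes are
# assumed to match /usr/share/one/examples/host_hooks/error_hook.
cat > host_error.tpl <<'EOF'
NAME            = "host_error"
TYPE            = state
RESOURCE        = HOST
STATE           = "ERROR"
REMOTE          = "no"
COMMAND         = "ft/host_error.rb"
ARGUMENTS       = "$TEMPLATE -m -p 1 -u"
ARGUMENTS_STDIN = "yes"
EOF

# Flags as I understand them from the host_error.rb usage text (verify locally):
#   -m    reschedule the VMs on another host (shared storage only)
#   -p 1  do not act if the host comes back within 1 monitoring cycle
#   -u    disable fencing of the failed host
#
# Registering the hook (not run here, needs a working front-end):
#   onehook create host_error.tpl
```

Note that with -m the VMs are handed back to the scheduler, so which node they end up on is the scheduler's decision rather than something the hook chooses.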

I’ll now post some logs to make this clearer:

(Pane 0 (top left) shows /var/log/one/$VM_ID.log, Pane 1 (top right) shows /var/log/one/monitord.log, and Pane 2 (bottom) shows /var/log/one/onehem.log):

This log shows the tests where the VM is stuck in the UNKNOWN state

(Pane 0 (top left) shows /var/log/one/$VM_ID.log, Pane 1 (top right) shows /var/log/one/monitord.log, and Pane 2 (bottom) shows /var/log/one/onehem.log):

This log shows the tests where the migration succeeds
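For anyone who wants to watch the same three files during a test without a multi-pane terminal, a minimal sketch is shown here. The function name follow_ha_logs and the ONE_LOG_DIR override are hypothetical; the log paths are the ones listed above.

```shell
#!/usr/bin/env bash
# Follow the per-VM log, monitord.log and onehem.log in a single stream.
# ONE_LOG_DIR is a hypothetical override; /var/log/one is the path used above.
follow_ha_logs() {
    local vm_id="$1"
    local log_dir="${ONE_LOG_DIR:-/var/log/one}"
    # -n 0: start at the end of each file; -F: keep following across log rotation.
    # tail prints a "==> file <==" header whenever the output switches files.
    tail -n 0 -F "$log_dir/$vm_id.log" "$log_dir/monitord.log" "$log_dir/onehem.log"
}

# Usage (42 is a placeholder VM ID), run on the current leader:
#   follow_ha_logs 42
```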

And this is the only error in oned.log that is (I guess) related to the migration tasks:

I’m not really sure whether this error is related to VM HA


So, I would like to achieve 100% success: when a node that is hosting a VM is shut down (as in my tests) or disconnected (it may happen… I mean, that’s what VM HA is for, isn’t it?), OpenNebula should migrate the VM to a live node.

  • What am I doing wrong?
  • What can I do better?
  • Any extra info needed?

Hey guys… Any chance I could get some help?

So it seems that the tests that are failing for you are functionally similar to the ones that work; only the node is different. You can access the execution details of the hooks (maybe the hook executed successfully but some operation was not completed): onehook show will show you the executions, and -e the details of a specific execution. You can also add some debug info to the hook if there is no relevant info in the hook output.
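The inspection steps described in this reply can be sketched as a small shell helper. The function name hook_exec and the IDs in the usage comment are placeholders; onehook show and its -e option are the commands named above.

```shell
#!/usr/bin/env bash
# hook_exec HOOK_ID [EXEC_ID]
#   With only HOOK_ID: show the hook, including its list of executions.
#   With EXEC_ID too:  show the details of that single execution.
hook_exec() {
    local hook_id="$1"
    local exec_id="${2:-}"
    if [ -z "$exec_id" ]; then
        onehook show "$hook_id"
    else
        onehook show "$hook_id" -e "$exec_id"
    fi
}

# Usage (0 and 3 are placeholder IDs; run as oneadmin on the current leader):
#   onehook list      # find the ID of the host error hook
#   hook_exec 0       # list its executions and their return codes
#   hook_exec 0 3     # inspect the execution recorded for a failed test
```

Comparing the execution details of a failed test (VM left in UNKNOWN) with those of a successful one should show whether the hook actually requested the reschedule in both cases.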