VM's HA working randomly

FrankKC · December 19, 2021, 12:32am

Hi guys, I’m trying to get VM HA working, but I can’t get it to work reliably.

I have the following hyperconverged setup:
Node1: Bare-metal Ubuntu 20.04 (HP ProLiant DL350 g10, 1 CPU and 16GB RAM), which runs the following services:

Shared data storage with the other nodes (a distributed and replicated volume).
KVM host (installed via opennebula-node package).
LXC host (the package that comes with Ubuntu). It runs an Ubuntu 20.04 container as orchestrator1; this container runs the OpenNebula FrontEnd (Sunstone) and also acts as the orchestrator for the hypervisor running on the bare-metal server.
All OpenNebula 6.2 packages.
IP: 192.168.10.101
Hostname: ON-N1-C1

Node2: Bare-metal Ubuntu 20.04 (HP ProLiant DL350 g10, 1 CPU and 16GB RAM), which runs the following services:

Shared data storage (a distributed and replicated volume).
KVM host (installed via opennebula-node package).
LXC host (the package that comes with Ubuntu). It runs an Ubuntu 20.04 container as orchestrator2; this container runs the OpenNebula FrontEnd (Sunstone) and also acts as the orchestrator for the hypervisor running on the bare-metal server.
All OpenNebula 6.2 packages.
IP: 192.168.10.102
Hostname: ON-N1-C2

Node3: Bare-metal Ubuntu 20.04 (HP ProLiant DL350 g10, 1 CPU and 16GB RAM), which runs the following services:

Shared data storage (a distributed and replicated volume).
KVM host (installed via opennebula-node package).
LXC host (the package that comes with Ubuntu). It runs an Ubuntu 20.04 container as orchestrator3; this container runs the OpenNebula FrontEnd (Sunstone) and also acts as the orchestrator for the hypervisor running on the bare-metal server.
All OpenNebula 6.2 packages.
IP: 192.168.10.101
Hostname: ON-N1-C1.

I have configured the FrontEnd HA and it works flawlessly. I’m able to create VMs and run/configure services on them.

So I recently read about VM HA and gave it a try. I followed the guides, configured HA and the hooks (with the ARGUMENTS = "$TEMPLATE -m -p 1 -u" options), and then ran some tests to verify that VM HA works correctly. I ran the following tests ("manual power off" means I turned the node off and waited more than 10 minutes):

Test #1
Leader: orchestrator2

VM is in N3
Manual power off of N3
Expecting migration to N1 or N2
VM entered the UNKNOWN state and was never migrated, even though the hook was “successfully” executed

Test #2
Leader: orchestrator2

VM is in N1
Manual power off of N1
Expecting migration to N2 or N3
VM gets migrated to N3

Test #3
Leader: orchestrator2

VM is in N2
Manual power off of N2
New leader should be elected from orchestrator1 or orchestrator3
New leader is orchestrator3
Expecting migration to N1 or N3
VM gets migrated to N3

Test #4
Leader: orchestrator3

VM is in N1
Manual power off of N1
Expecting migration to N2 or N3
VM entered the UNKNOWN state and was never migrated, even though the hook was “successfully” executed
Sunstone shows the VM in the RUNNING state, but the logs say that the VM is in the UNKNOWN state

Test #5
Leader: orchestrator3

VM is in N2
Manual power off of N2
Expecting migration to N1 or N3
VM gets migrated to N3

Test #6
Leader: orchestrator3

VM is in N3
Manual power off of N3
New leader should be elected from orchestrator1 or orchestrator2
New leader is orchestrator1
Expecting migration to N1 or N2
VM entered the UNKNOWN state and was never migrated, even though the hook was “successfully” executed

Test #7
Leader: orchestrator1

VM is in N3
Manual power off of N3
Expecting migration to N1 or N2
VM entered the UNKNOWN state and was never migrated, even though the hook was “successfully” executed

Test #8
Leader: orchestrator1

VM is in N2
Manual power off of N2
Expecting migration to N1 or N3
VM gets migrated to N3

Test #9
Leader: orchestrator1

VM is in N1
Manual power off of N1
New leader should be elected from orchestrator1 or orchestrator2
New leader is orchestrator1
Expecting migration to N1 or N2
VM gets migrated to N3

Results are as follows:

6/9
66% success rate
The VM always gets migrated to N3 whenever N3 is not the node being powered off (I don’t know how OpenNebula decides this; maybe through the voting (quorum?) system). For what it’s worth, the VM was originally created on N3.

The official documentation lacks a lot of detail, though. Is there another source with more details?

I will now post some logs to make this clearer:

(Pane0 (top left) shows /var/log/one/$VM_ID.log, Pane1 (top right) shows /var/log/one/monitord.log, and Pane2 (bottom) shows /var/log/one/onehem.log):

This log shows the tests in which the VM is stuck in the UNKNOWN state

(Pane0 (top left) shows /var/log/one/$VM_ID.log, Pane1 (top right) shows /var/log/one/monitord.log, and Pane2 (bottom) shows /var/log/one/onehem.log):

This log shows the tests in which the migration succeeds

And this is the only error in oned.log that is (I guess) related to the migration tasks:

I’m not really sure if this error is related to the VM’s HA

So, I would like to achieve 100% success: when a node that is hosting a VM is shut down (as in my tests) or disconnected (it may happen… I mean, that’s what VM HA is for, isn’t it?), OpenNebula should live migrate the VM to a live node.

What am I doing wrong?
What can I do better?
Any extra info needed?
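
For reference, here is a rough sketch of how a host-failure hook with the arguments mentioned above might be registered. Only the ARGUMENTS line comes from my setup; the other template fields follow the example as I recall it from the OpenNebula VM HA guide, so treat them as assumptions and check them against your version. The temporary file path is arbitrary. As I understand the flags, -m migrates the VMs, -p 1 skips resubmission if the host comes back within one monitoring cycle, and -u disables fencing.

```shell
# Write the hook template to a temporary file (the path is arbitrary).
HOOK_FILE="$(mktemp)"

# The quoted 'EOF' keeps $TEMPLATE literal, so OpenNebula expands it, not the shell.
cat > "$HOOK_FILE" <<'EOF'
NAME            = "host_error"
TYPE            = state
RESOURCE        = HOST
STATE           = ERROR
REMOTE          = "no"
COMMAND         = "ft/host_error.rb"
ARGUMENTS       = "$TEMPLATE -m -p 1 -u"
ARGUMENTS_STDIN = "yes"
EOF

# Register the hook only where the OpenNebula CLI is available (the front-end).
if command -v onehook >/dev/null 2>&1; then
    onehook create "$HOOK_FILE"
else
    echo "onehook not found; template left in $HOOK_FILE" >&2
fi
```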

FrankKC · December 22, 2021, 8:06pm

Hey guys… Any chance I could get some help?

ruben · January 18, 2022, 9:21am

So it seems that the tests that are failing for you are functionally similar to the ones that work; only the node is different. You can access the execution details of the hooks (maybe the hook executed successfully but some operation was not completed): onehook show will show you the executions, and -e the details of a specific execution. You can also add some debug info to the hook if there is no relevant info in the hook output.
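
A minimal sketch of that inspection, to be run on the leader front-end. HOOK_ID and EXEC_ID are placeholders (both default to 0 here purely for illustration); take the real values from the output of the first two commands.

```shell
# HOOK_ID and EXEC_ID are placeholders; read the real values from the listings.
HOOK_ID="${HOOK_ID:-0}"
EXEC_ID="${EXEC_ID:-0}"

if command -v onehook >/dev/null 2>&1; then
    onehook list                           # find the ID of the host-failure hook
    onehook show "$HOOK_ID"                # hook details plus its list of executions
    onehook show "$HOOK_ID" -e "$EXEC_ID"  # details of one specific execution
else
    echo "onehook not found; run this on the OpenNebula front-end" >&2
fi
```

Comparing the execution details of a failing test (for example, one where N3 was powered off) with those of a successful one should show whether the hook really did the same thing in both cases.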
