HA OpenNebula test cluster - deploy_success_action, VM in a wrong state

matchett808 · March 27, 2022, 3:02am

It appears that if a VM finishes deployment during a RAFT leader election, or shortly afterwards, the VM ends up in a wrong state.

Versions of the related components and OS (frontend, hypervisors, VMs):
ONE 6.2
3 hosts, relatively low-powered (cheap dual-core machines); this likely isn't a problem on beefier hosts
Installed following the HA frontend guide (I think, at least)
NFS datastores

I've tried increasing the election timeout, and my XMLRPC_TIMEOUT_MS is set to 0, as suggested in other threads on similar topics.

Steps to reproduce:
Deploy a VM (I'm using Alpine) and let it deploy on any host in the HA cluster.

Current results:

During deployment, RAFT triggers a leader election, which then seems to cause some split-brain confusion or something similar.

The VM log may include:

deploy_success_action, VM in a wrong state

or oned.log may include:

Tried to modify DB being a follower

Expected results:

The VM deploys automatically.

matchett808 · March 27, 2022, 3:10am

I'm going to leave this question open in case there's a better answer. I raised ELECTION_TIMEOUT_MS all the way to 30000 (the default was 500), and things seem to work well enough now, but this feels a bit hacky. (Then again, this isn't a realistic use case: it's purely a test cluster running on terrible hardware, lol.)

ruben · March 28, 2022, 8:27am

It makes sense to adjust ELECTION_TIMEOUT_MS to your network latency and/or server performance. This value roughly sets how long a follower waits for the leader's heartbeat before starting an election, so it may need to be tuned to the leader's “pace”.

Anyway, if you have the log files (oned.log) for the leader and followers, we can take a look.
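To make the tuning advice concrete, here is a minimal Python sketch of the underlying Raft rule under simplifying assumptions: a follower resets its election timer on every leader heartbeat and starts an election as soon as one gap between heartbeats exceeds the timeout. Real Raft implementations randomize the timeout and handle many other details that this model ignores, and it is not a description of OpenNebula's internals. The function names, the sample gap values and the "worst gap times a safety factor" heuristic are hypothetical illustrations, not official guidance.

```python
from typing import Sequence


def would_trigger_election(heartbeat_gaps_ms: Sequence[float], election_timeout_ms: float) -> bool:
    """Return True if any gap between leader heartbeats exceeds the timeout.

    Simplified model (assumption): the follower's election timer is reset on
    every heartbeat, so a single gap longer than the timeout starts an election.
    """
    return any(gap > election_timeout_ms for gap in heartbeat_gaps_ms)


def suggest_election_timeout(heartbeat_gaps_ms: Sequence[float], safety_factor: float = 3.0) -> float:
    """Hypothetical heuristic: worst observed heartbeat gap times a safety margin."""
    if not heartbeat_gaps_ms:
        raise ValueError("need at least one observed heartbeat gap")
    return max(heartbeat_gaps_ms) * safety_factor


if __name__ == "__main__":
    # Made-up gaps (ms) for an overloaded host: mostly fast, with occasional stalls.
    gaps = [120, 130, 110, 900, 140, 2600, 125]
    print(would_trigger_election(gaps, 500))    # True: the 900 and 2600 ms stalls exceed 500
    print(would_trigger_election(gaps, 30000))  # False: no gap comes close to 30000
    print(suggest_election_timeout(gaps))       # 7800.0 (2600 * 3.0)
```

In this model, a single stall on a slow host is enough to start an election, which is consistent with elections firing at a low timeout on weak hardware and disappearing at a much larger one. Picking a real value would still require measuring actual heartbeat behaviour from the leader and follower logs.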

matchett808 · May 20, 2022, 10:20am

Thanks. I've since run an identical setup on better hardware, and the value seemed to scale down without issue (I had it at 1000). I can't overstate how terrible the hardware I was using at first was, lol.
