HA OpenNebula test cluster - deploy_success_action, VM in a wrong state

It appears that the problem occurs if a VM finishes deployment during a Raft election or shortly thereafter.


Versions of the related components and OS (frontend, hypervisors, VMs):
ONE 6.2
3 hosts, relatively low-powered though (dual-core machines); this likely isn’t a problem on beefier hosts
Installed as per the HA frontend guide (I think at least)
NFS datastores

I’ve tried raising the election timeout, and my XMLRPC_TIMEOUT_MS is 0, as suggested in other threads on similar topics.

Steps to reproduce:
Deploy a VM. I’m using Alpine and just letting it deploy on any host in my HA cluster.

Current results:

During deployment, Raft triggers an election. This then seems to cause some split-brain confusion or similar.

VM log may include:

deploy_success_action, VM in a wrong state

or oned log may include:

Tried to modify DB being a follower

Expected results:

VM deploys itself automatically

I’m going to leave this question up to get a better answer. I upped ELECTION_TIMEOUT_MS all the way to 30000 (default was 500) and things seem to be working fine enough now, but this feels a tad on the hacky side. That said, this use case isn’t quite representative: it is purely a test cluster running on terrible hardware lol

It makes sense to tune ELECTION_TIMEOUT_MS to your network latency and/or server performance. This value roughly sets how long the followers wait for the leader heartbeat, so it may need to be adjusted to the leader’s “pace”…
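
To make the heartbeat reasoning concrete, here is a toy model in Python. It is only a sketch: the function name and the gap values are made up for illustration, and it assumes an election is triggered whenever the gap between two leader heartbeats exceeds the timeout, which ignores everything else oned actually does.

```python
def count_election_triggers(heartbeat_gaps_ms, election_timeout_ms):
    """Count heartbeat gaps that are longer than the election timeout.

    Toy assumption: a follower calls an election as soon as it has
    waited longer than election_timeout_ms for the next heartbeat.
    """
    return sum(1 for gap in heartbeat_gaps_ms if gap > election_timeout_ms)


# Hypothetical gaps (in ms) between heartbeats from a slow, loaded leader.
gaps = [400, 450, 700, 2600, 500, 8000, 420]

for timeout in (500, 1000, 30000):
    triggers = count_election_triggers(gaps, timeout)
    print(f"timeout={timeout} ms -> {triggers} election(s) triggered")
```

With these made-up gaps, a small timeout fires on every stall of the leader, while a timeout above the worst stall never fires. The trade-off is that a larger timeout also delays failover when the leader really is gone.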

Anyway, if you have the log files (oned.log) for the leader and the followers, we can take a look.
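
If it helps when collecting the logs, a small Python sketch like the one below can pull out just the lines containing the two messages quoted in this thread. The helper names are hypothetical, and the log path is passed on the command line because the location of oned.log on your install is an assumption I don't want to bake in.

```python
import sys

# The two messages quoted earlier in this thread.
PATTERNS = (
    "deploy_success_action, VM in a wrong state",
    "Tried to modify DB being a follower",
)


def find_matches(lines, patterns=PATTERNS):
    """Return the lines that contain any of the given messages."""
    return [line.rstrip("\n") for line in lines if any(p in line for p in patterns)]


def scan_log(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as fh:
        return find_matches(fh)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_logs.py <path to oned.log> [more log files...]
    for log_path in sys.argv[1:]:
        for match in scan_log(log_path):
            print(f"{log_path}: {match}")
```

Running it against the oned.log from each front-end node and comparing timestamps should show whether the errors line up with an election.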

Thanks. I’ve since run an identical setup on better hardware and the value seemed to scale (I had it down at 1000) without issue. I can’t overstate how terrible the hardware I was using at first was lol