OpenNebula 5.6 RAFT two nodes

razvanc · October 15, 2018, 4:29pm

Hello!

I’m running OpenNebula 5.6.1 on Debian 9, with two nodes in a HA scenario using RAFT. Everything runs OK until the leader fails (or is shut down using systemctl). When I shut the leader down (systemctl stop opennebula), the second node, which (I hope) should become leader, gets stuck in the candidate state:
HA & FEDERATION SYNC STATUS
ID NAME     STATE     TERM INDEX COMMIT VOTE FED_INDEX
 0 10.0.0.2 error     -    -     -      -    -
 1 10.0.0.3 candidate 1006 27400 0      -1   -1

The only logs I see are:
Mon Oct 15 19:25:00 2018 [Z0][RCM][I]: Error requesting vote from follower 0:libcurl failed to execute the HTTP POST transaction, explaining: Failed to connect to 10.0.0.2 port 2633: Connection refused
Mon Oct 15 19:25:00 2018 [Z0][RCM][I]: No leader found, starting new election in 2790ms

I expect to see these errors, since the leader node is down, but I also expect the failover node (10.0.0.3) to take over and become leader. Are my expectations correct, or is there something wrong with my scenario?

Thank you!
Răzvan

atodorov_storpool · October 15, 2018, 6:37pm

Hi @razvanc,

OpenNebula recommends 3 or 5 nodes.

The RAFT consensus algorithm needs N/2+1 nodes (a majority) available to form a quorum. With two nodes, that means both must be up. The remaining node in your case cannot win an election on its own, so it stays a candidate until the other node(s) become available again.
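
To make the N/2+1 arithmetic concrete, here is a minimal Python sketch. It is not OpenNebula code; the function names `quorum` and `tolerated_failures` are made up for illustration, and N/2 is taken as integer division.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of servers that forms a strict majority (N/2 + 1)."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many servers can be down while a leader can still be elected."""
    return cluster_size - quorum(cluster_size)


for n in (2, 3, 5):
    print(f"{n} servers: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With two servers the quorum is two, so losing either one blocks the election, while three servers tolerate one failure and five tolerate two.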

Hope this helps.

Best Regards,
Anton Todorov

razvanc · October 16, 2018, 7:32am

Hi, Anton!

Thank you for your prompt response! This was actually what I was thinking too, but I couldn’t pinpoint the hard requirement of having N/2+1 nodes; I only saw the recommendation you pointed out. TBH, I didn’t read the RAFT specification, I am sorry about that!
Do you know if there is a way to add a 3rd, lightweight node (not an actual installation), just to ensure consensus? I am not sure whether deploying only a generic RAFT implementation would help, since it would still need to implement some of the OpenNebula logic.

Thanks,
Răzvan

atodorov_storpool · October 16, 2018, 8:26am

Hi Razvan,

I am not sure whether it is possible to add just a voting beacon. Please feel free to file a feature request though.

Best Regards,
Anton Todorov

razvanc · October 16, 2018, 9:54am

Done! One can follow the feature request here.

Thank you very much for your help!
Răzvan

petr108m · January 23, 2019, 8:58am

The status of the feature says:
Code committed to upstream release/hotfix branches

Does this mean it is possible to download and install it?
Can you provide details?

razvanc · January 23, 2019, 11:39am

According to the ticket, nothing has been done yet - those are just bullets that need to be checked when completed.

ruben · February 4, 2019, 9:39am

Hi,

AFAIK this is not possible for RAFT. A node must be leader, follower or candidate. Note that log entries are committed once a majority of followers have replicated the entry, so the algorithm assumes that any of them could take the leadership in case of failure…

I guess the lightweight approach would be to create a VM with your third oned server running in it…
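
The majority-commit rule mentioned above can be sketched in a few lines of Python. This is a simplified illustration of the general Raft rule, not the oned implementation: `commit_index` and `match_index` are hypothetical names, the leader is counted among the servers, and the additional Raft condition that the entry must belong to the leader's current term is left out.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of servers.

    match_index holds, for every server in the zone (leader included),
    the index of the last log entry known to be replicated on it.
    """
    majority = len(match_index) // 2 + 1
    # Sorted in descending order, the value at position majority-1 is
    # present on at least `majority` servers.
    return sorted(match_index, reverse=True)[majority - 1]


# Three servers, one of them lagging: the entry is still committed.
print(commit_index([27400, 27400, 27390]))

# Two servers, one of them with nothing replicated: no majority beyond 0.
print(commit_index([27400, 0]))
```

Because any committed entry lives on a majority, every server that votes also has to be able to hold the full log, which is why a vote-only third member does not fit the algorithm as it stands.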

Bishop · July 24, 2019, 10:37pm

A node must be leader, follower or candidate.

No question there – any node should be able to take over as RAFT leader if another fails.

Who says that node needs to also be able to take over as oned leader? It seems knowing which oned is the leader is a piece of info that can be passed from raft node to node, without either of those nodes being that leader, in much the same way we all can agree which dog is my dog without any of us being my dog.
