Leader is not selected. RAFT on opennebula 5.8.1

barte1by · June 27, 2019, 9:52am

Hi.
I am running a 3-node RAFT cluster on OpenNebula 5.8.1 (Ubuntu 18.04.1 LTS).

Case 1
If two nodes are unavailable (one follower and the leader), they are shown in the error state.
The remaining node goes from the follower state to the candidate state and stays in the candidate state indefinitely.

onezone show 0

ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 onenode-1 http://10.191.171.9:2633/RPC2
1 onenode-2 http://10.191.171.30:2633/RPC2
2 onenode-3 http://10.191.171.21:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 onenode-1 error - - - - -
1 onenode-2 error - - - - -
2 onenode-3 candidate 20422 1389470 1389470 -1 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
root@onenode-3:~# onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 onenode-1 http://10.191.171.9:2633/RPC2
1 onenode-2 http://10.191.171.30:2633/RPC2
2 onenode-3 http://10.191.171.21:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 onenode-1 error - - - - -
1 onenode-2 error - - - - -
2 onenode-3 candidate 20431 1389470 1389470 -1 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"

Case 2
If two nodes are unavailable (both followers), they are shown in the error state.
After about 5 minutes the API stops responding (http://{FIP}:2633/RPC2 and http://{node-IP}:2633/RPC2)
with the error "ERR_CONNECTION_TIMED_OUT".

Commands such as "onezone show 0" or "onevm list" return "execution expired".

RAFT configuration in oned.conf:

#*******************************************************************************
FEDERATION = [
    MODE          = "STANDALONE",
    ZONE_ID       = 0,
    SERVER_ID     = 2,    ### use 0,1,2 
    MASTER_ONED   = ""
]

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 250000,
    LOG_PURGE_TIMEOUT    = 60,
    ELECTION_TIMEOUT_MS  = 5000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1000
]

# Executed when a server transits from follower->leader
 RAFT_LEADER_HOOK = [
     COMMAND = "raft/vip.sh",
     ARGUMENTS = "leader ens3 10.191.171.100/23"
 ]

# Executed when a server transits from leader->follower
 RAFT_FOLLOWER_HOOK = [
     COMMAND = "raft/vip.sh",
     ARGUMENTS = "follower ens3 10.191.171.100/23"
 ]
#*******************************************************************************

jpfoures · June 27, 2019, 12:22pm

Hi @barte1by

The RAFT election algorithm can elect a leader only if a quorum of floor(N/2) + 1 nodes is running.
Here N = 3, so floor(N/2) + 1 = 2. This means at least 2 nodes must be up (at most 1 node down) to have a leader.
That is why, when 2 nodes are down, the remaining node stays in the "candidate" state.

When the cluster has only one running node, there is no leader, which means the DB cannot be updated.

If you want to tolerate the loss of 2 nodes, you need 5 nodes in your front-end cluster instead of 3.
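
To make the arithmetic concrete, here is a minimal Python sketch of the majority rule. The function names (quorum, tolerated_failures, can_elect_leader) are made up for illustration and are not part of OpenNebula; the sketch only encodes the general RAFT majority rule, not how oned implements it.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes may be down while a majority is still running."""
    return n - quorum(n)


def can_elect_leader(n: int, running: int) -> bool:
    """A leader can be elected only if a majority of the n nodes is running."""
    return running >= quorum(n)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
    # 3-node cluster with 2 nodes down: only 1 node is running, no majority.
    print(can_elect_leader(3, running=1))
    # 5-node cluster with 2 nodes down: 3 nodes are running, majority holds.
    print(can_elect_leader(5, running=3))
```

With 3 nodes the quorum is 2, so a single surviving node can never win an election and stays a candidate; with 5 nodes the quorum is 3, so two nodes can be lost.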

Here is the link to the documentation page for more details: http://docs.opennebula.org/5.8/advanced_components/ha/frontend_ha_setup.html?highlight=raft#requirements-and-architecture

Hope my explanation is clear.

Regards

Jean-Philippe

barte1by · June 28, 2019, 4:23am

Thanks for the clarification.
