Leader is not selected. RAFT on opennebula 5.8.1

Hi.
I have RAFT HA running on OpenNebula 5.8.1 (Ubuntu 18.04.1 LTS) with 3 nodes.

Case 1
If two nodes are unavailable (one follower and the leader), they are shown in the error state.
The remaining node goes from the follower state to the candidate state, and it stays in the candidate state indefinitely.

onezone show 0

ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 onenode-1 http://10.191.171.9:2633/RPC2
1 onenode-2 http://10.191.171.30:2633/RPC2
2 onenode-3 http://10.191.171.21:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 onenode-1 error - - - - -
1 onenode-2 error - - - - -
2 onenode-3 candidate 20422 1389470 1389470 -1 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
root@onenode-3:~# onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 onenode-1 http://10.191.171.9:2633/RPC2
1 onenode-2 http://10.191.171.30:2633/RPC2
2 onenode-3 http://10.191.171.21:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 onenode-1 error - - - - -
1 onenode-2 error - - - - -
2 onenode-3 candidate 20431 1389470 1389470 -1 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"

Case 2
If two nodes are unavailable (both followers), they are shown in the error state.
After about 5 minutes the API stops responding (http://{FIP}:2633/RPC2 and http://{node-IP}:2633/RPC2).
The error is “ERR_CONNECTION_TIMED_OUT”.

Commands such as “onezone show 0” or “onevm list” return “execution expired”.

RAFT configuration in oned.conf:

#*******************************************************************************
FEDERATION = [
    MODE          = "STANDALONE",
    ZONE_ID       = 0,
    SERVER_ID     = 2,    ### use 0,1,2 
    MASTER_ONED   = ""
]

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 250000,
    LOG_PURGE_TIMEOUT    = 60,
    ELECTION_TIMEOUT_MS  = 5000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1000
]

# Executed when a server transits from follower->leader
 RAFT_LEADER_HOOK = [
     COMMAND = "raft/vip.sh",
     ARGUMENTS = "leader ens3 10.191.171.100/23"
 ]

# Executed when a server transits from leader->follower
 RAFT_FOLLOWER_HOOK = [
     COMMAND = "raft/vip.sh",
     ARGUMENTS = "follower ens3 10.191.171.100/23"
 ]
#*******************************************************************************

Hi @barte1by

The RAFT election algorithm can elect a leader only if a quorum of floor(N/2) + 1 servers is running.
Here N = 3, so floor(N/2) + 1 = 2. This means you need at least 2 nodes up (at most 1 node down) to have a leader.
That is why, when 2 nodes are down, the remaining node stays in the “candidate” state.
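
If you want to detect this situation from a script, here is a minimal sketch that parses the "HA & FEDERATION SYNC STATUS" table as it appears in the `onezone show 0` output pasted above. The function names `parse_ha_status` and `has_leader` are my own, not part of OpenNebula, and the parser assumes the column layout shown in this thread (ID, NAME, STATE first).

```python
def parse_ha_status(text):
    """Return {server_name: state} from the HA status table of `onezone show`.

    Assumes the table header starts with "ID NAME STATE" and that the
    table ends at the first line that does not start with a numeric ID.
    """
    states = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:3] == ["ID", "NAME", "STATE"]:
            in_table = True
            continue
        if in_table:
            if len(fields) < 3 or not fields[0].isdigit():
                break  # blank line or next section: table is over
            states[fields[1]] = fields[2]
    return states


def has_leader(states):
    """True if any server in the zone reports the leader state."""
    return "leader" in states.values()


if __name__ == "__main__":
    sample = """HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 onenode-1 error - - - - -
1 onenode-2 error - - - - -
2 onenode-3 candidate 20422 1389470 1389470 -1 -1

ZONE TEMPLATE
"""
    states = parse_ha_status(sample)
    print(states)
    print("leader present:", has_leader(states))
```

With the output from case 1, no server reports the leader state, which matches what you see.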

When the cluster has only one running node there is no leader, which means the DB cannot be updated.

If you want to tolerate the loss of 2 nodes, you need 5 nodes in your Front-end cluster instead of 3.
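
The arithmetic can be checked with a few lines of Python. This is only an illustration of the majority rule described above; the helper names `quorum` and `tolerated_failures` are made up for the example.

```python
def quorum(n):
    """Smallest majority of n servers: floor(n/2) + 1."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many servers can be down while a majority is still running."""
    return n - quorum(n)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"N={n}: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
    # N=3: quorum=2, tolerated failures=1
    # N=5: quorum=3, tolerated failures=2
```

So a 3-node cluster survives one failure and a 5-node cluster survives two.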

Here is the link to the documentation page for more details: http://docs.opennebula.org/5.8/advanced_components/ha/frontend_ha_setup.html?highlight=raft#requirements-and-architecture

Hope my explanation is clear.

Regards

Jean-Philippe

thanks for the clarifications