HA-RAFT member suddenly dies (OpenNebula 5.6)

Hi All!

I have been testing an HA-RAFT cluster consisting of 3 members (OpenNebula 5.6).
Unfortunately, I discovered that one of the members (whether follower or leader, it doesn't matter which) changed its status to error.

oneadmin@csor3:~$ onezone show 0
ZONE 0 INFORMATION                                                              
ID                : 0                   
NAME              : OpenNebula          


ZONE SERVERS                                                                    
ID NAME            ENDPOINT                                                       
 0 server-1        http://10.93.221.94:2633/RPC2
 1 server-2        http://10.93.221.126:2633/RPC2
 2 server-3        http://10.93.221.61:2633/RPC2

HA & FEDERATION SYNC STATUS                                                     
ID NAME            STATE      TERM       INDEX      COMMIT     VOTE  FED_INDEX 
 0 server-1        error      -          -          -          -     -
 1 server-2        follower   20505      97955812   97955812   2     -1
 2 server-3        leader     20505      97955812   97955812   2     -1

ZONE TEMPLATE                                                                   
ENDPOINT="-"
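
Restarting oned on the failed node should bring it back as a follower. A minimal recovery sketch (assuming the standard opennebula systemd unit name installed by the packages):

# on the failed server (here server-1, 10.93.221.94)
sudo systemctl restart opennebula

# watch it catch up on the Raft log and rejoin the zone
tail -f /var/log/one/oned.log

# from any server: the STATE column should return to leader/follower
onezone show 0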

The last lines in oned.log:

Thu Aug  2 11:55:05 2018 [Z0][InM][D]: Host 10.93.221.54 (143) successfully monitored.
Thu Aug  2 11:55:05 2018 [Z0][InM][D]: Host 10.93.221.49 (144) successfully monitored.
Thu Aug  2 11:55:06 2018 [Z0][ACL][I]: ACL Manager stopped.
Thu Aug  2 11:55:06 2018 [Z0][VMM][I]: Stopping Virtual Machine Manager...
Thu Aug  2 11:55:06 2018 [Z0][LCM][I]: Stopping Life-cycle Manager...
Thu Aug  2 11:55:06 2018 [Z0][LCM][I]: Life-cycle Manager stopped.
Thu Aug  2 11:55:06 2018 [Z0][TM][I]: Stopping Transfer Manager...
Thu Aug  2 11:55:06 2018 [Z0][DiM][I]: Stopping Dispatch Manager...
Thu Aug  2 11:55:06 2018 [Z0][DiM][I]: Dispatch Manager stopped.
Thu Aug  2 11:55:06 2018 [Z0][InM][I]: Stopping Information Manager...
Thu Aug  2 11:55:06 2018 [Z0][ReM][I]: Stopping Request Manager...
Thu Aug  2 11:55:06 2018 [Z0][AuM][I]: Stopping Authorization Manager...
Thu Aug  2 11:55:06 2018 [Z0][HKM][I]: Stopping Hook Manager...
Thu Aug  2 11:55:06 2018 [Z0][ImM][I]: Stopping Image Manager...
Thu Aug  2 11:55:06 2018 [Z0][MKP][I]: Stopping Marketplace Manager...
Thu Aug  2 11:55:06 2018 [Z0][IPM][I]: Stopping IPAM Manager...
Thu Aug  2 11:55:06 2018 [Z0][RCM][I]: Raft Consensus Manager...
Thu Aug  2 11:55:06 2018 [Z0][FRM][I]: Federation Replica Manager...
Thu Aug  2 11:55:06 2018 [Z0][FRM][I]: Federation Replica Manger stopped.
Thu Aug  2 11:55:06 2018 [Z0][ReM][I]: XML-RPC server stopped.
Thu Aug  2 11:55:06 2018 [Z0][ReM][I]: Request Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][AuM][I]: Authorization Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][MKP][I]: Marketplace Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][HKM][I]: Hook Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][ImM][I]: Image Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][RCM][I]: oned is set to follower mode
Thu Aug  2 11:55:07 2018 [Z0][RCM][I]: Raft Consensus Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][InM][I]: Information Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][IPM][I]: IPAM Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][VMM][I]: Virtual Machine Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][RCM][I]: Replication thread stopped
Thu Aug  2 11:55:07 2018 [Z0][RCM][I]: Replication thread stopped
Thu Aug  2 11:55:07 2018 [Z0][RCM][I]: Replication thread stopped
Thu Aug  2 11:55:07 2018 [Z0][RCM][I]: Replication thread stopped
Thu Aug  2 11:55:07 2018 [Z0][TrM][I]: Transfer Manager stopped.
Thu Aug  2 11:55:07 2018 [Z0][ONE][I]: All modules finalized, exiting.
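
Note that the shutdown above is orderly (it ends with "All modules finalized, exiting."), so oned was told to stop or finalized itself rather than being killed mid-operation. To narrow down why, I would check the service journal and the error-level entries logged just before the shutdown began (a quick sketch, assuming a systemd-based install):

# who stopped the service, and when
sudo journalctl -u opennebula --since "2018-08-02 11:50"

# error-level entries preceding the shutdown sequence
grep '\[E\]' /var/log/one/oned.log | tail -20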

RAFT parameters (the values in comments are the 5.6 defaults I changed from):

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 100000, #500000
    LOG_PURGE_TIMEOUT    = 60, #600
    ELECTION_TIMEOUT_MS  = 5000, #2500
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500 #450
]
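
For what it is worth, the usual Raft rule of thumb is that the election timeout must sit well above the heartbeat (broadcast) period, and the XML-RPC timeout must stay below the election timeout, otherwise a slow replication call can mask a dead peer. A rough sanity check of the values above (plain sh, not an official OpenNebula tool):

#!/bin/sh
# Values copied from the RAFT block above.
BROADCAST_TIMEOUT_MS=500
ELECTION_TIMEOUT_MS=5000
XMLRPC_TIMEOUT_MS=1500

# Followers should miss several heartbeats before starting an election.
if [ "$ELECTION_TIMEOUT_MS" -lt $((BROADCAST_TIMEOUT_MS * 5)) ]; then
    echo "WARN: ELECTION_TIMEOUT_MS is too close to the heartbeat period"
fi

# A replication call that outlives the election timeout is useless.
if [ "$XMLRPC_TIMEOUT_MS" -ge "$ELECTION_TIMEOUT_MS" ]; then
    echo "WARN: XMLRPC_TIMEOUT_MS should stay below ELECTION_TIMEOUT_MS"
fi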

Could this have happened because of wrong RAFT parameters?

It seems I have found a combination of parameters that makes HA-RAFT stable in my case:
RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 500000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 5000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500
]
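
The block should be identical in /etc/one/oned.conf on all three servers, and oned has to be restarted for it to take effect. A rolling restart keeps the zone quorate (a sketch; assumes passwordless sudo and the standard systemd unit name):

for h in 10.93.221.94 10.93.221.126 10.93.221.61; do
    ssh oneadmin@"$h" sudo systemctl restart opennebula
    sleep 15   # let the node rejoin before touching the next one
done
onezone show 0   # all three should report leader/follower again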
