Hey @tinova. We have adjusted the values, but our cluster is still unstable and keeps changing leaders.
We are using the following values:
RAFT = [
LIMIT_PURGE = 100000,
LOG_RETENTION = 500000,
LOG_PURGE_TIMEOUT = 600,
ELECTION_TIMEOUT_MS = 2500,
BROADCAST_TIMEOUT_MS = 500,
XMLRPC_TIMEOUT_MS = 1000
]
I have also tried different XMLRPC_TIMEOUT_MS values.
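For reference, here is a quick sanity check of how the timeouts above relate to each other. It is only a sketch and rests on an assumption we have not confirmed here: that BROADCAST_TIMEOUT_MS is the heartbeat interval and ELECTION_TIMEOUT_MS is how long a follower waits without a heartbeat before starting an election. The helper name is made up for illustration.

```python
# Values copied from the RAFT section above (all in milliseconds).
raft = {
    "ELECTION_TIMEOUT_MS": 2500,
    "BROADCAST_TIMEOUT_MS": 500,
    "XMLRPC_TIMEOUT_MS": 1000,
}

def heartbeats_per_election_timeout(cfg):
    """Hypothetical helper: how many heartbeat intervals fit into one
    election timeout, ASSUMING the parameter meanings stated above."""
    return cfg["ELECTION_TIMEOUT_MS"] / cfg["BROADCAST_TIMEOUT_MS"]

ratio = heartbeats_per_election_timeout(raft)
print(f"heartbeat intervals per election timeout: {ratio}")
```

With our values the ratio is 5, so under that assumption a follower would have to miss about five heartbeats in a row before it starts an election.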
Log of one follower that was previously the leader:
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:42 2018 [Z0][VMM][D]: VM 530 successfully monitored: STATE=a CPU=14.11 MEMORY=25263032 NETRX=37417309843 NETTX=52699153043 DISKRDBYTES=43762943544 DISKWRBYTES=676456159232 DISKRDIOPS=1741159 DISKWRIOPS=33531267
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][ReM][D]: Req:4640 UID:0 one.zone.voterequest result SUCCESS, 72936
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][ReM][E]: Req:224 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][ReM][E]: Req:6048 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][ReM][E]: Req:1632 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][ReM][E]: Req:3152 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:31:12 2018 [Z0][ReM][D]: Req:3200 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:31:12 2018 [Z0][ReM][D]: Req:3200 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>0</..."
Log of another node:
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:35:08 2018 [Z0][ReM][I]: New term (72950) discovered from leader 0
Mon Jul 30 09:35:08 2018 [Z0][ReM][I]: New term (72950) discovered from leader 0
Mon Jul 30 09:35:08 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:35:09 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:35:15 2018 [Z0][ReM][D]: Req:2800 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:35:15 2018 [Z0][ReM][D]: Req:2800 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 30 09:35:45 2018 [Z0][ReM][D]: Req:0 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:35:45 2018 [Z0][ReM][D]: Req:0 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 30 09:36:06 2018 [Z0][ReM][D]: Req:9584 UID:0 one.zone.voterequest invoked , 72951, 1, 31958730, 72950
Mon Jul 30 09:36:06 2018 [Z0][ReM][I]: New term (72951) discovered from candidate 1
Mon Jul 30 09:36:07 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:36:07 2018 [Z0][ReM][E]: Req:9584 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:36:09 2018 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: Error requesting vote from follower 0:RPC call timed out and aborted
Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: Vote not granted from follower 1: [one.zone.voterequest] Already voted for another candidate
Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: No leader found, starting new election in 3246ms
Mon Jul 30 09:36:10 2018 [Z0][ReM][D]: Req:8480 UID:0 one.zone.voterequest invoked , 72952, 1, 31958730, 72950
Mon Jul 30 09:36:10 2018 [Z0][ReM][E]: Req:8480 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:36:14 2018 [Z0][ReM][D]: Req:5824 UID:0 one.zone.voterequest invoked , 72953, 1, 31958730, 72950
Mon Jul 30 09:36:14 2018 [Z0][ReM][E]: Req:5824 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:36:14 2018 [Z0][RCM][I]: Error requesting vote from follower 0:RPC call timed out and aborted
Mon Jul 30 09:36:14 2018 [Z0][RCM][I]: Got vote from follower 1. Total votes: 1
Mon Jul 30 09:36:14 2018 [Z0][RCM][I]: Got majority of votes
Mon Jul 30 09:36:15 2018 [Z0][RCM][I]: Becoming leader of the zone. Last log record: 31958768 last applied record: 31958768
Mon Jul 30 09:36:15 2018 [Z0][RCM][I]: oned is now the leader of the zone
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:2000 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:2000 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:9600 UID:0 one.vmpool.info invoked , -2, -1, -1, -1
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:9600 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>74<..."
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:1280 UID:0 one.vmpool.info invoked , -2, -1, -1, -1
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:1280 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>74<..."
It seems that a follower suddenly takes over leadership. We have reviewed the technical setup and found no issues such as network problems. The whole setup worked flawlessly on 5.4.
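To put numbers on how often this happens, a small script like the following could count the Raft events in a log. It is a rough sketch: the patterns are copied from the log excerpts above and may be worded differently in other versions, the function name is made up, and the log path in the usage note is an assumption about a default install.

```python
import re

# Patterns taken verbatim from the log lines quoted above.
NEW_TERM = re.compile(r"New term \((\d+)\) discovered from (?:leader|candidate) \d+")
HEARTBEAT_LOST = "Failed to get heartbeat from leader"
VOTE_RPC_TIMEOUT = re.compile(r"Error requesting vote from follower \d+:RPC call timed out")
BECAME_LEADER = "oned is now the leader of the zone"

def summarize_raft_events(lines):
    """Hypothetical helper: count election-related events in oned log lines."""
    terms = set()
    summary = {"heartbeat_lost": 0, "vote_rpc_timeout": 0, "became_leader": 0}
    for line in lines:
        match = NEW_TERM.search(line)
        if match:
            terms.add(int(match.group(1)))  # repeated lines for a term count once
        elif HEARTBEAT_LOST in line:
            summary["heartbeat_lost"] += 1
        elif VOTE_RPC_TIMEOUT.search(line):
            summary["vote_rpc_timeout"] += 1
        elif BECAME_LEADER in line:
            summary["became_leader"] += 1
    summary["terms_seen"] = sorted(terms)
    return summary

# A few lines from the second node's log above, used as sample input.
sample = [
    "Mon Jul 30 09:35:08 2018 [Z0][ReM][I]: New term (72950) discovered from leader 0",
    "Mon Jul 30 09:35:08 2018 [Z0][ReM][I]: New term (72950) discovered from leader 0",
    "Mon Jul 30 09:36:06 2018 [Z0][ReM][I]: New term (72951) discovered from candidate 1",
    "Mon Jul 30 09:36:09 2018 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess",
    "Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: Error requesting vote from follower 0:RPC call timed out and aborted",
    "Mon Jul 30 09:36:15 2018 [Z0][RCM][I]: oned is now the leader of the zone",
]
result = summarize_raft_events(sample)
print(result)
```

To run it on a whole file, pass the open file object, for example `summarize_raft_events(open("/var/log/one/oned.log"))` (path assumed). Comparing the counts across the three nodes should show which one loses heartbeats or times out on vote requests most often.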