RAFT issue since upgrade

Hello,

Since our upgrade from 5.4 to 5.6 we have been experiencing RAFT leader takeovers.

Failed to get heartbeat from leader. Starting election proccess

The upgrade guide from 5.4 says:

Please adjust the XMLRPC_TIMEOUT according to the typical RTT (round-trip-time) of xml-rpc calls across RAFT servers. This value should be 4 or 5 times the average RTT

http://docs.opennebula.org/5.6/intro_release_notes/upgrades/upgrade_54.html

The upgrade guide from 5.6 says:

If you are using a HA setup you need to manually change the value RAFT/XMLRPC_TIMEOUT_MS to 0 before restarting the OpenNebula daemons.

http://docs.opennebula.org/5.6/intro_release_notes/upgrades/upgrade_56.html

If we are on 5.6, what is the right value now?

The note in the 5.6 upgrade guide was misleading; we have removed it. Thanks for the heads up.

Please adjust XMLRPC_TIMEOUT_MS according to the typical RTT (round-trip time) of XML-RPC calls across the RAFT servers. This value should be 4 or 5 times the average RTT. Also please ensure that the election timeout is not too low.
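
The RTT that matters here is the round trip of an actual XML-RPC call, not an ICMP ping. One minimal way to sample it is a sketch like the one below; it assumes the default oned XML-RPC endpoint on port 2633 and a one_auth style "user:password" session string, so adjust both for your setup.

import time
import xmlrpc.client

# Hypothetical values: point this at one of the other RAFT servers and use the
# credentials from your one_auth file.
ENDPOINT = "http://10.40.0.2:2633/RPC2"
SESSION = "oneadmin:password"

server = xmlrpc.client.ServerProxy(ENDPOINT)

samples = []
for _ in range(20):
    start = time.monotonic()
    server.one.zone.raftstatus(SESSION)  # lightweight read-only call
    samples.append((time.monotonic() - start) * 1000.0)  # milliseconds

avg = sum(samples) / len(samples)
print("avg RTT: %.1f ms, max: %.1f ms" % (avg, max(samples)))
print("4-5x the average RTT suggests XMLRPC_TIMEOUT_MS around %d" % int(avg * 5))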

Thank you @tinova. We will go ahead and adjust the values.

Hey @tinova. We have adjusted the values, but our cluster still keeps flapping between leaders.

We are using the following values:

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 500000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 2500,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1000
] 

I have also tried different XMLRPC_TIMEOUT_MS values.

Log from one follower, which was the leader before:

Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:42 2018 [Z0][VMM][D]: VM 530 successfully monitored: STATE=a CPU=14.11 MEMORY=25263032 NETRX=37417309843 NETTX=52699153043 DISKRDBYTES=43762943544 DISKWRBYTES=676456159232 DISKRDIOPS=1741159 DISKWRIOPS=33531267
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][ReM][D]: Req:4640 UID:0 one.zone.voterequest result SUCCESS, 72936
Mon Jul 30 09:30:42 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:42 2018 [Z0][ReM][E]: Req:224 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][ReM][E]: Req:6048 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][ReM][E]: Req:1632 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][ReM][E]: Req:3152 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:30:43 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:31:12 2018 [Z0][ReM][D]: Req:3200 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:31:12 2018 [Z0][ReM][D]: Req:3200 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>0</..."

Log of another node:

Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:34:46 2018 [Z0][RCM][I]: Replication thread stopped
Mon Jul 30 09:35:08 2018 [Z0][ReM][I]: New term (72950) discovered from leader 0
Mon Jul 30 09:35:08 2018 [Z0][ReM][I]: New term (72950) discovered from leader 0
Mon Jul 30 09:35:08 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:35:09 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:35:15 2018 [Z0][ReM][D]: Req:2800 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:35:15 2018 [Z0][ReM][D]: Req:2800 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 30 09:35:45 2018 [Z0][ReM][D]: Req:0 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:35:45 2018 [Z0][ReM][D]: Req:0 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 30 09:36:06 2018 [Z0][ReM][D]: Req:9584 UID:0 one.zone.voterequest invoked , 72951, 1, 31958730, 72950
Mon Jul 30 09:36:06 2018 [Z0][ReM][I]: New term (72951) discovered from candidate 1
Mon Jul 30 09:36:07 2018 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 30 09:36:07 2018 [Z0][ReM][E]: Req:9584 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:36:09 2018 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: Error requesting vote from follower 0:RPC call timed out and aborted
Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: Vote not granted from follower 1: [one.zone.voterequest] Already voted for another candidate
Mon Jul 30 09:36:10 2018 [Z0][RCM][I]: No leader found, starting new election in 3246ms
Mon Jul 30 09:36:10 2018 [Z0][ReM][D]: Req:8480 UID:0 one.zone.voterequest invoked , 72952, 1, 31958730, 72950
Mon Jul 30 09:36:10 2018 [Z0][ReM][E]: Req:8480 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:36:14 2018 [Z0][ReM][D]: Req:5824 UID:0 one.zone.voterequest invoked , 72953, 1, 31958730, 72950
Mon Jul 30 09:36:14 2018 [Z0][ReM][E]: Req:5824 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Mon Jul 30 09:36:14 2018 [Z0][RCM][I]: Error requesting vote from follower 0:RPC call timed out and aborted
Mon Jul 30 09:36:14 2018 [Z0][RCM][I]: Got vote from follower 1. Total votes: 1
Mon Jul 30 09:36:14 2018 [Z0][RCM][I]: Got majority of votes
Mon Jul 30 09:36:15 2018 [Z0][RCM][I]: Becoming leader of the zone. Last log record: 31958768 last applied record: 31958768
Mon Jul 30 09:36:15 2018 [Z0][RCM][I]: oned is now the leader of the zone
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:2000 UID:0 one.zone.raftstatus invoked
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:2000 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:9600 UID:0 one.vmpool.info invoked , -2, -1, -1, -1
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:9600 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>74<..."
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:1280 UID:0 one.vmpool.info invoked , -2, -1, -1, -1
Mon Jul 30 09:36:15 2018 [Z0][ReM][D]: Req:1280 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>74<..."

It looks like a follower suddenly takes over leadership. We have reviewed the technical setup and cannot find any issue such as network problems. The whole setup worked flawlessly on 5.4.

We have now tried several different values, but the cluster keeps flapping. We can rule out network issues: we have replaced the network cabling and are using bare-metal nodes (no VMs) connected to the same switch.

We do not see any performance bottlenecks in answering API calls.

We have also tried setting XMLRPC_TIMEOUT_MS to "0", but that does not help either.

Are there any other ways to debug this issue, or any advice on how to find correct, working values?

I’d suggest increasing the XMLRPC_TIMEOUT_MS.

@tinova we are now working with a timeout of 8000 for XMLRPC_TIMEOUT_MS, but we are still receiving:

Tue Jul 31 11:10:29 2018 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Tue Jul 31 11:10:32 2018 [Z0][ReM][D]: Req:2160 UID:0 one.zone.raftstatus invoked
Tue Jul 31 11:10:32 2018 [Z0][ReM][D]: Req:2160 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>1</..."
Tue Jul 31 11:10:37 2018 [Z0][RCM][I]: Error requesting vote from follower 0:RPC call timed out and aborted
Tue Jul 31 11:10:37 2018 [Z0][RCM][I]: Vote not granted from follower 2: [one.zone.voterequest] Already voted for another candidate
Tue Jul 31 11:10:37 2018 [Z0][RCM][I]: No leader found, starting new election in 2790ms
Tue Jul 31 11:10:37 2018 [Z0][ReM][D]: Req:7248 UID:0 one.zone.voterequest invoked , 76049, 2, 32006892, 76048
Tue Jul 31 11:10:37 2018 [Z0][ReM][D]: Req:7248 UID:0 one.zone.voterequest result SUCCESS, 76049
Tue Jul 31 11:10:37 2018 [Z0][RCM][I]: oned is set to follower mode 

on our followers.

I think something is stuck in the RAFT algorithm. Our typical RTT between the servers is 0.2 ms:

PING 10.40.0.2 (10.40.0.2) 56(84) bytes of data.
64 bytes from 10.40.0.2: icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from 10.40.0.2: icmp_seq=2 ttl=64 time=0.135 ms
64 bytes from 10.40.0.2: icmp_seq=3 ttl=64 time=0.090 ms
64 bytes from 10.40.0.2: icmp_seq=4 ttl=64 time=0.155 ms
64 bytes from 10.40.0.2: icmp_seq=5 ttl=64 time=0.096 ms
64 bytes from 10.40.0.2: icmp_seq=6 ttl=64 time=0.118 ms
64 bytes from 10.40.0.2: icmp_seq=7 ttl=64 time=0.160 ms
64 bytes from 10.40.0.2: icmp_seq=8 ttl=64 time=0.157 ms
64 bytes from 10.40.0.2: icmp_seq=9 ttl=64 time=0.167 ms
64 bytes from 10.40.0.2: icmp_seq=10 ttl=64 time=0.185 ms
64 bytes from 10.40.0.2: icmp_seq=11 ttl=64 time=0.160 ms
64 bytes from 10.40.0.2: icmp_seq=12 ttl=64 time=0.162 ms
64 bytes from 10.40.0.2: icmp_seq=13 ttl=64 time=0.159 ms
64 bytes from 10.40.0.2: icmp_seq=14 ttl=64 time=0.161 ms
64 bytes from 10.40.0.2: icmp_seq=15 ttl=64 time=0.162 ms

So all in all there seems to be a deeper problem in 5.6. The same setup worked flawlessly with 5.4! We are closely monitoring the RTT and do not see any variation.

I think a timeout of 8000 ms is already pretty huge! What other steps can we take to debug this?

I have now spent the whole day experimenting with all of the variables, but I simply cannot get our cluster into a stable state. Since 5.6 something is wrong with RAFT; leadership keeps bouncing around no matter how high I set the timeouts.

I tried XMLRPC_TIMEOUT_MS up to 16000 ms and ELECTION_TIMEOUT_MS up to 16000 ms. I simply cannot get it stable. :frowning:

I even tried setting XMLRPC_TIMEOUT_MS to 0, which should mean an infinite timeout, but still no luck.

I finally managed to get a stable cluster with the following values:

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 250000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 16000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 16000
]

Broadcasts are sent every 500 ms and the election timeout is set to 16000 ms, a really huge value. I do not know whether changes were made to the RAFT implementation in 5.6, but it is definitely not behaving as it did on 5.4.

Once again, I can rule out any network latency or other network problems.

I think the main change is the XMLRPC_TIMEOUT.

Notes:

  1. Currently heartbeats are sent sequentially, so you need to account for that in the ELECTION timeout; I'd say that 5 s (5000) should work in most cases. This needs to be improved by dedicating a separate thread to the heartbeats.
  2. 16 s for XMLRPC_TIMEOUT is too high, as a failed operation would probably trigger an election (see the back-of-the-envelope check below).
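
As a rough sanity check, and assuming the worst case where each hung follower blocks the sequential heartbeat round for a full XMLRPC_TIMEOUT_MS, you can compare the resulting gap between heartbeats against ELECTION_TIMEOUT_MS (values below are just the ones discussed in this thread):

# Back-of-the-envelope check; the "sequential heartbeats" assumption is note 1 above.
followers            = 2       # servers in the zone minus the leader
broadcast_timeout_ms = 500
xmlrpc_timeout_ms    = 1500
election_timeout_ms  = 5000

# Worst case: every follower times out once during a single heartbeat round.
worst_gap = broadcast_timeout_ms + followers * xmlrpc_timeout_ms
print("worst-case gap between heartbeats: %d ms" % worst_gap)
if worst_gap < election_timeout_ms:
    print("followers should not start spurious elections")
else:
    print("risk of spurious elections: raise ELECTION_TIMEOUT_MS or lower XMLRPC_TIMEOUT_MS")

With XMLRPC_TIMEOUT_MS = 16000 the same calculation gives 32500 ms, far above any reasonable election timeout, which is why such a high value can itself trigger elections (note 2).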

Could you start with:

RAFT = [
    LOG_RETENTION        = 100000,
    LOG_PURGE_TIMEOUT    = 60,
    ELECTION_TIMEOUT_MS  = 10000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500
]

And try to reduce ELECTION_TIMEOUT_MS if it stays stable for your configuration.


Hey @ruben,

I have adjusted the values to the following:

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 100000,
    LOG_PURGE_TIMEOUT    = 60,
    ELECTION_TIMEOUT_MS  = 5000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500
]

It is stable now. Thank you very much! :slight_smile:

But from time to time I get the following on one of the followers:

Wed Aug  1 09:36:01 2018 [Z0][ONE][E]: SQL command was: INSERT INTO logdb (log_index, term, sqlcmd, timestamp, fed_index) VALUES (32057753,76222,'eJyFVmtvozgU/Sv
eT22lPLCxgUgILSXeBjUvAcl2NBohtvFMohLIUNKZzq9f2zzsZLsaRYJz7jn34hj7moiu535AQbhMVmBfvtbpqSxzcFsedgNQZEc2AP+Uu/cBeK2zmpM845ZjWaT1QWhnYfsmLuWPglXpmbOqPJ8EKOt9E3k+7O7A
1p9vaAxu4eDm7VDVBroZ3LizVZx4bjj1oDvmV3fpL6jXyO5YEjdO/IR6nDbADRfpwufWPz4HUz/xP7+8Hb988Xh6G9/+j97F536cpIvVMk1CXh4S04SGMyETd3ypuMF8Eyc0Svm4DHessU7xduxrds7rXvTk/0njm
R/x/GkYP6ab2H+gIl9j7oIuWggxsQkyLOyOVdAN1psWmo4hnt1zN1kl/jzlXg9Zps0HbnFdBVudJ3gmMnpJcHfhP6ViFB6B0Obz3XOpdCWh5ZioEWVBAUQ6tkS5jrl/RZQ22diZQK6oQKOJZAgt2+Z/wWllWVAiOU
DDRq0gS25iOm0qIGRxRfFGkhWxNTEJJJ0sK0okSkDktHFZMNosl+HyId0uYg/xIercFWsjTlYR5Xisk3UQplO6DQOpXDL99coqYuVCs1m7AptWj02MNEw0rHlsR2FHxxOFJ6oOFu+gx5aGbQ2rXIzU2LCjeSZqPMR
QHgJVfWJqmLR4LP91Qhe8aYjN6EfBTG21n46VWljutkbgryFeU6rtRgQNQxqUJCa12eqdqWkAI2Qe2e6QjXZMZijf7NOaRtswXkX/3ea69rtW8ch7gthbi9WUzmPlw44FTqyoD+djd0cdMDtwqkrwXFZsdy5BgWxD
EiTYd3Y8mwjwp/Hr8+ls4WG1ZznRsCVUC0srvwVlUZUMrFlRvRdgyfZZzo7dfRjeRzH4m73WR1axHjThOCt27/fVYfeN6bgRw7dO6lEjzLLXHyzPh0WZxE+X7MLQSzJ4X5XZTsu74lcmTW5G+vKeZy9sGOQHPoFX9
NISs+qNVVe0sWT1Pi8LcNqzojyC1almFT+LHqCCSEFTQawgAXT9KZAXXnN9L5fE9VJ4lMdFMAuXVFsYp+fhAWPj60/5Eu0RGRng9AyuwwYPy7c8MnuEe2TIpMsM+EEMNbHvJmkDqhi5llR1q0ewR+iD6uYHMXxdVj
yHs36G1IzIebrctWFRs/w2ugNPrCzEnc8noGTIm7kB3jD4E6ARNB5mv2Q5rcCSJtGTtvssy3IIsqFjQKfpJq2D3xLNCB1CJjbmTsxPtM4oHLyR8ybQnQOdfwhx23wu9Z7J06R3246NLt1S39IoDldLZSRizptPjE7
63TfIWHVQ2da8mwEaqE+RgcF/UF6Nu38B3SDLRA==',0,-1), error 1062 : Duplicate entry '32057753' for key 'PRIMARY'
Wed Aug  1 09:36:01 2018 [Z0][DBM][E]: Duplicated log record

Yes, this happens because a previous replication API call timed out and was cancelled by the client; the server may not cancel the request, so two replication requests end up trying to write the same record. We are going to change INSERT to REPLACE in the follower replication logic when inserting a record (with the right index and term). This way it should be able to deal with timeouts.
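
To illustrate the idea (this is only a toy sketch with SQLite standing in for the real MySQL backend, not the actual oned code): a plain INSERT fails with a duplicate-key error when a timed-out replication request is retried, while a REPLACE-style write is idempotent and simply overwrites the row with the same log index.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logdb (log_index INTEGER PRIMARY KEY, term INTEGER, sqlcmd TEXT)")

record = (32057753, 76222, "<replicated SQL command>")

db.execute("INSERT INTO logdb VALUES (?, ?, ?)", record)
try:
    # A duplicated replication request hits the primary key constraint, as in the log above.
    db.execute("INSERT INTO logdb VALUES (?, ?, ?)", record)
except sqlite3.IntegrityError as err:
    print("plain INSERT fails on the retry:", err)

# INSERT OR REPLACE (REPLACE in MySQL) overwrites the row with the same index instead,
# so a timed-out and retried replication call no longer produces an error.
db.execute("INSERT OR REPLACE INTO logdb VALUES (?, ?, ?)", record)
print("row count is still", db.execute("SELECT COUNT(*) FROM logdb").fetchone()[0])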

Thanks for clarifying, Ruben! :slight_smile: