Opennebula.service unecpectectedly dies

Have you tested the patches? I have applied them an my HA cluster remains stuck.

It seems like it is not possible to select a leader. All nodes are trying to select a leader but only get into candidate state.

They want through the automatic testing, but I’ll take a closer look.

This is my status:

HA & FEDERATION SYNC STATUS
ID NAME            STATE      TERM       INDEX      COMMIT     VOTE  FED_INDEX
 0 server-0        follower   60399      274158     274158     -1    -1
 1 server-1        candidate  60399      274132     274131     1     -1
 2 server-2        follower   60399      274158     274158     2     -1

When I switch back to the old code it works without a problem.

Sometimes a leader ist elected but shortly after looses leader status:

Fri Aug  4 13:04:09 2017 [Z0][RCM][I]: Got majority of votes
Fri Aug  4 13:04:12 2017 [Z0][ReM][D]: Req:4080 UID:0 one.zone.voterequest invoked , 60410, 2, 274158, 60396
Fri Aug  4 13:04:12 2017 [Z0][ReM][D]: Req:4928 UID:0 one.zone.voterequest invoked , 60410, 1, 274132, 60396
Fri Aug  4 13:04:17 2017 [Z0][RCM][I]: Becoming leader of the zone. Last log record: 274158 last applied record: 274158
Fri Aug  4 13:04:17 2017 [Z0][ReM][I]: New term (60410) discovered from candidate 1
Fri Aug  4 13:04:17 2017 [Z0][ReM][I]: New term (60410) discovered from candidate 2
Fri Aug  4 13:04:17 2017 [Z0][RCM][I]: oned is now the leader of the zone
Fri Aug  4 13:04:17 2017 [Z0][RCM][I]: Follower 2 term (60410) is higher than current (60409)
Fri Aug  4 13:04:17 2017 [Z0][RCM][I]: Follower 1 term (60410) is higher than current (60409)
Fri Aug  4 13:04:19 2017 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Fri Aug  4 13:04:19 2017 [Z0][InM][D]: Monitoring datastore DONOTUSE_IMAGE (1)
Fri Aug  4 13:04:19 2017 [Z0][InM][D]: Monitoring datastore DONOTUSE_FILES (2)
Fri Aug  4 13:04:19 2017 [Z0][InM][D]: Monitoring datastore ceph_system (100)
Fri Aug  4 13:04:19 2017 [Z0][InM][D]: Monitoring datastore ceph_image (101)
Fri Aug  4 13:04:19 2017 [Z0][InM][D]: Monitoring host virt01 (0)
Fri Aug  4 13:04:21 2017 [Z0][HKM][D]: Message received: EXECUTE SUCCESS -1 RAFT_LEADER_HOOK: ARPING xx.xx.xx.xx from xx.xx.xx.xx bond0

Fri Aug  4 13:04:21 2017 [Z0][HKM][D]: Message received: Sent 5 probes (5 broadcast(s))

Fri Aug  4 13:04:21 2017 [Z0][HKM][D]: Message received: Received 0 response(s)

Fri Aug  4 13:04:22 2017 [Z0][RCM][I]: Replication thread stopped
Fri Aug  4 13:04:22 2017 [Z0][RCM][I]: oned is set to follower mode
Fri Aug  4 13:04:22 2017 [Z0][RCM][I]: Replication thread stopped
Fri Aug  4 13:04:22 2017 [Z0][ReM][E]: Req:4928 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:04:22 2017 [Z0][HKM][D]: Message received: EXECUTE SUCCESS -1 RAFT_FOLLOWER_HOOK:

Fri Aug  4 13:04:24 2017 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Fri Aug  4 13:04:25 2017 [Z0][ReM][D]: Req:4416 UID:0 one.zone.voterequest invoked , 60411, 1, 274132, 60396
Fri Aug  4 13:04:29 2017 [Z0][RCM][I]: oned is set to follower mode
Fri Aug  4 13:04:29 2017 [Z0][ReM][E]: Req:4080 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Already voted for another candidate
Fri Aug  4 13:04:33 2017 [Z0][ReM][D]: Req:9120 UID:0 one.zone.raftstatus invoked
Fri Aug  4 13:04:36 2017 [Z0][RCM][I]: oned is set to follower mode
Fri Aug  4 13:04:36 2017 [Z0][RCM][I]: Replication thread stopped
Fri Aug  4 13:04:36 2017 [Z0][ReM][D]: Req:8608 UID:0 one.zone.voterequest invoked , 60412, 2, 274158, 60396
Fri Aug  4 13:04:40 2017 [Z0][RCM][I]: oned is set to follower mode
Fri Aug  4 13:04:40 2017 [Z0][RCM][I]: Replication thread stopped
Fri Aug  4 13:04:40 2017 [Z0][ReM][E]: Req:4416 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:04:40 2017 [Z0][ReM][I]: New term (60412) discovered from candidate 2
Fri Aug  4 13:04:40 2017 [Z0][ReM][D]: Req:9120 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>0</..."
Fri Aug  4 13:04:40 2017 [Z0][RCM][I]: Vote not granted from follower 1: [one.zone.voterequest] Already voted for another candidate
Fri Aug  4 13:04:40 2017 [Z0][RCM][I]: Vote not granted from follower 2: [one.zone.voterequest] Candidate's term is outdated
Fri Aug  4 13:04:40 2017 [Z0][RCM][I]: Follower 2 is in term 60412 current term is 60411. Turning into follower
Fri Aug  4 13:04:44 2017 [Z0][RCM][I]: oned is set to follower mode
Fri Aug  4 13:04:44 2017 [Z0][ReM][E]: Req:8608 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:04:50 2017 [Z0][RCM][I]: oned is set to follower mode
Fri Aug  4 13:04:50 2017 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Fri Aug  4 13:04:54 2017 [Z0][RCM][I]: Vote not granted from follower 1: [one.zone.voterequest] Already voted for another candidate
Fri Aug  4 13:04:54 2017 [Z0][ReM][D]: Req:2176 UID:0 one.zone.voterequest invoked , 60413, 1, 274132, 60396
Fri Aug  4 13:04:54 2017 [Z0][ReM][E]: Req:2176 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:05:01 2017 [Z0][RCM][I]: Vote not granted from follower 2: [one.zone.voterequest] Already voted for another candidate
Fri Aug  4 13:05:01 2017 [Z0][ReM][D]: Req:6096 UID:0 one.zone.voterequest invoked , 60413, 2, 274158, 60396
Fri Aug  4 13:05:01 2017 [Z0][ReM][E]: Req:6096 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:05:06 2017 [Z0][ReM][D]: Req:8928 UID:0 one.zone.voterequest invoked , 60414, 2, 274158, 60396
Fri Aug  4 13:05:06 2017 [Z0][ReM][E]: Req:8928 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:05:10 2017 [Z0][ReM][D]: Req:7760 UID:0 one.zone.raftstatus invoked
Fri Aug  4 13:05:10 2017 [Z0][ReM][D]: Req:7760 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>0</..."
Fri Aug  4 13:05:10 2017 [Z0][RCM][I]: Vote not granted from follower 1: [one.zone.voterequest] Already voted for another candidate
Fri Aug  4 13:05:10 2017 [Z0][ReM][D]: Req:7920 UID:0 one.zone.voterequest invoked , 60414, 1, 274132, 60396
Fri Aug  4 13:05:10 2017 [Z0][ReM][E]: Req:7920 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
Fri Aug  4 13:05:10 2017 [Z0][RCM][I]: Vote not granted from follower 2: [one.zone.voterequest] Already voted for another candidate

Maybe you could send me the logs for the 3 servers for the situation you are showing…

Been playing a while with the patch, Cannot get in that situation. It recovers from failure conditions.

Just messaged you directly with logfiles. Problem only exists with raft stability patch.

Just for the records. The problem described by tobx, was because of a large number of log entries.

The length of the log determines how many changes the followers may lay behind the cluster. The actual length is designed to tolerate a couple of days in a medium-low activity cluster. This is probably too much.

In any case, the DB configuration was missing a couple of index to optimized two queries. So I suggest to:

1- Reduce to a half or so the log retention. Note that this may require a bit of tunning based on the activity of your cluster, number of VMs etc… The rule of thumb would be to keep like 1/10th of the records produced in a day to have like ~2hours of automatic recovery timeframe. This can be done simply by writing down the log index (onezone show) with in a day.

2- Create the following indexes in leader + followers. This will be automatically created in 5.4.1

create index fed_index_idx on logdb (fed_index);
create index timestamp_idx on logdb (timestamp);

Cheers