I managed to install and configure OpenNebula 5.4 with the built-in HA, but after a service restart it stopped working. The error is:

Fri Jul 28 13:57:27 2017 [Z0][RCM][I]: Error requesting vote from follower 0: libcurl failed to execute the HTTP POST transaction, explaining: Operation timed out after 3003 milliseconds with 0 bytes received
Fri Jul 28 13:57:27 2017 [Z0][RCM][I]: Error requesting vote from follower 1: libcurl failed to execute the HTTP POST transaction, explaining: Failed to connect to laura2.gfs port 2633: Connection refused
The 'onezone show 0' command takes a long time to complete, but it eventually returns:
root@laura:~# onezone show 0
ZONE 0 INFORMATION
ID   : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME     ENDPOINT
 0 laura3   http://laura3.gfs:2633/RPC2
 1 laura2   http://laura2.gfs:2633/RPC2
 2 laura    http://laura.gfs:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME     STATE      TERM     INDEX    COMMIT   VOTE  FED_INDEX
 0 laura3   error      -        -        -        -     -
 1 laura2   error      -        -        -        -     -
 2 laura    candidate  2744     483373   483373   2     -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
I have 3 servers configured with MariaDB on Debian Stretch.
This behavior is normal when the servers cannot talk to each other. onezone show is taking longer because laura2 and laura3 are timing out. You need to make sure that the ENDPOINT URLs are accessible from all servers (http://laura3.gfs:2633/RPC2, …).
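For example, a quick check you could run from each of the three servers (just a sketch, with the hostnames taken from your onezone output):

for ep in laura.gfs laura2.gfs laura3.gfs; do
  # a plain GET returns the "Error 405 ... POST is the only HTTP method" page when oned is reachable
  curl -sS --max-time 5 "http://$ep:2633/RPC2" > /dev/null && echo "$ep: oned reachable" || echo "$ep: NOT reachable"
done

Note this only proves that oned is listening on that endpoint; the actual HA traffic is an XML-RPC POST to the same port.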
There is no firewall between them, but the behavior is not consistent; sometimes I can connect and sometimes not:
root@laura:~# curl laura.gfs:2633/RPC2; echo ""; curl laura2.gfs:2633/RPC2; echo ""; curl laura3.gfs:2633/RPC2; echo "";
<HTML><HEAD><TITLE>Error 405</TITLE></HEAD><BODY><H1>Error 405</H1><P>POST is the only HTTP method this server understands</P><p><HR><b><i><a href="http://xmlrpc-c.sourceforge.net">ABYSS Web Server for XML-RPC For C/C++</a></i></b> version 1.40.0<br></p></BODY></HTML>
curl: (7) Failed to connect to laura2.gfs port 2633: Connection refused
curl: (56) Recv failure: Connection reset by peer
The errors above happen on all 3 hosts. When I initially configured the servers they worked, but after a restart they stopped working.
This network error could be caused by an interface misconfiguration or some other networking issue… Please also check that the oned daemons are actually running.
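For example (a sketch; the service name may vary slightly depending on the packages you installed), on each node:

systemctl status opennebula        # is the oned service active?
ps aux | grep '[o]ned'             # is the oned process itself up?
tail -n 50 /var/log/one/oned.log   # recent errors logged by oned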
I am not really sure; the error in your setup is because the oned daemons cannot talk to each other. If there is no firewall and no connectivity issue, then maybe restarting them helps. Check that the port is listening with netstat or ss.
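A minimal way to check the listening socket (assuming the default port 2633 from your endpoints):

ss -tlnp | grep 2633
# or, if you only have net-tools installed:
netstat -tlnp | grep 2633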
I could not find any obvious problem with tcpdump.
I still don't think it's network related: when I first configured the hosts everything was working fine, and it was only after restarting oned that the problems began. Sometimes I can't access it even on localhost, which makes me think the problem is in oned itself.
Andre, I was thinking that this may be your problem:
# Executed when a server transits from follower->leader
RAFT_LEADER_HOOK = [
    COMMAND = "raft/vip.sh",
    ARGUMENTS = "leader eth0 10.3.3.2/24"
]

# Executed when a server transits from leader->follower
RAFT_FOLLOWER_HOOK = [
    COMMAND = "raft/follower.sh",
    ARGUMENTS = "follower eth0 10.3.3.2/24"
]
Do you have the hooks enabled? Could you take a look at the hook scripts and see if they make sense or if there is any potential problem in your setup? They are pretty straightforward.
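Something else you could verify on each node (a sketch; eth0 and 10.3.3.2/24 are just the values from the sample above, and the hook scripts normally live under /var/lib/one/remotes/hooks/raft/ when COMMAND is a relative path):

ip addr show dev eth0 | grep '10.3.3.2'   # the floating IP should be present only on the current leader
ls -l /var/lib/one/remotes/hooks/raft/    # vip.sh and follower.sh should exist and be executable by oneadmin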
What should happen if two out of three servers go down? The only one left should be elected leader, right? I manually stopped oned on the other hosts and changed their IPs so that any check the remaining server does will time out (simulating a server crash). The server with oned still running keeps trying to connect to the others and stays in 'candidate' state, which seems wrong to me. Ideas?
No, that's the expected behavior. In order to become leader a server needs a majority of votes, two out of three in your setup. With just one server the cluster cannot work (note that this is what prevents split-brain conditions).
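For reference, the quorum is floor(N/2) + 1 votes: with N = 3 that is 2 votes, so a lone surviving server can never win an election and stays in 'candidate' state; with N = 5 you could lose two servers and still elect a leader.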