Error with leader election in OpenNebula 5.4 with HA

Hi,

I managed to install and configure OpenNebula 5.4 with built-in HA, but after a service restart it stopped working. The error is:
Fri Jul 28 13:57:27 2017 [Z0][RCM][I]: Error requesting vote from follower 0:libcurl failed to execute the HTTP POST transaction, explaining: Operation timed out after 3003 milliseconds with 0 bytes received
Fri Jul 28 13:57:27 2017 [Z0][RCM][I]: Error requesting vote from follower 1:libcurl failed to execute the HTTP POST transaction, explaining: Failed to connect to laura2.gfs port 2633: Connection refused

The ‘onezone show 0’ command takes a long time to complete, but it eventually returns:
root@laura:~# onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS                                                                    
ID NAME            ENDPOINT                                                       
 0 laura3          http://laura3.gfs:2633/RPC2
 1 laura2          http://laura2.gfs:2633/RPC2
 2 laura           http://laura.gfs:2633/RPC2

HA & FEDERATION SYNC STATUS                                                     
ID NAME            STATE      TERM       INDEX      COMMIT     VOTE  FED_INDEX 
 0 laura3          error      -          -          -          -
 1 laura2          error      -          -          -          -
 2 laura           candidate  2744       483373     483373     2     -1

ZONE TEMPLATE                                                                   
ENDPOINT="http://localhost:2633/RPC2"

I have 3 servers configured with MariaDB on Debian Stretch.

PS: added a space after http:// due to forum restrictions

This behavior is normal when the servers cannot talk to each other. onezone show takes longer because the requests to laura2 and laura3 are timing out. You need to make sure that the ENDPOINT URLs are accessible from all servers (http://laura3.gfs:2633/RPC2…)
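A minimal sketch of that reachability check, assuming Python 3 is available on each server. The endpoint list is copied from the onezone output above; the helper names (host_port, tcp_reachable, check_all) are hypothetical and not part of OpenNebula.

```python
import socket
from urllib.parse import urlparse

# Endpoints as listed by `onezone show 0` in this thread.
ENDPOINTS = [
    "http://laura3.gfs:2633/RPC2",
    "http://laura2.gfs:2633/RPC2",
    "http://laura.gfs:2633/RPC2",
]


def host_port(endpoint):
    """Split an endpoint URL into (host, port); fall back to port 80 if none is given."""
    parsed = urlparse(endpoint)
    return parsed.hostname, parsed.port or 80


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_all(endpoints, timeout=3.0):
    """Map each endpoint URL to True/False depending on TCP reachability."""
    return {ep: tcp_reachable(*host_port(ep), timeout=timeout) for ep in endpoints}
```

Running check_all(ENDPOINTS) on each of the three servers shows which pairs can connect. A successful TCP connection only proves that the port accepts connections; it does not prove that oned answers XML-RPC requests.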

There is no firewall between them but the behavior is not consistent, sometimes I can connect, sometimes not:

root@laura:~# curl laura.gfs:2633/RPC2; echo ""; curl laura2.gfs:2633/RPC2; echo ""; curl laura3.gfs:2633/RPC2; echo "";
<HTML><HEAD><TITLE>Error 405</TITLE></HEAD><BODY><H1>Error 405</H1><P>POST is the only HTTP method this server understands</P><p><HR><b><i><a href="http://xmlrpc-c.sourceforge.net">ABYSS Web Server for XML-RPC For C/C++</a></i></b> version 1.40.0<br></p></BODY></HTML>
curl: (7) Failed to connect to laura2.gfs port 2633: Connection refused

curl: (56) Recv failure: Connection reset by peer

The error above happens on all 3 hosts. When I initially configured the servers they worked, but after a restart they stopped working.

This network error could be due to an interface misconfiguration or some other networking issue… Please also check that the oned processes are actually running.

Interfaces are fine: I have other services (glusterfs, ssh,…) running fine on them, and oned is running on all hosts.

I don’t think it’s a network error; sometimes I can’t access them on localhost either:

root@laura3:~# curl localhost:2633/RPC2
curl: (56) Recv failure: Connection reset by peer

root@laura3:~# ps aux | grep oned
root      2808  0.0  0.0  13084  1036 pts/1    S+   15:52   0:00 grep oned
oneadmin 19246  0.2  0.0 3003944 34348 ?       Ssl  15:43   0:01 /usr/bin/oned -f

I am not really sure. The error in your setup occurs because the oned processes cannot talk to each other. If there is no firewall and there are no connectivity issues, then maybe restarting them helps. Check that the port is listening with netstat or ss.

I already restarted them, but that didn’t fix it. oned is listening on the port:

root@laura:~# netstat -natup | grep oned
tcp 16 0 0.0.0.0:2633 0.0.0.0:* LISTEN 47408/oned

root@laura2:~# netstat -natup | grep oned
tcp 0 0 0.0.0.0:2633 0.0.0.0:* LISTEN 18276/oned

root@laura3:~# netstat -natup | grep oned
tcp 16 0 0.0.0.0:2633 0.0.0.0:* LISTEN 19246/oned
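One detail worth a second look in this output is the Recv-Q column (the first number after tcp). On Linux, for a socket in LISTEN state that column reports connections queued for the listener rather than unread bytes, so a non-zero value can mean the process is not accepting connections quickly enough. Whether that is what is happening to oned here is an assumption, not something the thread confirms. A small sketch that pulls the value out of the lines above (the function names are hypothetical):

```python
def pending_on_listen(netstat_line):
    """Return Recv-Q for a LISTEN line of `netstat -natup` output, or None for other lines."""
    fields = netstat_line.split()
    # Expected columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    if len(fields) < 6 or fields[5] != "LISTEN":
        return None
    return int(fields[1])


# Lines copied from the netstat output above.
SAMPLES = {
    "laura": "tcp 16 0 0.0.0.0:2633 0.0.0.0:* LISTEN 47408/oned",
    "laura2": "tcp 0 0 0.0.0.0:2633 0.0.0.0:* LISTEN 18276/oned",
    "laura3": "tcp 16 0 0.0.0.0:2633 0.0.0.0:* LISTEN 19246/oned",
}


def backlog_report(samples):
    """Map each host name to the Recv-Q value of its listening socket."""
    return {host: pending_on_listen(line) for host, line in samples.items()}
```

With the lines from this thread, backlog_report(SAMPLES) gives 16 for laura and laura3 and 0 for laura2.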

Are other commands working?

No, connection reset by peer on all commands and all hosts

Sorry Andre, I’m out of ideas. Maybe double-check the firewall, that DNS resolves, and that the requests are received on the target host with tcpdump…

Good luck.

Check your network setup. Using LACP bonds? MTU set correctly for interfaces?

As Ruben said, try digging a bit deeper with tcpdump. This really helped us find several smaller problems with our network configuration.

Yes, I’m using an LACP bond. MTU is 1500 on all hosts. I will check with tcpdump whether any errors show up.

I could not find any obvious problem with tcpdump.

I still don’t think it’s network related, because when I first configured the hosts everything was working fine; the problems began after the restart of oned. And sometimes I can’t access it even on localhost, which I think points to a problem in oned.

Any other ideas?

Andre, I was thinking that this may be your problem:

# Executed when a server transits from follower->leader
RAFT_LEADER_HOOK = [
     COMMAND = "raft/vip.sh",
     ARGUMENTS = "leader eth0 10.3.3.2/24"
]

# Executed when a server transits from leader->follower
RAFT_FOLLOWER_HOOK = [
    COMMAND = "raft/follower.sh",
    ARGUMENTS = "follower eth0 10.3.3.2/24"
]

Do you have the hooks enabled? Could you take a look at the hooks and see whether they make sense, or whether there is any potential problem with them in your setup? They are pretty straightforward.

Yes, I have them enabled:

# Executed when a server transits from follower->leader
RAFT_LEADER_HOOK = [
    COMMAND = "raft/vip.sh",
    ARGUMENTS = "leader brvirt xxx.yyy.zzz.139/27"
]

# Executed when a server transits from leader->follower
RAFT_FOLLOWER_HOOK = [
    COMMAND = "raft/vip.sh",
    ARGUMENTS = "follower brvirt xxx.yyy.zzz.139/27"
]

Another question: the federation mode is STANDALONE. Is that correct (changing the SERVER_ID on each host)?

FEDERATION = [
    MODE = "STANDALONE",
    ZONE_ID = 0,
    SERVER_ID = 2,
    MASTER_ONED = ""
]

Could you check the scripts? Maybe they are not working in your setup and are messing everything up.

The FEDERATION stanza is correct.

What should happen if two out of three servers go down? The only one left should be elected leader, right? I manually stopped oned on the other hosts and changed their IPs so that any check the remaining server does should time out (simulating a server crash). The server with oned running is still trying to connect to the others and is still in the ‘candidate’ state, which seems wrong to me. Ideas?

No, that’s the expected behavior. In order to become leader a server needs a majority of votes, two out of three in your setup. With just one server the cluster cannot work (note that this prevents split-brain conditions).

Makes sense…

I have configured the hostnames in /etc/hosts and not in the DNS server. Do you think this can be a problem?
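The majority rule can be written down as simple arithmetic. This is a sketch of the general Raft quorum calculation with hypothetical function names, not code taken from oned:

```python
def majority(n_servers):
    """Smallest number of votes that is more than half of the zone's servers."""
    return n_servers // 2 + 1


def can_elect_leader(n_servers, n_reachable):
    """True if the reachable servers (counting the candidate itself) form a majority."""
    return n_reachable >= majority(n_servers)
```

For the three-server zone in this thread, majority(3) is 2, so can_elect_leader(3, 1) is False: a single surviving server stays in the candidate state, while can_elect_leader(3, 2) is True.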

And what does this mean?

one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's log is outdated
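For context on that message: in the Raft algorithm a server refuses to vote for a candidate whose log is less up to date than its own, judged by the term and index of the last log entry. The sketch below shows that standard comparison. That oned implements exactly this check is an assumption, and the function and parameter names are hypothetical.

```python
def candidate_log_is_current(cand_last_term, cand_last_index,
                             voter_last_term, voter_last_index):
    """Raft rule: the candidate's log must be at least as up to date as the voter's."""
    if cand_last_term != voter_last_term:
        # The log whose last entry has the higher term is more up to date.
        return cand_last_term > voter_last_term
    # Same last term: the longer log (higher last index) is more up to date.
    return cand_last_index >= voter_last_index
```

Under this rule, a candidate whose last index is behind a voter's last index in the same term would be rejected. Comparing the TERM and INDEX columns of onezone show 0 across the servers is one way to see which log is behind.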