I managed to install and configure OpenNebula 5.4 with the built-in HA, but after a service restart it stopped working. The error is:

Fri Jul 28 13:57:27 2017 [Z0][RCM][I]: Error requesting vote from follower 0: libcurl failed to execute the HTTP POST transaction, explaining: Operation timed out after 3003 milliseconds with 0 bytes received
Fri Jul 28 13:57:27 2017 [Z0][RCM][I]: Error requesting vote from follower 1: libcurl failed to execute the HTTP POST transaction, explaining: Failed to connect to laura2.gfs port 2633: Connection refused
The 'onezone show 0' command takes a long time to complete, but it eventually returns:
root@laura:~# onezone show 0
ZONE 0 INFORMATION
ID   : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME     ENDPOINT
 0 laura3   http://laura3.gfs:2633/RPC2
 1 laura2   http://laura2.gfs:2633/RPC2
 2 laura    http://laura.gfs:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME     STATE      TERM     INDEX    COMMIT   VOTE  FED_INDEX
 0 laura3   error      -        -        -        -     -
 1 laura2   error      -        -        -        -     -
 2 laura    candidate  2744     483373   483373   2     -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
I have 3 servers configured with MariaDB on Debian Stretch.
This behavior is normal when the servers cannot talk to each other. onezone show is taking longer because laura2 and laura3 are timing out. You need to make sure that the ENDPOINT URLs are accessible from all servers (http://laura3.gfs:2633/RPC2, …).
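For example, a quick check you could run from each of the three servers (just a sketch, with the hostnames taken from your onezone output):

for ep in laura.gfs laura2.gfs laura3.gfs; do
  # a plain GET returns the "Error 405 ... POST is the only HTTP method" page when oned is reachable
  curl -sS --max-time 5 "http://$ep:2633/RPC2" > /dev/null && echo "$ep: oned reachable" || echo "$ep: NOT reachable"
done

Note this only proves that oned is listening on that endpoint; the actual HA traffic is an XML-RPC POST to the same port.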
There is no firewall between them, but the behavior is not consistent; sometimes I can connect and sometimes not:
root@laura:~# curl laura.gfs:2633/RPC2; echo ""; curl laura2.gfs:2633/RPC2; echo ""; curl laura3.gfs:2633/RPC2; echo "";
<HTML><HEAD><TITLE>Error 405</TITLE></HEAD><BODY><H1>Error 405</H1><P>POST is the only HTTP method this server understands</P><p><HR><b><i><a href="http://xmlrpc-c.sourceforge.net">ABYSS Web Server for XML-RPC For C/C++</a></i></b> version 1.40.0<br></p></BODY></HTML>
curl: (7) Failed to connect to laura2.gfs port 2633: Connection refused
curl: (56) Recv failure: Connection reset by peer
The errors above happen on all 3 hosts. When I initially configured the servers they worked, but after a restart they stopped working.
This network error could be caused by an interface misconfiguration or some other networking issue… Please also check that the oned daemons are actually running.
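For example (a sketch; the service name may vary slightly depending on the packages you installed), on each node:

systemctl status opennebula        # is the oned service active?
ps aux | grep '[o]ned'             # is the oned process itself up?
tail -n 50 /var/log/one/oned.log   # recent errors logged by oned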
I am not really sure; the error in your setup is because the oned daemons cannot talk to each other. If there is no firewall and no connectivity issue, then maybe restarting them helps. Check that the port is listening with netstat or ss.
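A minimal way to check the listening socket (assuming the default port 2633 from your endpoints):

ss -tlnp | grep 2633
# or, if you only have net-tools installed:
netstat -tlnp | grep 2633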
I could not find any obvious problem with tcpdump.
I still don't think it's network related: when I first configured the hosts everything was working fine, and it was only after restarting oned that the problems began. Sometimes I can't access it even on localhost, which makes me think the problem is in oned itself.
Andre, I was thinking that this may be your problem:
# Executed when a server transits from follower->leader
RAFT_LEADER_HOOK = [
    COMMAND = "raft/vip.sh",
    ARGUMENTS = "leader eth0 10.3.3.2/24"
]

# Executed when a server transits from leader->follower
RAFT_FOLLOWER_HOOK = [
    COMMAND = "raft/follower.sh",
    ARGUMENTS = "follower eth0 10.3.3.2/24"
]
Do you have the hooks enabled? Could you take a look at the hook scripts and see if they make sense or if there is any potential problem in your setup? They are pretty straightforward.
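Something else you could verify on each node (a sketch; eth0 and 10.3.3.2/24 are just the values from the sample above, and the hook scripts normally live under /var/lib/one/remotes/hooks/raft/ when COMMAND is a relative path):

ip addr show dev eth0 | grep '10.3.3.2'   # the floating IP should be present only on the current leader
ls -l /var/lib/one/remotes/hooks/raft/    # vip.sh and follower.sh should exist and be executable by oneadmin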
What should happen if two out of three servers go down? The only one left should be elected leader, right? I manually stopped oned on the other hosts and changed their IPs so that any check the remaining server does will time out (simulating a server crash). The server with oned still running keeps trying to connect to the others and stays in 'candidate' state, which seems wrong to me. Ideas?
No, that's the expected behavior. In order to become leader a server needs a majority of votes, two out of three in your setup. With just one server the cluster cannot work (note that this is what prevents split-brain conditions).
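For reference, the quorum is floor(N/2) + 1 votes: with N = 3 that is 2 votes, so a lone surviving server can never win an election and stays in 'candidate' state; with N = 5 you could lose two servers and still elect a leader.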