Frontend HA RAFT problems

Hello, I set up frontend HA using the new RAFT implementation, but today it broke. I see these error messages in the logs:

These messages also show up after a cluster rebuild. The cluster breaks because the heartbeat from the leader times out, but it was not possible to promote the candidate to leader, apparently because its log is older than the one on the original leader.

Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214376 loaded incorrectly. Record index: 214377 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214377 loaded incorrectly. Record index: 214378 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214378 loaded incorrectly. Record index: 214379 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214379 loaded incorrectly. Record index: 214380 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214380 loaded incorrectly. Record index: 214381 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214381 loaded incorrectly. Record index: 214382 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214382 loaded incorrectly. Record index: 214383 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214383 loaded incorrectly. Record index: 214384 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214384 loaded incorrectly. Record index: 214385 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214385 loaded incorrectly. Record index: 214386 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214386 loaded incorrectly. Record index: 214387 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214387 loaded incorrectly. Record index: 214388 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214388 loaded incorrectly. Record index: 214389 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214389 loaded incorrectly. Record index: 214390 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214390 loaded incorrectly. Record index: 214391 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214391 loaded incorrectly. Record index: 214392 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214392 loaded incorrectly. Record index: 214393 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214393 loaded incorrectly. Record index: 214394 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214394 loaded incorrectly. Record index: 214395 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:14 2017 [Z0][DBM][E]: Log record 214395 loaded incorrectly. Record index: 214396 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214396 loaded incorrectly. Record index: 214397 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214397 loaded incorrectly. Record index: 214398 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214398 loaded incorrectly. Record index: 214399 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214399 loaded incorrectly. Record index: 214400 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214400 loaded incorrectly. Record index: 214401 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214401 loaded incorrectly. Record index: 214402 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214402 loaded incorrectly. Record index: 214403 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214403 loaded incorrectly. Record index: 214404 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214404 loaded incorrectly. Record index: 214405 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214405 loaded incorrectly. Record index: 214406 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214406 loaded incorrectly. Record index: 214407 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214407 loaded incorrectly. Record index: 214408 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214408 loaded incorrectly. Record index: 214409 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214409 loaded incorrectly. Record index: 214410 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:13:15 2017 [Z0][DBM][E]: Log record 214410 loaded incorrectly. Record index: 214411 fed. index: 0 sql command: . Operation return code: 0
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][VMM][D]: VM 12076 successfully monitored: STATE=a CPU=11.02 MEMORY=6291456 NETRX=22408896196 NETTX=36944377281 DISKRDBYTES=6077210624 DISKWRBYTES=208243122176 DISKRDIOPS=328536 DISKWRIOPS=16799112
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][VMM][D]: VM 12489 successfully monitored: STATE=a CPU=0.0 MEMORY=1112768 NETRX=300076514 NETTX=61397844 DISKRDBYTES=137257984 DISKWRBYTES=9631357952 DISKRDIOPS=6900 DISKWRIOPS=1806029
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][VMM][D]: VM 12517 successfully monitored: STATE=a CPU=0.0 MEMORY=604396 NETRX=116811657 NETTX=9502814 DISKRDBYTES=288135452 DISKWRBYTES=1499089920 DISKRDIOPS=9344 DISKWRIOPS=237838
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][VMM][D]: VM 12518 successfully monitored: STATE=a CPU=0.0 MEMORY=2097152 NETRX=110114539 NETTX=52435106 DISKRDBYTES=472254808 DISKWRBYTES=630342656 DISKRDIOPS=24716 DISKWRIOPS=21192
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][VMM][D]: VM 12519 successfully monitored: STATE=a CPU=0.0 MEMORY=2097152 NETRX=80919107 NETTX=24123289 DISKRDBYTES=472140120 DISKWRBYTES=632419328 DISKRDIOPS=24700 DISKWRIOPS=21357
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower
Mon Jul 31 16:23:13 2017 [Z0][DBM][E]: Tried to modify DB being a follower

Error connecting, but the host is running:

Error requesting vote from follower 2:HTTP POST to URL 'http://192.168.2.53:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received

Hi Kristian

Let's see if I can understand the error.

  1. The error accessing the log records should not happen, but oned should be able to deal with it. Is the DB OK? I mean, you could run select * from logdb where log_index = 214376 and see if the record is consistent in the DB.

  2. Could you send the logs of the servers? The messages around the failure are probably enough.

  3. We've found a bug that makes oned abort (SIGABRT), but it seems your oned processes are running. Did you restart them? Did it solve the problem?
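
The check in point 1 can be widened to the whole range of records reported in the log (214376 to 214411). The following is only a sketch: `logdb_range_query` is a hypothetical helper, and the MySQL user and database name in the usage comment are assumptions that depend on the DB section of your oned.conf.

```shell
#!/bin/bash
# Build the SQL that lists a range of replicated log records.
# Assumption: the records live in the logdb table, keyed by log_index,
# as in the query suggested above.
logdb_range_query() {
    local first="$1" last="$2"
    printf 'SELECT * FROM logdb WHERE log_index BETWEEN %d AND %d ORDER BY log_index;\n' \
        "$first" "$last"
}

# Usage (hypothetical credentials and database name):
#   logdb_range_query 214376 214411 | mysql -u oneadmin -p opennebula
```

Running the same query on every server and comparing the output should show whether the records differ between the nodes.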

I rebuilt it and tried restarting the machine running the leader. It broke, and the candidate could not contact the remaining host. So I restarted all 3 nodes and the cluster formed again, but after a while the leader crashed, and the candidate can't get votes:

Mon Jul 31 17:01:12 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused
Mon Jul 31 17:01:15 2017 [Z0][RCM][I]: Error requesting vote from follower 1:HTTP POST to URL 'http://192.168.2.52:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received
Mon Jul 31 17:01:17 2017 [Z0][ReM][D]: Req:4704 UID:0 one.zone.voterequest invoked , 533, 1, 221946, 532
Mon Jul 31 17:01:17 2017 [Z0][ReM][D]: Req:4704 UID:0 one.zone.voterequest result SUCCESS, 533
Mon Jul 31 17:01:18 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused
Mon Jul 31 17:01:21 2017 [Z0][RCM][I]: Error requesting vote from follower 1:HTTP POST to URL 'http://192.168.2.52:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received
Mon Jul 31 17:01:24 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused

192.168.2.51 is down, which is expected,
but 192.168.2.52 is up.

Maybe it has something to do with the hook? Could you check that the network
is OK?

The network is OK and the nodes can see and ping each other. I also tried curl against the RPC API and it was working.

Do you have the HA hooks enabled?

# Executed when a server transits from follower->leader
RAFT_LEADER_HOOK = [
     COMMAND = "raft/vip.sh",
     ARGUMENTS = "leader eth0 10.3.3.2/24"
]

# Executed when a server transits from leader->follower
RAFT_FOLLOWER_HOOK = [
    COMMAND = "raft/follower.sh",
    ARGUMENTS = "follower eth0 10.3.3.2/24"
]

Yes, I have. My hook script looks like this:

#!/bin/bash -e

ACTION="$1"

case $ACTION in
leader)
    sudo ip address add 192.168.2.50/24 dev eth1
    sudo ip address add 185.174.168.10/24 dev eth0
    sudo ip route replace 192.168.2.0/24 dev eth1 proto kernel scope link src 192.168.2.50 metric 100
    sudo ip route replace default via 185.174.168.254 dev eth0 proto static metric 100
    arping -c 5 -A -I eth1 192.168.2.50 & arping -c 5 -A -I eth0 185.174.168.10
    ;;

follower)
    # src is this node's own address: 192.168.2.51, .52 or .53
    sudo ip route replace 192.168.2.0/24 dev eth1 proto kernel scope link src 192.168.2.51 metric 100
    sudo ip route replace default via 192.168.2.254 dev eth1 proto static metric 100
    sudo ip address del 192.168.2.50/24 dev eth1
    sudo ip address del 185.174.168.10/24 dev eth0
    ;;

*)
    echo "Unknown action '$ACTION'" >&2
    exit 1
    ;;
esac

exit 0

I also tried removing the route src address replacement, but it had no effect.

When the cluster was up and running (even with those 'loaded incorrectly' log messages), the DB was replicating fine.

The problem seems to be about getting a vote from the node that is alive, and also that it keeps trying to get a vote from the dead node.

Yes, I am thinking that the hooks are somehow breaking the network, so the cluster status, leadership, etc. cannot be updated.

Could you disable them, configure the network and restart everything? Just to be sure that the hooks are not causing something to fail.

OK, I will try disabling the hooks.

In a failure situation this is normal:

Mon Jul 31 17:01:21 2017 [Z0][RCM][I]: Error requesting vote from follower 1:HTTP POST to URL 'http://192.168.2.52:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received

The node fails, so you cannot contact it.

But this:

Mon Jul 31 17:01:18 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused

This seems like a network issue. In this case, with two nodes down, the cluster will not come up.

Hmm, I disabled the hooks and restarted the nodes, so they have a fresh network config, but still:
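
To see which followers fail in which way, the vote errors can be counted per follower and per failure kind. As far as I understand the libcurl messages, "Connection refused" means the host answered but nothing was listening on the port (consistent with oned being stopped there), "Connection timed out" means the TCP connection was never established, and "Operation timed out ... with 0 out of -1 bytes received" means no response arrived before the timeout. The sketch below assumes the log line format shown above; `vote_error_summary` is a hypothetical helper name.

```shell
#!/bin/bash
# Count vote-request failures per follower and failure kind.
# Reads oned.log lines on stdin; prints "follower <id> <kind> <count>".
vote_error_summary() {
    awk '
        /Error requesting vote from follower/ {
            # The follower id sits between "follower " and the next ":".
            id = $0
            sub(/.*Error requesting vote from follower /, "", id)
            sub(/:.*/, "", id)

            if ($0 ~ /Connection refused/)        kind = "refused"
            else if ($0 ~ /Connection timed out/) kind = "connect-timeout"
            else if ($0 ~ /Operation timed out/)  kind = "response-timeout"
            else                                  kind = "other"

            count[id " " kind]++
        }
        END {
            for (k in count) print "follower", k, count[k]
        }
    ' | sort
}

# Usage (log path is an assumption): vote_error_summary < /var/log/one/oned.log
```

A follower that only ever shows response timeouts is reachable at the TCP level, which points at oned on that node being slow or blocked rather than at routing.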

Mon Jul 31 17:33:59 2017 [Z0][RCM][I]: Error requesting vote from follower 2:HTTP POST to URL 'http://192.168.2.53:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received
Mon Jul 31 17:34:05 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3001 milliseconds with 0 out of -1 bytes received

I don't understand why. What type of request is it sending? I can't reproduce it manually to see the result.

Does onezone show 0 work, or does it also hang until the timeout?

It is simply querying the XML-RPC API. Somehow, for that node ('http://192.168.2.53:2633/RPC2'), this does not work.

Hmm, I disabled the hooks and rebuilt the cluster. Everything was OK and the DB was replicating, so I executed init 6 on the leader node (node1).

At first it looked good, but it still crashed. Here are the logs:
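
A request of the same type can be sent by hand with curl. This is a minimal sketch, not what oned does internally: `xmlrpc_body` and `probe_oned` are hypothetical helpers, the method name one.zone.raftstatus is taken from the logs in this thread, and passing only a "user:password" session string as the single parameter is an assumption. The session value is not XML-escaped here, so special characters in the password would need extra handling.

```shell
#!/bin/bash
# Build an XML-RPC request body for a method that takes only a session string.
# Assumption: the session is passed as "user:password".
xmlrpc_body() {
    local method="$1" session="$2"
    printf '<?xml version="1.0"?><methodCall><methodName>%s</methodName><params><param><value><string>%s</string></value></param></params></methodCall>' \
        "$method" "$session"
}

# POST the body to a oned endpoint with a 3 s limit, similar to the timeout
# seen in the logs, and report the HTTP status and the total time taken.
probe_oned() {
    local url="$1" method="$2" session="$3"
    xmlrpc_body "$method" "$session" |
        curl --silent --show-error --max-time 3 \
             --header 'Content-Type: text/xml' \
             --data-binary @- \
             --output /dev/null \
             --write-out 'http=%{http_code} time=%{time_total}s\n' \
             "$url"
    echo "curl exit code: $?"
}

# Usage (placeholder credentials):
#   probe_oned http://192.168.2.53:2633/RPC2 one.zone.raftstatus 'oneadmin:password'
```

Running this in a loop from the candidate while the election is failing would show whether the endpoint is slow only at those moments.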

node2

Mon Jul 31 17:59:56 2017 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Mon Jul 31 17:59:59 2017 [Z0][ReM][D]: Req:2560 UID:0 one.zone.voterequest invoked , 751, 2, 225834, 750
Mon Jul 31 17:59:59 2017 [Z0][ReM][E]: Req:2560 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Already voted for another candidate
Mon Jul 31 17:59:59 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Connection timed out after 3003 milliseconds
Mon Jul 31 17:59:59 2017 [Z0][RCM][I]: Got vote from follower 2. Total votes: 1
Mon Jul 31 17:59:59 2017 [Z0][RCM][I]: Got majority of votes
Mon Jul 31 18:00:03 2017 [Z0][RCM][I]: Becoming leader of the zone. Last log record: 225834 last applied record: 225834
Mon Jul 31 18:00:03 2017 [Z0][RCM][I]: oned is now the leader of the zone
Mon Jul 31 18:00:03 2017 [Z0][RCM][I]: Follower 2 term (752) is higher than current (751)
Mon Jul 31 18:00:04 2017 [Z0][ReM][D]: Req:6640 UID:0 one.zone.voterequest invoked , 752, 2, 225834, 750
Mon Jul 31 18:00:08 2017 [Z0][ReM][I]: New term (752) discovered from candidate 2
Mon Jul 31 18:00:08 2017 [Z0][RCM][I]: Replication thread stopped
Mon Jul 31 18:00:08 2017 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 31 18:00:08 2017 [Z0][RCM][I]: Replication thread stopped
Mon Jul 31 18:00:08 2017 [Z0][RCM][I]: Replication thread stopped
Mon Jul 31 18:00:09 2017 [Z0][RCM][I]: Replication thread stopped
Mon Jul 31 18:00:10 2017 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Mon Jul 31 18:00:10 2017 [Z0][ReM][D]: Req:5472 UID:0 one.zone.voterequest invoked , 753, 2, 225834, 750
Mon Jul 31 18:00:12 2017 [Z0][RCM][I]: oned is set to follower mode
Mon Jul 31 18:00:12 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused
Mon Jul 31 18:00:12 2017 [Z0][ReM][D]: Req:6640 UID:0 one.zone.voterequest result SUCCESS, 751
Mon Jul 31 18:00:12 2017 [Z0][ReM][D]: Req:5472 UID:0 one.zone.voterequest result SUCCESS, 753
Mon Jul 31 18:00:15 2017 [Z0][RCM][I]: Error requesting vote from follower 2:HTTP POST to URL 'http://192.168.2.53:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received
Mon Jul 31 18:00:18 2017 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Mon Jul 31 18:00:18 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused
Mon Jul 31 18:00:18 2017 [Z0][AuM][D]: Message received: AUTHENTICATE SUCCESS 16 -

Mon Jul 31 18:00:18 2017 [Z0][ReM][D]: Req:960 UID:1 one.documentpool.info invoked , -2, -1, -1, 100
Mon Jul 31 18:00:18 2017 [Z0][ReM][D]: Req:960 UID:1 one.documentpool.info result SUCCESS, "<DOCUMENT_POOL></DOC..."
Mon Jul 31 18:00:19 2017 [Z0][RCM][I]: Vote not granted from follower 2: [one.zone.voterequest] Already voted for another candidate

node3

Mon Jul 31 17:59:56 2017 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Mon Jul 31 17:59:59 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Connection timed out after 3003 milliseconds
Mon Jul 31 17:59:59 2017 [Z0][RCM][I]: Vote not granted from follower 1: [one.zone.voterequest] Already voted for another candidate
Mon Jul 31 17:59:59 2017 [Z0][ReM][D]: Req:1824 UID:0 one.zone.voterequest invoked , 751, 1, 225834, 750
Mon Jul 31 17:59:59 2017 [Z0][ReM][D]: Req:1824 UID:0 one.zone.voterequest result SUCCESS, 751
Mon Jul 31 18:00:03 2017 [Z0][ReM][I]: Leader term (751) is outdated (752)
Mon Jul 31 18:00:03 2017 [Z0][ReM][D]: Req:5600 UID:0 one.zone.raftstatus invoked 
Mon Jul 31 18:00:03 2017 [Z0][ReM][D]: Req:5600 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Mon Jul 31 18:00:04 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Connection timed out after 3003 milliseconds
Mon Jul 31 18:00:07 2017 [Z0][RCM][I]: Error requesting vote from follower 1:HTTP POST to URL 'http://192.168.2.52:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Operation timed out after 3003 milliseconds with 0 out of -1 bytes received
Mon Jul 31 18:00:10 2017 [Z0][RCM][I]: Error requesting vote from follower 0:HTTP POST to URL 'http://192.168.2.51:2633/RPC2' failed.  libcurl failed even to execute the HTTP transaction, explaining:  Failed connect to 192.168.2.51:2633; Connection refused
Mon Jul 31 18:00:12 2017 [Z0][RCM][I]: Got vote from follower 1. Total votes: 1
Mon Jul 31 18:00:12 2017 [Z0][RCM][I]: Got majority of votes
Mon Jul 31 18:00:12 2017 [Z0][ReM][D]: Req:1824 UID:0 one.zone.voterequest invoked , 752, 1, 225834, 750
Mon Jul 31 18:00:18 2017 [Z0][ReM][D]: Req:5248 UID:0 one.zone.voterequest invoked , 753, 1, 225834, 750
Mon Jul 31 18:00:19 2017 [Z0][RCM][I]: Becoming leader of the zone. Last log record: 225834 last applied record: 225834
Mon Jul 31 18:00:19 2017 [Z0][ReM][E]: Req:1824 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Candidate's term is outdated
Mon Jul 31 18:00:19 2017 [Z0][ReM][E]: Req:5248 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Already voted for another candidate
Mon Jul 31 18:00:19 2017 [Z0][RCM][I]: oned is now the leader of the zone
Mon Jul 31 18:00:22 2017 [Z0][RCM][I]: Follower 1 term (754) is higher than current (753)
Mon Jul 31 18:00:23 2017 [Z0][AuM][D]: Message received: AUTHENTICATE SUCCESS 13 -

I don't know why node3 wants to be a candidate.

One node freezes in the 'candidate' state until the restarted/stopped one comes back. I also tried it by stopping the opennebula service.

One node stays a follower, and the second node becomes a candidate and freezes in that state until I start opennebula on the "failed" node. Then the candidate becomes leader.
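
For reference, a Raft candidate needs votes from a majority of all configured servers, counting its own vote. The arithmetic can be sketched as below (`quorum` is a hypothetical helper). In the logs above, "Got vote from follower 2. Total votes: 1" followed by "Got majority of votes" fits this if the reported total excludes the candidate's own vote, which is an assumption on my side.

```shell
#!/bin/bash
# Votes needed for a majority in a cluster of N servers
# (the candidate's own vote counts towards this number).
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Usage: quorum 3
```

With 3 servers the quorum is 2, so two live nodes should be able to elect a leader as long as the vote request between them gets answered in time.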

It seems there is a network problem between node2 and node3. I don't know
the reason, but that failure prevents the cluster from recovering from that
situation. If we are getting timeouts during the election, it may take some
time until it recovers. Can you send the same timeframes for both?

Hmm, each node is a VM running on a different compute node. The compute nodes use team0 + VLAN + bridge. I have also configured bandwidth limiting on the network interfaces and disks. Pinging between the nodes works, and so does my connection from outside. At the moment the connection errors appear in the logs, I can still connect from one node to the other.
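
Cutting the same time window out of each node's log could look like the sketch below. It assumes the line format seen in this thread ("Mon Jul 31 17:59:56 2017 [Z0]..."), where the time is the 4th whitespace-separated field, and that all lines are from the same day; `log_window` is a hypothetical helper.

```shell
#!/bin/bash
# Print log lines whose HH:MM:SS timestamp falls within [from, to].
# The comparison is a plain string comparison on the 4th field, which
# works because the times are zero-padded and share the same day.
log_window() {
    local from="$1" to="$2"
    awk -v from="$from" -v to="$to" '$4 >= from && $4 <= to'
}

# Usage (log path is an assumption):
#   log_window 17:59:50 18:00:25 < /var/log/one/oned.log
```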

Thinking about this, could it be a performance issue? If the servers are
deployed in VMs, maybe they do not get enough resources to process heartbeats
and other API calls. The timeouts are fairly large by default, but maybe
you can increase them to see if this is the problem:

    ELECTION_TIMEOUT_MS  = 2500,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 2000
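
Before changing anything, it may be worth confirming that all three servers use the same values. The sketch below only extracts the `*_TIMEOUT_MS` settings from configuration text on stdin; `raft_timeouts` is a hypothetical helper and the file path in the usage comment is an assumption.

```shell
#!/bin/bash
# Print the *_TIMEOUT_MS settings found in oned.conf-style text on stdin,
# normalised to NAME=VALUE so the output is easy to diff between nodes.
raft_timeouts() {
    grep -E '^[[:space:]]*[A-Z_]+_TIMEOUT_MS[[:space:]]*=' |
        sed -E 's/^[[:space:]]*//; s/,[[:space:]]*$//; s/[[:space:]]*=[[:space:]]*/=/'
}

# Usage (path is an assumption): raft_timeouts < /etc/one/oned.conf
```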

Hmm, I think 2.5 s and 2 s should be enough. I will try creating a custom VLAN just for cluster communication.