HA issue: zone 0 is missing a leader

Hi,

While setting up HA on our existing standalone 5.4.1 test environment (on Ubuntu 16.04) I made a mistake and ended up with Zone 0 having a single follower (which used to be the leader), and OpenNebula is not functional anymore.
If I try to remove that single follower, the command fails saying that the zone has no leader.
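For reference, the failing command was just the standard server removal (server ID 0, per the output below):

onezone server-del 0 0

and it errors out because there is no leader left to commit the membership change.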

/var/log/one# onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 server-0 http://xxxxx:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server-0 follower 2 147 147 -1 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"

I tried to recover by reloading a DB export and oned.conf backups, but that didn't help (notice the error from "/usr/share/one/follower_cleanup" in the status output below).

root@coenebula01:/# systemctl status opennebula
● opennebula.service - OpenNebula Cloud Controller Daemon
Loaded: loaded (/lib/systemd/system/opennebula.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2017-12-22 10:13:38 MST; 57min ago
Process: 3241 ExecStopPost=/usr/share/one/follower_cleanup (code=exited, status=2)
Process: 3238 ExecStopPost=/bin/rm -f /var/lock/one/one (code=exited, status=0/SUCCESS)
Process: 3222 ExecStop=/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
Process: 3356 ExecStartPre=/usr/sbin/logrotate -s /tmp/logrotate.state -f /etc/logrotate.d/opennebula (code=exited, status=0/SUCCESS)
Process: 3351 ExecStartPre=/bin/chown oneadmin:oneadmin /var/log/one (code=exited, status=0/SUCCESS)
Process: 3348 ExecStartPre=/bin/mkdir -p /var/log/one (code=exited, status=0/SUCCESS)
Main PID: 3362 (oned)
Tasks: 103
Memory: 92.4M
CPU: 16.283s
CGroup: /system.slice/opennebula.service
├─3362 /usr/bin/oned -f
├─3374 ruby /usr/lib/one/mads/one_hm.rb
├─3410 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 15 -r 0 kvm
├─3427 ruby /usr/lib/one/mads/one_vmm_exec.rb -l deploy,shutdown,reboot,cancel,save,restore,migrate,poll,pre,post,clean,snapshotcreate,snapshotrevert,snapshotdelete,attach_nic,de
├─3444 /usr/lib/one/mads/collectd -p 4124 -f 5 -t 50 -i 20
├─3497 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 kvm
├─3512 ruby /usr/lib/one/mads/one_tm.rb -t 15 -d dummy,lvm,shared,fs_lvm,qcow2,ssh,ceph,dev,vcenter,iscsi_libvirt
├─3532 ruby /usr/lib/one/mads/one_datastore.rb -t 15 -d dummy,fs,lvm,ceph,dev,iscsi_libvirt,vcenter -s shared,ssh,ceph,fs_lvm,qcow2,vcenter
├─3548 ruby /usr/lib/one/mads/one_market.rb -t 15 -m http,s3,one
├─3564 ruby /usr/lib/one/mads/one_ipam.rb -t 1 -i dummy
└─3577 ruby /usr/lib/one/mads/one_auth_mad.rb --authn ssh,x509,ldap,server_cipher,server_x509

Dec 22 10:13:38 coenebula01 systemd[1]: Starting OpenNebula Cloud Controller Daemon…
Dec 22 10:13:38 coenebula01 systemd[1]: Started OpenNebula Cloud Controller Daemon.

Before I go ahead and rebuild the whole environment, would somebody have an idea of how I could recover from this state?

oned.log and sched.log are being updated with these lines:

root@coenebula01:/var/log/one# tail oned.log
Fri Dec 22 11:13:44 2017 [Z0][ReM][D]: Req:6368 UID:0 one.zone.raftstatus invoked
Fri Dec 22 11:13:44 2017 [Z0][ReM][D]: Req:6368 UID:0 one.zone.raftstatus result SUCCESS, "<SERVER_ID>-1<…"
Fri Dec 22 11:14:14 2017 [Z0][ReM][D]: Req:6080 UID:0 one.zone.raftstatus invoked
Fri Dec 22 11:14:14 2017 [Z0][ReM][D]: Req:6080 UID:0 one.zone.raftstatus result SUCCESS, "<SERVER_ID>-1<…"
Fri Dec 22 11:14:44 2017 [Z0][ReM][D]: Req:8000 UID:0 one.zone.raftstatus invoked
Fri Dec 22 11:14:44 2017 [Z0][ReM][D]: Req:8000 UID:0 one.zone.raftstatus result SUCCESS, "<SERVER_ID>-1<…"
Fri Dec 22 11:15:14 2017 [Z0][ReM][D]: Req:2000 UID:0 one.zone.raftstatus invoked
Fri Dec 22 11:15:14 2017 [Z0][ReM][D]: Req:2000 UID:0 one.zone.raftstatus result SUCCESS, "<SERVER_ID>-1<…"
Fri Dec 22 11:15:44 2017 [Z0][ReM][D]: Req:9728 UID:0 one.zone.raftstatus invoked
Fri Dec 22 11:15:44 2017 [Z0][ReM][D]: Req:9728 UID:0 one.zone.raftstatus result SUCCESS, "<SERVER_ID>-1<…"

root@coenebula01:/var/log/one# tail sched.log
Fri Dec 22 11:11:44 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:12:14 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:12:44 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:13:14 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:13:44 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:14:14 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:14:44 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:15:14 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:15:44 2017 [Z0][SCHED][E]: oned is not leader
Fri Dec 22 11:16:14 2017 [Z0][SCHED][E]: oned is not leader
root@coenebula01:/var/log/one#

Thanks a lot,

Alex

I have exactly the same issue using OpenNebula 5.4.6 and 3 servers in HA, on CentOS 7.

Did you find any workaround for it, or did you simply re-create the environment?

Thanks,
Bogdan

Try setting the leader of the cluster to solo mode, then remove all the followers.
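In practice that means something like the following sketch on the surviving server (default packaged paths assumed; adjust the follower IDs to match your zone):

# switch this server to solo mode: SERVER_ID = -1 in the FEDERATION
# section of oned.conf (or edit the file by hand), then restart oned
sudo sed -i 's/^\( *SERVER_ID *= *\)[0-9-]*/\1-1/' /etc/one/oned.conf
sudo systemctl restart opennebula

# once "onezone show 0" reports the state as solo, drop the stale followers
onezone server-del 0 1
onezone server-del 0 2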

Hello,

(Sorry for the delay)
No, I did not have to re-create the environment.

As far as I can remember, here is what I did to recover:

  • Just recovering the existing database from a backup didn't help (as I mentioned in my post). So, as I already had nothing to lose, I actually deleted the database and restored it from the backup I had created just prior to starting the HA setup.

  • Restored from backup all the OpenNebula configuration files I had changed in order to set up HA.

It came back up in standalone mode, exactly at the point where I had taken the database backup.

After that, I initiated the HA configuration process again, and this time it worked. Our 3-node HA setup has been working ever since. A rough sketch of the recovery commands is below.
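For anyone hitting this later, the steps above boil down to roughly this (MySQL backend and illustrative file names assumed; adapt for SQLite):

systemctl stop opennebula
# replace the half-configured HA database with the pre-HA dump;
# -f overwrites the existing "opennebula" database
onedb restore -f -S localhost -u oneadmin -p oneadmin -d opennebula /backup/one-db-before-ha.sql
# put back the pre-HA configuration files (paths illustrative)
cp /backup/oned.conf.pre-ha /etc/one/oned.conf
systemctl start opennebula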

Alex

Hi Ruben and Alex,

Thanks for your replies.

Ruben, I've tried your suggestion but it didn't work. Basically, I believe the issue is that I cannot reset the "term" number of the leader server, and somehow it keeps a wrong/old configuration of the cluster.

Please see the logs below:

Before:

onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 server0 http://server0:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server0 leader 51639 921904 921904 0 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"


After:

onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 server0 http://server0:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server0 solo 51639 -1 0 0 -1

onezone server-del 0 1
[one.zone.delserver] SERVER not found in zone

onezone server-del 0 2
[one.zone.delserver] SERVER not found in zone

This is what happens when the other two servers in the zone have the OpenNebula service stopped.

What is interesting comes next, when I enabled the service on both servers 1 and 2 while server 0 was still in solo mode: server 0 went back to being a follower, while it should have remained in solo mode without participating in the election.

cat /etc/one/oned.conf | grep SERVER_ID

SERVER_ID: ID identifying this server in the zone as returned by the

SERVER_ID     = -1,

onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 server0 http://server0:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server0 follower 51640 921905 921905 2 -1

To me it looks like something is messed up in the database, and I have no way to reset the TERM number to 0 or to flush the old zone setup.
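If it helps anyone look at the same thing: as far as I can tell the Raft persistent state (TERM, VOTEDFOR, ...) lives inside the OpenNebula database itself, apparently as a special record in the logdb table, so it can at least be inspected there (MySQL backend here, and I am not sure this location is stable across versions):

# stop oned first so nothing rewrites the record while you look
systemctl stop opennebula
mysql -u oneadmin -p opennebula -e "SELECT log_index, term, sqlcmd FROM logdb WHERE log_index = -1;"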

Do you guys have any other clue?

Thanks,
Bogdan