Onehost sync fails during 5.12.0 upgrade

Hi

We are testing the new OpenNebula 5.12 community upgrade in our testbed, following the docs:
https://docs.opennebula.io/5.12/intro_release_notes/upgrades/start_here.html

And also:
https://docs.opennebula.io/5.12/intro_release_notes/upgrades/upgrading_single.html#upgrade-single

but during the onehost sync step, we get the error:

$ onehost sync
* Adding hyp107.altaria.os to upgrade
* Adding hyp106.altaria.os to upgrade
* Adding hyp105.altaria.os to upgrade
* Adding hyp104.altaria.os to upgrade
[========================================] 4/4 hyp104.altaria.os                
Failed to update the following hosts:
* hyp107.altaria.os
* hyp105.altaria.os
* hyp106.altaria.os
* hyp104.altaria.os

The hypervisors are also in error status after the upgrade (and the VMs in unknown status). We didn't get any errors during the RPM/DB upgrade. Is this a known issue? We have upgraded from 5.8.1 to 5.12.0 using the community migrator package.

From the oned logs we can also see these error messages:

Mon Jul  6 17:15:45 2020 [Z0][AuM][D]: Message received: LOG I 4 Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate

Mon Jul  6 17:15:45 2020 [Z0][AuM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
Mon Jul  6 17:15:45 2020 [Z0][AuM][D]: Message received: LOG E 4 login token expired

Mon Jul  6 17:15:45 2020 [Z0][AuM][I]: login token expired
Mon Jul  6 17:15:45 2020 [Z0][AuM][D]: Message received: AUTHENTICATE FAILURE 4 login token expired

and from monitord.log

Mon Jul  6 17:24:07 2020 [Z0][MDP][W]: Start monitor failed for host 0: 
Mon Jul  6 17:24:07 2020 [Z0][HMM][E]: Unable to monitor host id: 0
Mon Jul  6 17:24:07 2020 [Z0][MDP][I]: 
Mon Jul  6 17:24:07 2020 [Z0][MDP][I]: 
Mon Jul  6 17:24:07 2020 [Z0][MDP][I]: 
Mon Jul  6 17:24:07 2020 [Z0][MDP][I]: 
Mon Jul  6 17:24:07 2020 [Z0][MDP][D]: [1:0:0] Recieved START_MONITOR message from host 3:

Cheers
Álvaro

Hello @alvaro_simongarcia,

Usually the sync fails when some file lacks the required permissions or when a symbolic link is broken. Let's check the first case: could you run find /var/lib/one/remotes ! -user oneadmin -exec ls -l {} \; on your frontend and share the output?
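If you want to repeat the check later, here is a minimal sketch wrapping a variant of that find (using ls -ld so each offending entry is listed itself rather than its contents); the function name and paths are illustrative, not part of the OpenNebula tooling:

```shell
#!/bin/sh
# Sketch: list anything under the remotes tree that is NOT owned by the
# expected user (oneadmin on a standard front-end). Anything printed is a
# candidate for chowning back to oneadmin or moving out of the tree.
check_remotes_owner() {
    remotes="$1"   # e.g. /var/lib/one/remotes
    owner="$2"     # e.g. oneadmin
    find "$remotes" ! -user "$owner" -exec ls -ld {} \;
}
```

An empty output means every file in the tree is owned by the expected user.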

Hi @cgonzalez

Ah, indeed, we had a few files there accessible only by root, and this was interfering with the sync. We made a backup of /var/lib/one/remotes/etc as root, so several files ended up owned by root:

# find /var/lib/one/remotes ! -user oneadmin -exec ls -l {} \;
total 24
drwxr-x--- 4 root root 4096 Jul  6 14:28 datastore
drwxr-x--- 4 root root 4096 Jul  6 14:28 im
drwxr-x--- 3 root root 4096 Jul  6 14:28 market
drwxr-x--- 3 root root 4096 Jul  6 14:28 tm
drwxr-x--- 5 root root 4096 Jul  6 14:28 vmm
drwxr-x--- 2 root root 4096 Jul  6 14:28 vnm
total 12
drwxr-x--- 2 root root 4096 Jul  6 14:28 kvm
drwxr-x--- 2 root root 4096 Jul  6 14:28 lxd
drwxr-x--- 2 root root 4096 Jul  6 14:28 vcenter
total 4
-rw-r----- 1 root root 1668 Jul  6 14:28 vcenterrc
-rw-r----- 1 root root 1668 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/vmm/vcenter/vcenterrc
total 4
-rw-r----- 1 root root 3652 Jul  6 14:28 kvmrc
-rw-r----- 1 root root 3652 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/vmm/kvm/kvmrc
total 4
-rw-r----- 1 root root 2053 Jul  6 14:28 lxdrc
-rw-r----- 1 root root 2053 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/vmm/lxd/lxdrc
total 8
-rw-r----- 1 root root 4770 Jul  6 14:28 OpenNebulaNetwork.conf
-rw-r----- 1 root root 4770 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/vnm/OpenNebulaNetwork.conf
total 8
drwxr-x--- 2 root root 4096 Jul  6 14:28 kvm-probes.d
drwxr-x--- 2 root root 4096 Jul  6 14:28 lxd-probes.d
total 4
-rw-r----- 1 root root 2650 Jul  6 14:28 pci.conf
-rw-r----- 1 root root 2650 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/im/lxd-probes.d/pci.conf
total 4
-rw-r----- 1 root root 2650 Jul  6 14:28 pci.conf
-rw-r----- 1 root root 2650 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/im/kvm-probes.d/pci.conf
total 8
drwxr-x--- 2 root root 4096 Jul  6 14:28 ceph
drwxr-x--- 2 root root 4096 Jul  6 14:28 fs
total 4
-rw-r----- 1 root root 1238 Jul  6 14:28 fs.conf
-rw-r----- 1 root root 1238 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/datastore/fs/fs.conf
total 4
-rw-r----- 1 root root 1856 Jul  6 14:28 ceph.conf
-rw-r----- 1 root root 1856 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/datastore/ceph/ceph.conf
total 4
drwxr-x--- 2 root root 4096 Jul  6 14:28 fs_lvm
total 4
-rw-r----- 1 root root 1630 Jul  6 14:28 fs_lvm.conf
-rw-r----- 1 root root 1630 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/tm/fs_lvm/fs_lvm.conf
total 4
drwxr-x--- 2 root root 4096 Jul  6 14:28 http
total 4
-rw-r----- 1 root root 1238 Jul  6 14:28 http.conf
-rw-r----- 1 root root 1238 Jul  6 14:28 /var/lib/one/remotes/etc.2020-07-06/market/http/http.conf
total 20
drwxr-x--- 3 root root 4096 Mar 10 16:52 datastore
drwxr-x--- 4 root root 4096 Mar 10 16:52 im
drwxr-x--- 3 root root 4096 Mar 10 16:52 tm
drwxr-x--- 5 root root 4096 Mar 10 16:52 vmm
drwxr-x--- 2 root root 4096 Mar 10 16:52 vnm
total 12
drwxr-x--- 2 root root 4096 Mar 10 16:52 kvm
drwxr-x--- 2 root root 4096 Mar 10 16:52 lxd
drwxr-x--- 2 root root 4096 Mar 10 16:52 vcenter
total 4
-rw-r--r-- 1 root root 1513 Mar 10 16:52 vcenterrc
-rw-r--r-- 1 root root 1513 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/vmm/vcenter/vcenterrc
total 4
-rw-r--r-- 1 root root 3436 Mar 10 16:52 kvmrc
-rw-r--r-- 1 root root 3436 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/vmm/kvm/kvmrc
total 4
-rw-r--r-- 1 root root 2053 Mar 10 16:52 lxdrc
-rw-r--r-- 1 root root 2053 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/vmm/lxd/lxdrc
total 8
-rw-r--r-- 1 root root 4572 Mar 10 16:52 OpenNebulaNetwork.conf
-rw-r--r-- 1 root root 4572 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/vnm/OpenNebulaNetwork.conf
total 8
drwxr-x--- 2 root root 4096 Mar 10 16:52 kvm-probes.d
drwxr-x--- 2 root root 4096 Mar 10 16:52 lxd-probes.d
total 4
-rw-r--r-- 1 root root 2650 Mar 10 16:52 pci.conf
-rw-r--r-- 1 root root 2650 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/im/lxd-probes.d/pci.conf
total 4
-rw-r--r-- 1 root root 2650 Mar 10 16:52 pci.conf
-rw-r--r-- 1 root root 2650 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/im/kvm-probes.d/pci.conf
total 4
drwxr-x--- 2 root root 4096 Mar 10 16:52 ceph
total 4
-rw-r--r-- 1 root root 1744 Mar 10 16:52 ceph.conf
-rw-r--r-- 1 root root 1744 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/datastore/ceph/ceph.conf
total 4
drwxr-x--- 2 root root 4096 Mar 10 16:52 fs_lvm
total 4
-rw-r--r-- 1 root root 1577 Mar 10 16:52 fs_lvm.conf
-rw-r--r-- 1 root root 1577 Mar 10 16:52 /var/lib/one/remotes/etc.2020-03-10/tm/fs_lvm/fs_lvm.conf

We should use the oneadmin user to make those backups next time. I have moved the spurious /var/lib/one/remotes/etc.xxxxxxx directories out of the way, and now the sync works correctly as oneadmin:

$ onehost sync
* Adding hyp107.altaria.os to upgrade
* Adding hyp106.altaria.os to upgrade
* Adding hyp105.altaria.os to upgrade
[========================================] 3/3 hyp105.altaria.os 
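For reference, the cleanup step can be sketched as a small script; the function name and the backup destination are assumptions (any location outside /var/lib/one/remotes will do):

```shell
#!/bin/sh
# Sketch: move stray etc.* backup directories out of the remotes tree so
# that onehost sync only ships the files it is supposed to.
move_stray_backups() {
    remotes="$1"   # e.g. /var/lib/one/remotes
    dest="$2"      # e.g. /var/lib/one/remotes-backups (outside the remotes tree)
    mkdir -p "$dest"
    for dir in "$remotes"/etc.*; do
        [ -d "$dir" ] || continue   # skip if the glob matched nothing
        mv "$dir" "$dest/"
    done
}
```

After moving the backups (and fixing ownership of anything still owned by root), re-running onehost sync as oneadmin succeeds.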

So the sync issue is fixed, thanks a lot!

Cheers
Álvaro

Hi @cgonzalez

More good news! With this fix the hosts are now available again; it seems this has also fixed another issue (Hosts in error after upgrading to 5.12.0).

Cheers
Álvaro
