Recover VMs stuck in UNKNOWN state

We upgraded our cluster from 5.10 to 5.12, but a few hosts were mistakenly downgraded back to 5.10.

Even after re-upgrading those KVM hosts to 5.12, the VMs on those hosts are stuck in UNKNOWN state.

How do we recover these VMs?

Answer

We recovered 90% of the VMs by running the following on the OpenNebula node:

sudo -u oneadmin onehost sync --force

To recover the VMs on a specific host, add the host name or ID:

vms=$(onevm list --csv | grep unkn | tail -n 1)  # Picks a single VM stuck in UNKNOWN state
vm_id=$(echo "$vms" | awk -F, '{print $1}')     # Column 1: VM ID
vm_name=$(echo "$vms" | awk -F, '{print $4}')   # Column 4: VM name
kvm_name=$(echo "$vms" | awk -F, '{print $8}')  # Column 8: KVM host name

echo $vm_id
echo $vm_name
echo $kvm_name

sudo -u oneadmin onehost sync $kvm_name --force
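
To handle every affected host in one pass rather than one VM at a time, the same CSV parsing can be wrapped in a function. This is only a minimal sketch, assuming the same `onevm list --csv` layout as above (host name in column 8, stuck VMs shown as `unkn`) and VM names without embedded commas. `hosts_with_unknown_vms` and `sync_unknown_hosts` are hypothetical helper names, not OpenNebula commands.

```shell
# Print the unique host names (column 8) of VMs whose CSV line mentions "unkn".
# Reads `onevm list --csv` output on stdin; the header line is skipped.
hosts_with_unknown_vms() {
    awk -F, 'NR > 1 && /unkn/ && $8 != "" { print $8 }' | sort -u
}

# Run a forced sync for each affected host, one host at a time.
sync_unknown_hosts() {
    onevm list --csv | hosts_with_unknown_vms | while read -r host; do
        echo "Syncing $host"
        sudo -u oneadmin onehost sync "$host" --force
    done
}
```

Calling `sync_unknown_hosts` on the OpenNebula node would then sync each host once, even if several of its VMs are stuck.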

For the remaining VMs, we did the following:

  1. Ensure the host is enabled
onehost enable $kvm_id # you can get the ID with 'onehost list'
onehost show $kvm_id | grep 'STATE\|LAST' # Wait until this transitions from INIT to MONITORED
  2. Restart libvirt-bin on the KVM host
sudo service libvirt-bin restart
  3. Re-run the sync on the OpenNebula node
sudo -u oneadmin onehost sync --force
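
The "wait until MONITORED" part of step 1 can be scripted instead of re-running `onehost show` by hand. The following is a rough sketch, assuming `onehost show` prints a line of the form `STATE : MONITORED` (the same line the `grep` above matches). `host_state` and `wait_for_monitored` are hypothetical helper names, and the default of 30 checks 10 seconds apart is arbitrary.

```shell
# Extract the value of the first STATE line from `onehost show` output on stdin.
host_state() {
    awk -F: '/^STATE/ && !seen { gsub(/[ \t]/, "", $2); print $2; seen = 1 }'
}

# Poll a host until it reports MONITORED, giving up after max_tries checks.
# Usage: wait_for_monitored <hostid> [max_tries] [delay_seconds]
wait_for_monitored() {
    host_id=$1
    max_tries=${2:-30}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        state=$(onehost show "$host_id" | host_state)
        if [ "$state" = "MONITORED" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "Host $host_id still in state '$state' after $max_tries checks" >&2
    return 1
}
```

For example, `onehost enable $kvm_id && wait_for_monitored $kvm_id` would return once the host is monitored, or fail with a message if it never gets there.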

If the host still isn’t updating, try killing the monitord process, removing its error and PID files, and then forcing a resync.

# On host
ps aux | grep monitord
kill <pid of monitord>
rm /tmp/onemonitord-<hostid>.error
rm /tmp/one-monitord-<hostid>.pid
# On OpenNebula
onehost enable <hostid>
sudo -u oneadmin onehost enable <hostid> --force
sudo -u oneadmin onehost forcesync

If this doesn’t resolve the issue, try the following:

# On KVM host
mv /var/tmp/one /var/tmp/one.backup

We did have a few hosts that took a couple of sync attempts to fix:

onehost enable 72
sudo -u oneadmin onehost sync 42 --force

Should be

sudo -u oneadmin onehost sync <hostid> --force
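
Since some hosts needed more than one sync attempt, a small retry wrapper can save repeating the command by hand. This is a generic sketch, not part of OpenNebula. `retry` is a hypothetical helper, and it only helps if the wrapped command signals failure through a non-zero exit status, which is an assumption for `onehost sync`.

```shell
# Run a command up to max_tries times, pausing between failed attempts.
# Usage: retry <max_tries> <delay_seconds> <command> [args...]
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, `retry 3 30 sudo -u oneadmin onehost sync <hostid> --force` would try the sync up to three times, 30 seconds apart. If the command exits zero even when VMs remain stuck, check `onevm list` after each attempt instead.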