VMs fail to start after hypervisor failure

Hello everyone,

We are using OpenNebula 5.12 in a HA environment composed of 3 FEs and 3 HVs. We are using shared storage (NFS & fs_lvm) for VMs.

DATASTORE 102 INFORMATION
ID : 102
NAME : nebula_stor_01
USER : oneadmin
GROUP : oneadmin
CLUSTERS : 0
TYPE : SYSTEM
DS_MAD : -
TM_MAD : fs_lvm
BASE PATH : /var/lib/one//datastores/102
DISK_TYPE : FILE
STATE : READY

DATASTORE CAPACITY
TOTAL: : 10T
FREE: : 9.9T
USED: : 100G
LIMIT: : -

PERMISSIONS
OWNER : uma
GROUP : um-
OTHER : u–

DATASTORE TEMPLATE
ALLOW_ORPHANS=“NO”
BRIDGE_LIST=“nebhv01 nebhv02 nebhv03”
DISK_TYPE=“FILE”
DS_MIGRATE=“YES”
RESTRICTED_DIRS=“/”
SAFE_DIRS=“/var/tmp”
SHARED=“YES”
TM_MAD=“fs_lvm”
TYPE=“SYSTEM_DS”

The problem starts to appear when a hypervisor is crashing, failing to restart the VMs deployed on it to another healthy one.
We are using the host_error hook with fencing enabled:

HOOK 0 INFORMATION
ID                : 0
NAME              : host_error
TYPE              : state
LOCK              : None

HOOK TEMPLATE
ARGUMENTS="$TEMPLATE -m -p 0"
COMMAND="ft/host_error.rb"
NAME="host_error"
REMOTE="NO"
RESOURCE="HOST"
STATE="ERROR"
TYPE="state"

Here are the logs for the hook when hypervisor goes down:

[HOST 6][I] Hook launched
[HOST 6][I] hostname: nebhv02
[HOST 6][I] Fencing enabled
[HOST 6][I] Success: Rebooted
[HOST 6][I] Fencing success
[HOST 6][I] states: 3, 5, 8
[HOST 6][I] vms: [“50”]
[HOST 6][I] resched 50
[HOST 6][I] Hook finished

Here are the logs for the VM:

[VM 50][Z0][VM][I]: New LCM state is UNKNOWN
[VM 50][Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN
[VM 50][Z0][VM][I]: New state is ACTIVE
[VM 50][Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/fs_lvm/mv nebhv02:/var/lib/one//datastores/102/50/disk.0 nebhv03:/var/lib/one//datastores/102/50/disk.0 50 104
[VM 50][Z0][TM][E]: mv: Command " set -ex -o pipefail
[VM 50][Z0][TM][I]: if [ -b “/dev/vg-one-102/lv-one-50-0” ]; then
[VM 50][Z0][TM][I]: sync
[VM 50][Z0][TM][I]: sudo -n lvscan
[VM 50][Z0][TM][I]: sudo -n lvchange -an “/dev/vg-one-102/lv-one-50-0”
[VM 50][Z0][TM][I]: fi
[VM 50][Z0][TM][I]: rm -f “/var/lib/one/datastores/102/50/.host” || :" failed: ssh: connect to host nebhv02 port 22: No route to host
[VM 50][Z0][TM][E]: Error deactivating disk /var/lib/one/datastores/102/50/disk.0
[VM 50][Z0][TM][E]: Error executing image transfer script: Error deactivating disk /var/lib/one/datastores/102/50/disk.0
[VM 50][Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN_FAILURE

Any clue to why this behavior is happening and how to fix this?
Thank you!

P.S. I can provide more details if necessary.

Check if stored hosts credentials are correct, maybe you did also change any routing configurations, so after hypervisor crash?

Hey @slnt_opp ,

i can confirm credentials are correct and no routing changes have been made neither before or after HV crash.