Hello everyone,
We are using OpenNebula 5.12 in an HA environment composed of 3 front-ends (FEs) and 3 hypervisors (HVs). We use shared storage (NFS and fs_lvm) for the VMs. The system datastore is configured as follows:
DATASTORE 102 INFORMATION
ID : 102
NAME : nebula_stor_01
USER : oneadmin
GROUP : oneadmin
CLUSTERS : 0
TYPE : SYSTEM
DS_MAD : -
TM_MAD : fs_lvm
BASE PATH : /var/lib/one//datastores/102
DISK_TYPE : FILE
STATE : READY

DATASTORE CAPACITY
TOTAL: : 10T
FREE: : 9.9T
USED: : 100G
LIMIT: : -

PERMISSIONS
OWNER : uma
GROUP : um-
OTHER : u--

DATASTORE TEMPLATE
ALLOW_ORPHANS="NO"
BRIDGE_LIST="nebhv01 nebhv02 nebhv03"
DISK_TYPE="FILE"
DS_MIGRATE="YES"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="fs_lvm"
TYPE="SYSTEM_DS"
The problem appears when a hypervisor crashes: OpenNebula fails to restart the VMs that were deployed on it on another healthy hypervisor.
We are using the host_error hook with fencing enabled:
HOOK 0 INFORMATION
ID : 0
NAME : host_error
TYPE : state
LOCK : None

HOOK TEMPLATE
ARGUMENTS="$TEMPLATE -m -p 0"
COMMAND="ft/host_error.rb"
NAME="host_error"
REMOTE="NO"
RESOURCE="HOST"
STATE="ERROR"
TYPE="state"
Here are the hook logs from when the hypervisor goes down:
[HOST 6][I] Hook launched
[HOST 6][I] hostname: nebhv02
[HOST 6][I] Fencing enabled
[HOST 6][I] Success: Rebooted
[HOST 6][I] Fencing success
[HOST 6][I] states: 3, 5, 8
[HOST 6][I] vms: ["50"]
[HOST 6][I] resched 50
[HOST 6][I] Hook finished
Here are the logs for the VM:
[VM 50][Z0][VM][I]: New LCM state is UNKNOWN
[VM 50][Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN
[VM 50][Z0][VM][I]: New state is ACTIVE
[VM 50][Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/fs_lvm/mv nebhv02:/var/lib/one//datastores/102/50/disk.0 nebhv03:/var/lib/one//datastores/102/50/disk.0 50 104
[VM 50][Z0][TM][E]: mv: Command " set -ex -o pipefail
[VM 50][Z0][TM][I]: if [ -b "/dev/vg-one-102/lv-one-50-0" ]; then
[VM 50][Z0][TM][I]: sync
[VM 50][Z0][TM][I]: sudo -n lvscan
[VM 50][Z0][TM][I]: sudo -n lvchange -an "/dev/vg-one-102/lv-one-50-0"
[VM 50][Z0][TM][I]: fi
[VM 50][Z0][TM][I]: rm -f "/var/lib/one/datastores/102/50/.host" || :" failed: ssh: connect to host nebhv02 port 22: No route to host
[VM 50][Z0][TM][E]: Error deactivating disk /var/lib/one/datastores/102/50/disk.0
[VM 50][Z0][TM][E]: Error executing image transfer script: Error deactivating disk /var/lib/one/datastores/102/50/disk.0
[VM 50][Z0][VM][I]: New LCM state is PROLOG_MIGRATE_UNKNOWN_FAILURE
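For anyone trying to reproduce this, the SSH step that fails in the log above can be checked by hand from the front-end. The following is only a rough sketch: `check_src_host` is a made-up helper name (not part of OpenNebula), and it assumes key-based SSH as the user running it.

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not part of OpenNebula): given the SRC
# argument of the failed TM mv command ("host:/path"), report whether that
# host answers over SSH.
check_src_host() {
    src="$1"
    host="${src%%:*}"   # text before the first ':' is the host name
    # BatchMode avoids password prompts; ConnectTimeout bounds the wait.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "$host: reachable"
    else
        # In the else branch, $? still holds the exit status of ssh.
        echo "$host: unreachable (ssh exit code $?)"
    fi
}

# Example, using the SRC argument from the log above:
# check_src_host "nebhv02:/var/lib/one//datastores/102/50/disk.0"
```

ssh itself exits with 255 when it cannot connect, which matches the exit code reported in the TM log while nebhv02 is down.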
Any clue as to why this is happening and how to fix it?
Thank you!
P.S. I can provide more details if necessary.