Host crash - how to handle it correctly?

Yenya · August 16, 2023, 2:41pm

Today a host in our ONe cluster crashed (a HW error, reportedly to be fixed by a pending BIOS upgrade), and I discovered that I don’t know the proper way to handle a host crash. How can I tell OpenNebula something like “this host has crashed and rebooted, try to recover/reschedule everything that was running on it”?

For crashed hosts it is not easy, because oned cannot possibly know whether the host is really down or just overloaded and lagging. But for rebooted hosts, from the host uptime oned should know that it is safe to assume that everything previously running on that host is definitely gone.
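
The uptime argument can be sketched as a simple timestamp comparison: if the host booted after the last moment a VM was seen running there, the VM process cannot have survived. This is only a minimal illustration; `rebooted_since` and `boot_epoch` are hypothetical helper names, not part of OpenNebula, and how the "last seen" timestamp is obtained is left open.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the host booted strictly after the
# given "last seen running" time. Both arguments are seconds since the epoch.
rebooted_since() {
    boot=$1
    last_seen=$2
    [ "$boot" -gt "$last_seen" ]
}

# Boot time of the local machine as seconds since the epoch, derived
# from /proc/uptime (Linux only). Run it on the host in question.
boot_epoch() {
    now=$(date +%s)
    up=$(cut -d. -f1 /proc/uptime)
    echo $((now - up))
}
```

For example, `rebooted_since "$(boot_epoch)" "$LAST_SEEN"` run on the host succeeds only if the host has rebooted since `$LAST_SEEN` (an assumed variable holding the last monitoring timestamp).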

I tried to run “reschedule” on one VM, and “undeploy/deploy” on another, but both crashed on boot with I/O errors on /dev/sda. The problem was that apparently QEMU locks the Ceph RBD image while it is in use, and after the crash/reboot the lock remains in place. So the VMs were getting I/O errors on writes to their disks. Is OpenNebula supposed to handle this and remove the lock?

FWIW, I unlocked the RBD images the following way:

# Make sure no new VMs get scheduled onto the rebooted host:
onehost disable "$CRASHED_HOST"
# Verify that no VMs are running on that host.
# Get a list of VMs on that host which are in the UNKNOWN state,
# or, if the host is already rebooted, in the POWEROFF state
# and check logs which ones were indeed running at the time of crash.
# Get a list of images locked by that host
# (grep -F matches the IP literally, so its dots are not regex wildcards):
rbd ls one | while read -r image
do
    rbd lock ls "one/$image" | grep -qF "$IP_OF_CRASHED_HOST:0" \
        && echo "$image"
done > /tmp/locked-images
# Remove the locks (verify that the output looks ok and then re-run
# with `echo` below removed):
while read -r image
do
    # Query the locks once and reuse the JSON for both fields.
    locks="$(rbd lock ls "one/$image" --format json)"
    id="$(printf '%s\n' "$locks" | jq -r '.[0].id')"
    locker="$(printf '%s\n' "$locks" | jq -r '.[0].locker')"
    echo rbd lock rm "one/$image" "$id" "$locker"
done < /tmp/locked-images
# Restart the crashed VMs. I did this without rescheduling them
# to another host, because I wanted to test whether the BIOS
# upgrade helped.
onevm resume 1234,5678,1235,1236,...
onehost enable $CRASHED_HOST
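
One caveat with the loop above: it removes the first lock entry (`.[0]`) of each image, which is only right when the image has a single lock. A possible refinement is to pick the lock by its address instead. The sketch below assumes that `rbd lock ls --format json` prints an array of objects with `id`, `locker` and `address` fields and that `address` starts with the client IP followed by a colon; `locks_for_host` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads `rbd lock ls --format json` output on stdin and prints one
# "id<TAB>locker" line for every lock whose address starts with "<IP>:".
# Assumption: the JSON is an array of {"id", "locker", "address"} objects.
locks_for_host() {
    jq -r --arg ip "$1" \
        '.[] | select(.address | startswith($ip + ":")) | [.id, .locker] | @tsv'
}

# Possible use inside the removal loop (still a dry run because of `echo`):
#   rbd lock ls "one/$image" --format json \
#       | locks_for_host "$IP_OF_CRASHED_HOST" \
#       | while IFS="$(printf '\t')" read -r id locker
#         do
#             echo rbd lock rm "one/$image" "$id" "$locker"
#         done
```

Matching on the `"<IP>:"` prefix also avoids accidentally matching a longer IP that merely contains the crashed host's address as a substring.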

But in my opinion ONe should be able to do this itself. So, what is the correct way of handling a crashed host? Thanks,

-Yenya

dclavijo · August 22, 2023, 2:57pm

Very interesting topic. Host crashes and their consequences are quite tricky.

You can write your own script to determine whether the host has really crashed, and trigger it with a Host state hook when the host goes into the ERROR state (the monitoring should kick in).

The script you already have for unlocking the RBD images can then run right after that one, provided the host is deemed crashed.

It’s similar to VM HA, which is a hook that triggers on host error; how the error is handled is defined in that hook’s script.
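
A rough sketch of such a hook script, under several assumptions: the hook is defined as a state hook on HOST with STATE = ERROR and ARGUMENTS = "$TEMPLATE", so the script receives the base64-encoded host XML as its first argument; `check_crashed` and `unlock_images` stand for your own site-specific scripts and are not part of OpenNebula. Please check the attribute names against the hook documentation for your version.

```shell
#!/bin/sh
# Hypothetical host-error hook script (a sketch, not a tested integration).
# Assumed hook definition, registered with `onehook create <file>`:
#   NAME = host_crash_unlock    TYPE = state       RESOURCE = HOST
#   STATE = ERROR               REMOTE = no        ARGUMENTS = "$TEMPLATE"
#   COMMAND = <path to this script>

# Print the first <NAME> element of a base64-encoded XML document.
# Crude text extraction for illustration; a real XML parser is safer.
host_name_from_template() {
    printf '%s' "$1" | base64 -d | grep -o '<NAME>[^<]*</NAME>' \
        | head -n 1 | sed 's/<[^>]*>//g'
}

# Decide whether the host really crashed, and only then unlock its images.
# check_crashed and unlock_images are placeholders for your own scripts.
handle_host_error() {
    host=$(host_name_from_template "$1") || return 1
    if check_crashed "$host"; then
        unlock_images "$host"
    fi
}

# Run only when the hook actually passed a template argument.
if [ $# -gt 0 ]; then
    handle_host_error "$1"
fi
```

Keeping the "is it really crashed?" decision in its own script means the unlock step never runs for a host that is merely overloaded and lagging.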
