How to recover from FAILED state?

This is on OpenNebula 4.10.0.
The VM was created successfully and I can ssh to it. Then I tried to migrate it to another node, which put the VM into the FAILED state. My question is: what is the right way to recover from FAILED? Any option I choose (PLAY, BOOT, RECOVER) fails with “Wrong state to perform that action”. Should I try DELETE? What does DELETE actually do?

Here is the log from when it went into the FAILED state.
Mon Mar 30 16:14:40 2015 [Z0][DiM][I]: New VM state is ACTIVE.
Mon Mar 30 16:14:40 2015 [Z0][LCM][I]: New VM state is PROLOG.
Mon Mar 30 16:14:46 2015 [Z0][LCM][I]: New VM state is BOOT
Mon Mar 30 16:14:46 2015 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/81/deployment.0
Mon Mar 30 16:14:46 2015 [Z0][VMM][I]: ExitCode: 0
Mon Mar 30 16:14:46 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Mon Mar 30 16:14:49 2015 [Z0][VMM][I]: ExitCode: 0
Mon Mar 30 16:14:49 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Mon Mar 30 16:14:49 2015 [Z0][VMM][I]: ExitCode: 0
Mon Mar 30 16:14:49 2015 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon Mar 30 16:14:49 2015 [Z0][LCM][I]: New VM state is RUNNING

Mon Mar 30 18:43:44 2015 [Z0][LCM][I]: New VM state is SAVE_MIGRATE
Mon Mar 30 18:44:35 2015 [Z0][VMM][I]: ExitCode: 0
Mon Mar 30 18:44:35 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Mon Mar 30 18:44:35 2015 [Z0][VMM][I]: ExitCode: 0
Mon Mar 30 18:44:35 2015 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Mon Mar 30 18:44:35 2015 [Z0][LCM][I]: New VM state is PROLOG_MIGRATE
Mon Mar 30 18:44:36 2015 [Z0][LCM][I]: New VM state is BOOT
Mon Mar 30 18:44:36 2015 [Z0][VMM][I]: ExitCode: 0
Mon Mar 30 18:44:36 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Mon Mar 30 18:44:36 2015 [Z0][VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore '/var/lib/one//datastores/0/81/checkpoint' 'node-002.dc1.xxx.com' 'one-81' 81 node-002.dc1.xxx.com
Mon Mar 30 18:44:36 2015 [Z0][VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/0/81/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/0/81/checkpoint
Mon Mar 30 18:44:36 2015 [Z0][VMM][I]: error: unsupported configuration: Unable to find security driver for label apparmor
Mon Mar 30 18:44:36 2015 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/0/81/checkpoint
Mon Mar 30 18:44:36 2015 [Z0][VMM][I]: ExitCode: 1
Mon Mar 30 18:44:36 2015 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Mon Mar 30 18:44:36 2015 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/81/checkpoint
Mon Mar 30 18:44:36 2015 [Z0][DiM][I]: New VM state is FAILED

Use “delete and recreate”

Hi,

Unfortunately, once the VM enters the FAILED state it can’t be recovered. The only option is to shut down the VM manually and then delete it from OpenNebula.

We are working to improve this in the next release. See #3654 for details.
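
In case it helps, here is a rough sketch of what “shut down the VM manually, then delete it from OpenNebula” can look like on a KVM/libvirt node; the domain name one-81 and VM ID 81 are taken from the log above, so adjust them to your VM:

    # on the node that still holds the domain: check it and force it off
    virsh --connect qemu:///system list --all
    virsh --connect qemu:///system destroy one-81

    # back on the front-end: remove the VM from OpenNebula
    onevm delete 81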

Hello,
I’ve just solved this a moment ago.
First of all, I use ONE v4.12 (Debian 7) + MySQL, a shared LVM datastore, and Xen 4.5.

  1. When the VM got the FAILED status, I tried to run it by talking to the hypervisor directly,
    i.e. xl create /Path/to/vm/config
    The VM config file contained the options that fit the VM (MAC, disk, RAM, network interface); a rough sketch of such a config is shown after this list.
    Finally the VM started successfully.
  2. Turn off the VM.
  3. Stop all opennebula services.
  4. Now start working with the DB.
  • Just in case, make a full DB backup first (an example mysqldump command is sketched after this list).
    Below I show step-by-step WHAT was changed and WHERE.
    Here VM_ID is the ID of the FAILED VM:
    mysql -u root -h <MYSQL_IP> -p
    mysql> delete from vm_monitoring where vmid=VM_ID; # I’m not sure whether this is really needed, but I don’t want to run this kind of “test” one more time.
    mysql> update vm_pool set state=8 where oid=VM_ID;
  5. Start the OpenNebula services again.
  6. Check the VM state; it should be “POWEROFF”.
  7. Try to start the VM (in my case it looked like a regular start of a regular VM).
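
For reference, the xl config from step 1 looked roughly like the sketch below. Treat it only as an illustration: the name, memory, LV path, MAC and bridge are placeholders and have to match what OpenNebula originally deployed for the VM (check the deployment file OpenNebula generated and the LVs on your shared LVM datastore):

    # /Path/to/vm/config -- minimal PV guest config for xl (Xen 4.5), example values only
    name       = "one-81"
    memory     = 2048
    vcpus      = 2
    bootloader = "pygrub"
    disk       = [ 'phy:/dev/vg-one/lv-one-81-0,xvda,w' ]   # the VM's LV on the shared LVM datastore
    vif        = [ 'mac=02:00:0a:00:00:51,bridge=br0' ]     # MAC must match the one OpenNebula assigned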
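
As for the DB backup in step 4, I mean a plain mysqldump of the whole OpenNebula database before touching anything by hand (I’m assuming the default database name opennebula here):

    # full backup of the OpenNebula DB before editing it manually
    mysqldump -u root -h <MYSQL_IP> -p opennebula > /root/opennebula-backup.sql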

Hope it helps!
P.S. Message for developers: Guys, it’s really a great cloud solution, one of the best, but offering users only RECREATE for a failed VM is a horrible solution! A VM can fail for network, storage, etc. reasons, but if the environment can be repaired, the failed VM should be recoverable too.

ONE 4.14 already has tools for recovering a FAILED VM, but I can’t install it because it requires upgrading to Jessie, and that release doesn’t have the Pacemaker package which I use.
http://docs.opennebula.org/4.14/release_notes/release_notes/whats_new.html
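
As far as I understand from the docs (I haven’t been able to test it myself), the recovery in 4.14+ is exposed through onevm recover, something along the lines of:

    onevm recover 81 --retry     # retry the failed action
    onevm recover 81 --success   # or force the pending operation to be treated as successful
    onevm recover 81 --failure   # or as failed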