I just had a power supply failure on servers running ONE 5.0.0-1. After the reboot, I have a VM stuck in the RUNNING state, although it is not actually running. Every action on that VM fails with the message:
Wed Jun 22 18:13:19 2016 [Z0][ONE][E]: SQL command was: SELECT body FROM history WHERE vid = 10 AND seq = 5, error: callback requested query abort
Wed Jun 22 18:13:19 2016 [Z0][ONE][E]: SQL command was: SELECT body FROM history WHERE vid = 13 AND seq = 1, error: callback requested query abort
The logs of the affected VM have not been updated since the crash:
Mon Jun 13 12:20:36 2016 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Mon Jun 13 12:20:37 2016 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/10/deployment.6
Mon Jun 13 12:20:37 2016 [Z0][VMM][I]: ExitCode: 0
Mon Jun 13 12:20:37 2016 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: ExitCode: 0
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: ExitCode: 0
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon Jun 13 12:20:38 2016 [Z0][VM][I]: New LCM state is RUNNING
I assume you are using MySQL for the database. If so, stop all the OpenNebula services and run a DB check/repair first. After that, start the OpenNebula services again and monitor the logs.
onedb fsck -s one.db
Sqlite database backup stored in /var/lib/one/one.db_2016-6-24_12:2:55.bck
Use ‘onedb restore’ or copy the file back to restore the DB.
Host 1 RUNNING_VMS has 3 is 2
VM 11 is in Host 1 VM id list, but it should not
Host 1 CPU_USAGE has 325 is 125
Host 1 MEM_USAGE has 5373952 is 2228224
Image 1 RUNNING_VMS has 2 is 1
VM 11 is in Image 1 VM id list, but it should not
VNet 0 AR 0 has leased 10.4.3.91 to VM 11, but it is actually free
VNet 0 has 4 used leases, but it is actually 3
VNet 1 AR 1 has leased 10.88.50.62 to VM 11, but it is actually free
VNet 1 has 5 used leases, but it is actually 4
VNet 3 AR 0 has leased 10.88.12.60 to VM 11, but it is actually free
VNet 3 has 1 used leases, but it is actually 0
Total errors found: 12
Total errors repaired: 12
Total errors unrepaired: 0
A copy of this output was stored in /var/log/one/onedb-fsck.log
However, the VM is still “frozen” and I can’t do anything with it.
I had a similar problem since the upgrade from 4.12 to 5. Some machines worked fine, but others were stuck with the same errors as bpetit’s:
Running fsck did not output any errors, and restoring the db and running the upgrade again didn’t help.
The way I solved it, and I don’t think this is a correct way to do it, so try at your own risk, was to:
Stop OpenNebula (one)
Backup the database
Delete the entries that cause the error from the history table - in your case: delete from history where vid in(10, 13)
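For anyone on the SQLite backend, the backup and delete steps above could be sketched roughly as below. This is only an illustration, not an official OpenNebula procedure: the function name `delete_history_rows` and the backup file suffix are made up for the example, and it assumes the `history` table has a `vid` column as shown in the error messages. OpenNebula must already be stopped before running anything like this.

```python
import shutil
import sqlite3
from pathlib import Path


def delete_history_rows(db_path, vm_ids):
    """Copy the SQLite DB file aside, then delete history rows for vm_ids.

    Hypothetical helper: assumes a SQLite backend and a `history` table
    with a `vid` column, as seen in the oned error messages.
    Returns (backup_path, number_of_deleted_rows).
    """
    vm_ids = list(vm_ids)
    if not vm_ids:
        raise ValueError("vm_ids must not be empty")

    db_path = Path(db_path)
    # Back up the database file before touching it (suffix is arbitrary).
    backup_path = db_path.with_name(db_path.name + ".pre-history-delete.bak")
    shutil.copy2(db_path, backup_path)

    # One "?" placeholder per VM id, so the ids are bound, not interpolated.
    placeholders = ",".join("?" for _ in vm_ids)
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commits on success, rolls back if the DELETE fails
            cur = conn.execute(
                f"DELETE FROM history WHERE vid IN ({placeholders})",
                vm_ids,
            )
            deleted = cur.rowcount
    finally:
        conn.close()
    return backup_path, deleted
```

With the path from the fsck output earlier in the thread, the call would look like `delete_history_rows("/var/lib/one/one.db", [10, 13])`. If something goes wrong, copying the backup file back over the database restores the previous state.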
Just wanted to quickly thank axc.
I encountered the same issue after a pretty botched (my own fault) upgrade from 4.12 to 5.0 and deleting entries in history (sqlite table) for the troublesome VMs worked wonders.