V5.0.0-1: [VirtualMachineInfo] Error getting virtual machine

Hi,

I just had a power supply failure on servers running OpenNebula 5.0.0-1. After the reboot, I have a VM stuck in the RUNNING state although it is not actually running. Every action on that VM fails with the message:

[VirtualMachineInfo] Error getting virtual machine 10

I have suspicious messages in oned.log:

Wed Jun 22 18:13:19 2016 [Z0][ONE][E]: SQL command was: SELECT body FROM history WHERE vid = 10 AND seq = 5, error: callback requested query abort
Wed Jun 22 18:13:19 2016 [Z0][ONE][E]: SQL command was: SELECT body FROM history WHERE vid = 13 AND seq = 1, error: callback requested query abort

The log of the affected VM has not been updated since the crash:

Mon Jun 13 12:20:36 2016 [Z0][VM][I]: New LCM state is BOOT_POWEROFF
Mon Jun 13 12:20:37 2016 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/10/deployment.6
Mon Jun 13 12:20:37 2016 [Z0][VMM][I]: ExitCode: 0
Mon Jun 13 12:20:37 2016 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: ExitCode: 0
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: ExitCode: 0
Mon Jun 13 12:20:38 2016 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon Jun 13 12:20:38 2016 [Z0][VM][I]: New LCM state is RUNNING

I see the disk is still present on the node:

ls /var/lib/one/datastores/102/10/
deployment.0 deployment.2 deployment.4 deployment.6 disk.1
deployment.1 deployment.3 deployment.5 disk.0 disk.1.iso

but the VM is not running at all; the domain doesn't even exist (no trace of one-10 with virsh list --all).

Any clue?

Thanks for your time! (and for that great project)

I assume you are using MySQL for the database. If so, stop all the OpenNebula services and run a DB check/repair first. After that, start the OpenNebula services and monitor the logs.

Hi,

Thanks for your answer. I currently use SQLite. The integrity check seems OK:

sqlite> pragma integrity_check;
ok
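Note that pragma integrity_check only verifies the structure of the database file, not what is stored in the rows. As far as I understand, "callback requested query abort" is what SQLite reports when the calling application's row callback asks to stop, so the history rows named in oned.log may hold something oned cannot read. Below is a minimal sketch for looking at those rows. It assumes the history table has the vid, seq and body columns seen in the log and that body is supposed to contain XML; the function name is made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def find_bad_history_rows(db_path, vids):
    """Return (vid, seq, problem) for history rows whose body is empty or not XML.

    Assumption: history.body is expected to hold an XML document.
    """
    vids = list(vids)
    if not vids:
        return []

    problems = []
    placeholders = ",".join("?" for _ in vids)
    query = (
        "SELECT vid, seq, body FROM history "
        f"WHERE vid IN ({placeholders}) ORDER BY vid, seq"
    )
    conn = sqlite3.connect(db_path)
    try:
        for vid, seq, body in conn.execute(query, vids):
            if not body:
                problems.append((vid, seq, "empty body"))
                continue
            try:
                ET.fromstring(body)  # only checks that the body parses
            except ET.ParseError as exc:
                problems.append((vid, seq, f"unparseable XML: {exc}"))
    finally:
        conn.close()
    return problems
```

Run it against a copy of the database, for example find_bad_history_rows("one.db", [10, 13]). It only reads, so it changes nothing, but it also cannot tell whether a body that parses is valid from OpenNebula's point of view.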

Any idea?

Thanks,

try:

onedb fsck [your options]

Hi,

It obviously had an effect:

onedb fsck -s one.db
Sqlite database backup stored in /var/lib/one/one.db_2016-6-24_12:2:55.bck
Use ‘onedb restore’ or copy the file back to restore the DB.

Host 1 RUNNING_VMS has 3 is 2
VM 11 is in Host 1 VM id list, but it should not
Host 1 CPU_USAGE has 325 is 125
Host 1 MEM_USAGE has 5373952 is 2228224
Image 1 RUNNING_VMS has 2 is 1
VM 11 is in Image 1 VM id list, but it should not
VNet 0 AR 0 has leased 10.4.3.91 to VM 11, but it is actually free
VNet 0 has 4 used leases, but it is actually 3
VNet 1 AR 1 has leased 10.88.50.62 to VM 11, but it is actually free
VNet 1 has 5 used leases, but it is actually 4
VNet 3 AR 0 has leased 10.88.12.60 to VM 11, but it is actually free
VNet 3 has 1 used leases, but it is actually 0

Total errors found: 12
Total errors repaired: 12
Total errors unrepaired: 0
A copy of this output was stored in /var/log/one/onedb-fsck.log

However, the VM is still "frozen" and I can't do anything about it.

Thanks,

I have had a similar problem since the upgrade from 4.12 to 5. Some machines worked fine, but others were stuck with the same errors as bpetit's.

Running fsck did not output any errors, and restoring the DB and running the upgrade again didn't help.
The way I solved it (I don't think this is the correct way to do it, so try at your own risk) was to:

  1. Stop OpenNebula
  2. Back up the database
  3. Delete the entries that cause the error from the history table. In your case: DELETE FROM history WHERE vid IN (10, 13);
  4. Start OpenNebula, and power on the VM(s) in question

Hi,

It works! Thank you!
This is a bug, isn't it? Should I report it?

Thanks,

Hey!

Just wanted to quickly thank axc.
I encountered the same issue after a pretty botched (my own fault) upgrade from 4.12 to 5.0, and deleting the entries in history (SQLite table) for the troublesome VMs worked wonders.

Cheers,
-Andrew