We are currently investigating an issue in our OpenNebula clusters
that leads to a VMM host “crash” during live migration of VMs.
The “crash” is caused by our environment actively fencing the supposedly failed VMM host via IPMI.
The fencing mechanism is triggered by a HOST_HOOK on "ERROR" in OpenNebula.
The error state of the host is caused by a parsing error in OpenNebula:
Thu Feb 4 15:28:36 2016 [Z0][ONE][E]: Error parsing host information: syntax error, unexpected VARIABLE, expecting EQUAL or EQUAL_EMPTY at line 1, columns 8:14. Monitoring information:
error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E5506 @ 2.13GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2133
TOTALMEMORY=24730300
USEDMEMORY=773364
FREEMEMORY=23956936
FREECPU=776
USEDCPU=24
NETRX=3479713683
NETTX=8991810567
DS_LOCATION_USED_MB=3324
DS_LOCATION_TOTAL_MB=135220
DS_LOCATION_FREE_MB=125005
...
...
For full output see: log_output.txt (25,2 KB)
It seems that the STDOUT/STDERR output of libvirt is passed on to OpenNebula together with the monitoring data, which leads to the parse error.
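To illustrate the suspicion, here is a minimal Python sketch, not OpenNebula's actual parser. It assumes that flat monitoring attributes have the form KEY=VALUE with an upper-case key and treats every other line as stray output. The name `split_monitoring_output` and the regular expression are made up for this example, and nested attributes in the full output are not handled. Notably, the reported position (line 1, columns 8:14) appears to line up with the word "failed" in the first libvirt error line.

```python
import re

# Assumption: a flat monitoring attribute looks like KEY=VALUE with an upper-case key.
ATTR_RE = re.compile(r"^[A-Z][A-Z0-9_]*=")


def split_monitoring_output(text):
    """Separate KEY=VALUE attribute lines from stray lines such as libvirt errors."""
    attributes, stray = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if ATTR_RE.match(line):
            attributes.append(line)
        else:
            stray.append(line)
    return attributes, stray


# Excerpt of the monitoring information from the log message above.
sample = """error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'
ARCH=x86_64
HYPERVISOR=kvm
TOTALCPU=800"""

attributes, stray = split_monitoring_output(sample)
print("stray lines:", stray)
print("attributes:", attributes)
```

With this excerpt, the two `error:` lines end up in `stray`, while the three attribute lines are the only ones a KEY=VALUE parser could accept.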
On the VMM host, the following is reported in the logs:
grep one-247 /var/log/syslog
Feb 4 15:28:33 cloudstage-staging-node02 libvirtd: 884: error : virDBusCall:1537 : error from service: TerminateMachine: No machine 'qemu-one-247' known
Is this a known problem, or just an “individual issue of our environment”?
We are able to reproduce this in our test lab.
However, we run several OpenNebula clusters, and in one datacenter location we cannot reproduce the issue.
All sites, zones and clusters run the same software versions.
Details of our environment:
OS: Debian 8.3 (Jessie)
OpenNebula: 4.14.2-2
QEMU/KVM: 2.1+dfsg-12+deb8u4
Libvirt: 1.2.9-9+deb8u1
Best regards,
Sebastian