We are currently investigating an issue in our OpenNebula clusters
that leads to a VMM host “crash” during live migration of VMs.
The “crash” is actually our environment actively fencing the failed VMM host via IPMI.
The fencing mechanism is triggered when the host enters the "ERROR" state in OpenNebula.
The error state of the host is caused by a parsing error in OpenNebula:
Thu Feb 4 15:28:36 2016 [Z0][ONE][E]: Error parsing host information: syntax error, unexpected VARIABLE, expecting EQUAL or EQUAL_EMPTY at line 1, columns 8:14. Monitoring information:
error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E5506 @ 2.13GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2133
TOTALMEMORY=24730300
USEDMEMORY=773364
FREEMEMORY=23956936
FREECPU=776
USEDCPU=24
NETRX=3479713683
NETTX=8991810567
DS_LOCATION_USED_MB=3324
DS_LOCATION_TOTAL_MB=135220
DS_LOCATION_FREE_MB=125005
... ...
For full output see: log_output.txt (25,2 KB)
It seems that the STDOUT/STDERR output of libvirt is passed to OpenNebula together with the monitoring data, which leads to the parse error.
On the VMM host, the following is reported in the logs:
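To illustrate what we think is happening: the monitoring data is expected to consist of KEY=VALUE assignments only, and the reported position (line 1, columns 8:14) appears to match the word "failed" in the first libvirt error line. The following is a minimal Python sketch for spotting such stray lines in captured probe output. The function name `split_monitor_output` and the simplified regular expression are our own assumptions for illustration; this is not OpenNebula's actual parser.

```python
import re

# KEY=VALUE, where VALUE is either a double-quoted string or a run of
# non-space characters. Simplified stand-in for illustration only; this is
# NOT the grammar OpenNebula itself uses.
ASSIGNMENT = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*=(?:"[^"]*"|\S*)$')


def split_monitor_output(text):
    """Split raw probe output into (assignments, stray_lines)."""
    assignments, stray = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if ASSIGNMENT.match(line):
            assignments.append(line)
        else:
            stray.append(line)  # anything that is not KEY=VALUE
    return assignments, stray


# Sample taken from the log excerpt above (shortened).
sample = """error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E5506 @ 2.13GHz"
HYPERVISOR=kvm
"""

ok, stray = split_monitor_output(sample)
print(len(ok), "assignments;", len(stray), "stray lines")
for line in stray:
    print("stray:", line)
```

Run against the excerpt, the two "error:" lines are the only ones that do not look like assignments, which fits the theory that libvirt messages are mixed into the monitoring data.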
grep one-247 /var/log/syslog
Feb 4 15:28:33 cloudstage-staging-node02 libvirtd: 884: error : virDBusCall:1537 : error from service: TerminateMachine: No machine 'qemu-one-247' known
Is this a known problem, or just an “individual issue of our environment”?
We are able to reproduce this in our testlab.
However, we run several OpenNebula clusters, and in one datacenter location we cannot reproduce this issue.
Every site, zone and cluster runs the same software versions.
Details of our environment:
OS: Debian 8.3 (Jessie)