Monitoring Problem (?): Error parsing host information: syntax error, unexpected VARIABLE, expecting EQUAL or EQUAL_EMPTY

We are currently investigating an issue in our OpenNebula clusters
that leads to a VMM host “crash” during live migration of VMs.

The “crash” is caused by active fencing of the failed VMM host via IPMI in our environment.
The fencing is triggered by a HOST_HOOK on "ERROR" in OpenNebula.
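For context, a HOST_HOOK on "ERROR" is declared in oned.conf roughly like this (a sketch based on the stock 4.x sample configuration; in our setup the `command` points at our IPMI fencing script instead of the shipped `ft/host_error.rb`):

```
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -m -p 5",
    remote    = "no" ]
```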

The error state of the host is caused by a parsing error in OpenNebula:

Thu Feb  4 15:28:36 2016 [Z0][ONE][E]: Error parsing host information: syntax error, unexpected VARIABLE, expecting EQUAL or EQUAL_EMPTY at line 1, columns 8:14. Monitoring information: 
error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2133
TOTALMEMORY=24730300
USEDMEMORY=773364
FREEMEMORY=23956936
FREECPU=776
USEDCPU=24
NETRX=3479713683
NETTX=8991810567
DS_LOCATION_USED_MB=3324
DS_LOCATION_TOTAL_MB=135220
DS_LOCATION_FREE_MB=125005
...
...

For full output see: log_output.txt (25.2 KB)

It seems that STDOUT/STDERR of libvirt is reported to OpenNebula, which leads to the parse error.
On the VMM host, the following is reported in the logs:

grep one-247 /var/log/syslog
Feb  4 15:28:33 cloudstage-staging-node02 libvirtd: 884: error : virDBusCall:1537 : error from service: TerminateMachine: No machine 'qemu-one-247' known
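To see why a stray libvirt error line breaks the monitoring parser (which expects only KEY=VALUE pairs), here is a minimal shell sketch; `probe_output` is a hypothetical stand-in for an IM probe running against a domain that has just migrated away:

```shell
# Minimal sketch: why stderr leaking into a probe breaks oned's parser.
probe_output() {
    echo "error: failed to get domain 'one-247'" >&2   # libvirt noise on stderr
    echo "USEDMEMORY=773364"                           # the real KEY=VALUE data
}

# If stderr is merged into the output (as the oned.log excerpt suggests),
# the first token is "error:", not a VARIABLE=VALUE pair:
probe_output 2>&1

# Discarding stderr keeps the output parseable:
probe_output 2>/dev/null
```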

Is this a known problem, or just an individual issue of our environment?

We are able to reproduce this in our testlab.

But: We run several OpenNebula clusters and in one datacenter location we cannot reproduce this issue.
Every site, zone and cluster is operated with the same versions.

Details of our environment:

OS: debian 8.3 (Jessie)
OpenNebula: 4.14.2-2
QEMU/KVM: 2.1+dfsg-12+deb8u4
Libvirt: 1.2.9-9+deb8u1

Best regards,

Sebastian

Hello,

I just encountered the same issue.

OS: Ubuntu Trusty
OpenNebula: 4.14.2-2
Qemu/KVM: 2.0.0+dfsg-2ubuntu1.22
Libvirt: 1.2.2-0ubuntu13.1.16 

Should we open a bug report?

Regards.

I found that the host in question has a VM running but without its corresponding /var/lib/one/datastores/<SYSTEM>/<VMID>/disk.* files.

It was due to running a VM without the workaround in http://dev.opennebula.org/issues/4316.

Regards.

I just tried it again in our testlab with the same software versions as operated in our production environment. The issue is easy to reproduce.

Description of my little test:

2 VMM Hosts (192.168.30.146 & 192.168.30.147), 10 running VMs.
Start a simple live migration of 4 VMs from one host to the other.
This causes the parse error. I sniffed all commands executed on the hosts with “snoopy”.

See some of my debugging output (Syslog):

1.) Live-Migration of VM “one-247” to VMM Host “192.168.30.145” starts:

Feb 7 17:18:52 cloudstage-staging-node03 snoopy[15098]: [uid:9869 sid:15088 tty: cwd:/var/lib/one filename:/usr/bin/virsh]: virsh --connect qemu:///system migrate --live one-247 qemu+ssh://192.168.30.145/system

2.) oned calls for VM Monitoring of VM “one-247” after migration is started:

Feb 7 17:18:55 cloudstage-staging-node03 snoopy[15311]: [uid:9869 sid:1595 tty: cwd:/var/tmp/one/im/kvm-probes.d filename:/usr/bin/virsh]: virsh --connect qemu:///system --readonly dominfo one-247

3.) VM “one-247” has been migrated to VMM host “192.168.30.145”:

Feb 7 17:18:56 cloudstage-staging-node03 systemd-machined[2446]: Machine qemu-one-247 terminated.

4.) oned calls again for VM monitoring after the live migration completed:

Feb  7 17:18:56 cloudstage-staging-node03 snoopy[15369]: [uid:9869 sid:15088 tty: cwd:/var/lib/one filename:/usr/bin/virsh]: virsh -c qemu:///system dumpxml one-247
Feb  7 17:18:58 cloudstage-staging-node03 snoopy[15504]: [uid:9869 sid:1595 tty: cwd:/var/tmp/one/im/kvm-probes.d filename:/usr/bin/virsh]: virsh --connect qemu:///system --readonly dumpxml one-247

This causes the error (the output is reported to oned, which triggers the parse error):

root@cloudstage-staging-node03(~)# virsh --connect qemu:///system --readonly dumpxml one-247
error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'
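Until this is fixed upstream, a guard along these lines would avoid polluting the probe output with errors for a domain that has already left the host (a hypothetical sketch; `safe_dumpxml` is my own name, not part of OpenNebula):

```shell
# Hypothetical guard: only call dumpxml when libvirt still knows the domain,
# so a just-migrated VM does not inject "error: ..." lines into probe output.
safe_dumpxml() {
    dom="$1"
    if virsh --connect qemu:///system --readonly domstate "$dom" >/dev/null 2>&1; then
        virsh --connect qemu:///system --readonly dumpxml "$dom"
    fi
}

safe_dumpxml one-247    # prints nothing if one-247 is gone, the XML otherwise
```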

Output of OpenNebula “oned.log”:

Sun Feb  7 17:18:59 2016 [Z0][ONE][E]: Error parsing host information: syntax error, unexpected VARIABLE, expecting EQUAL or EQUAL_EMPTY at line 1, columns 8:14. Monitoring information: 
error: failed to get domain 'one-247'
error: Domain not found: no domain with matching name 'one-247'

I think this is a bug. What do you think?
Why is OpenNebula querying the migrated VM on the host it just left?

Best regards,

Sebastian.

Hi,

In our test lab we see this behavior from time to time, and we also think there is a glitch: the host is in error state only for a fraction of time (but long enough to trigger the host_hook) and then it switches to MONITOR_ERROR state.

As we are focused primarily on our storage addon (and there are enough complications there), we have not paid enough attention to this issue and only tried to reduce the impact of the false positives. (I was thinking it was because of the libvirtd authentication timeout issue.)

In the host_hook script we developed for VM migration we do an extra check: are there VMs in RUNNING/UNKNOWN (VM STATE/LCM_STATE) state on this host? Only then do we “start the game”. Our observation is that the VMs on the host are marked as failed only after two monitoring cycles have reported an error, i.e. when there is indeed an error.

You can try something similar until there is a proper upstream solution.
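A rough sketch of such a check (my own sketch, not our actual script; it assumes the stock `onevm` CLI and the default 4.x list layout with STAT in column 5 and HOST in column 8, which may differ in your setup):

```shell
# Hedged sketch of the extra host_hook check described above: count VMs
# on the failed host that are still RUNNING ("runn") or UNKNOWN ("unkn").
# Assumes the default `onevm list` column layout (STAT=$5, HOST=$8).
active_vms_on_host() {
    host="$1"
    onevm list 2>/dev/null \
        | awk -v h="$host" '$8 == h && ($5 == "runn" || $5 == "unkn") {n++} END {print n+0}'
}

# In the hook, fence only when the check finds active VMs:
# [ "$(active_vms_on_host "$FAILED_HOST")" -gt 0 ] && start_fencing
```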

Side note: in the mentioned host_hook script we do some things that look weird at first glance, but for reasons unrelated to this topic.

Kind Regards,
Anton Todorov

Maybe we should open a bug report?

> Maybe we should open a bug report?

I tried, but the Redmine is not responding when I file a new issue :-/

Regards.
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

It does work for me (over https too). Can you try again?

Jaime Melis forum@opennebula.org writes:

> It does work for me (over https too). Can you try again?

Ok, I switched to https and it works[1].

I tried several times over HTTP since my post here and it did not work.

Thanks.

Footnotes:
[1] https://dev.opennebula.org/issues/4329

Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF