Sometimes when migrating “wild” VMs, the run_probes script fails and OpenNebula thinks the host is in an error state. This becomes a problem when I migrate many VMs: if run_probes fails 3 times in a row, fencing is triggered and my host is rebooted.
Versions of the related components and OS (frontend, hypervisors, VMs):
OpenNebula 5.4.13
CentOS 7
Steps to reproduce:
Migrate VMs. While the migration is in progress (I don’t know exactly the right moment), run the run_probes script many times, and you will sometimes get the following error:
[root@ord-virt-004 ~]# /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 2 ord-virt-004
../../vmm/kvm/poll:403:in `xml_to_one': undefined method `text' for nil:NilClass (NoMethodError)
    from ../../vmm/kvm/poll:152:in `block in get_all_vm_info'
    from ../../vmm/kvm/poll:134:in `each'
    from ../../vmm/kvm/poll:134:in `get_all_vm_info'
    from /var/tmp/one/vmm/lib/poll_common.rb:99:in `print_all_vm_template'
    from ../../vmm/kvm/poll:531:in `'
ERROR MESSAGE --8<------
Error executing poll.sh
ERROR MESSAGE ------>8--
ERROR MESSAGE --8<------
Error executing collectd-client_control.sh
ERROR MESSAGE ------>8--
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz"
HYPERVISOR=kvm
TOTALCPU=9600
CPUSPEED=2700
TOTALMEMORY=791014980
USEDMEMORY=78110216
FREEMEMORY=712904764
FREECPU=9312
USEDCPU=288
NETRX=221262088274
NETTX=120226307714
KVM_MACHINES="pc-i440fx-rhel7.5.0 pc pc-i440fx-rhel7.0.0 rhel6.3.0 rhel6.4.0 rhel6.0.0 pc-i440fx-rhel7.1.0 pc-i440fx-rhel7.2.0 pc-q35-rhel7.3.0 rhel6.5.0 pc-q35-rhel7.4.0 rhel6.6.0 rhel6.1.0 rhel6.2.0 pc-i440fx-rhel7.3.0 pc-i440fx-rhel7.4.0 pc-q35-rhel7.5.0 q35"
KVM_CPU_MODELS="486 pentium pentium2 pentium3 pentiumpro coreduo n270 core2duo qemu32 kvm32 cpu64-rhel5 cpu64-rhel6 kvm64 qemu64 Conroe Penryn Nehalem Nehalem-IBRS Westmere Westmere-IBRS SandyBridge SandyBridge-IBRS IvyBridge IvyBridge-IBRS Haswell-noTSX Haswell-noTSX-IBRS Haswell Haswell-IBRS Broadwell-noTSX Broadwell-noTSX-IBRS Broadwell Broadwell-IBRS Skylake-Client Skylake-Client-IBRS Skylake-Server Skylake-Server-IBRS athlon phenom Opteron_G1 Opteron_G2 Opteron_G3 Opteron_G4 Opteron_G5 EPYC EPYC-IBPB"
DS_LOCATION_USED_MB=3542
DS_LOCATION_TOTAL_MB=9952
DS_LOCATION_FREE_MB=5883
DS = [ ID = 100, USED_MB = 3542, TOTAL_MB = 9952, FREE_MB = 5883 ]
HOSTNAME=ord-virt-004
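The NoMethodError in the trace suggests that xml_to_one calls `text` on an element lookup that returned nil, possibly because the domain XML is incomplete or already gone while the VM is being migrated. The following is only a minimal sketch of a nil-safe guard, not the actual poll code: `safe_text` and `domain_summary` are hypothetical names, and the `/domain/name` and `/domain/uuid` paths are assumed from the usual libvirt domain XML layout.

```ruby
require 'rexml/document'

# Hypothetical helper: return the text of the first node matching xpath,
# or nil when the node is missing (instead of raising NoMethodError).
def safe_text(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

# Hypothetical stand-in for the XML parsing step. Returns nil when the
# domain XML is empty, malformed, or lacks the required elements, so the
# caller can skip that VM for this monitoring cycle.
def domain_summary(xml)
  return nil if xml.nil? || xml.strip.empty?

  doc  = REXML::Document.new(xml)
  name = safe_text(doc, '/domain/name')
  uuid = safe_text(doc, '/domain/uuid')
  return nil if name.nil? || uuid.nil?

  { name: name, uuid: uuid }
rescue REXML::ParseException
  nil
end

# A complete domain is summarized; a half-gone one is skipped.
complete = '<domain><name>one-42</name><uuid>abc</uuid></domain>'
partial  = '<domain></domain>'
p domain_summary(complete)
p domain_summary(partial)
```

With a guard along these lines, a VM whose XML cannot be read during migration would simply be left out of that poll cycle, rather than making the whole probe exit with an error.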
Current results:
After 3 failed attempts, the fencing mechanism is triggered and the host is rebooted.
Expected results:
The host should not be detected as being in an error state when run_probes fails during a migration.