Hello,
While testing VM high availability, we hard-rebooted a host to check that its VM would migrate to another host. The migration worked fine, but the rebooted host is now stuck in the ERROR state.
We tried reinstalling the OpenNebula RPM, then removed the host and re-added it, but neither helped.
We also tried onehost sync --force but that made no difference.
In oned.log we see this:
Wed Aug 21 10:59:50 2019 [Z0][InM][I]: Command execution failed (exit code: 134): 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 5 HV1; else exit 42; fi'
Wed Aug 21 10:59:50 2019 [Z0][InM][I]: /var/tmp/one/im/run_probes: line 34: 17851 Aborted ./$i $ARGUMENTS
Wed Aug 21 10:59:50 2019 [Z0][InM][E]: Error executing collectd-client.rb
Wed Aug 21 10:59:53 2019 [Z0][InM][I]: Command execution failed (exit code: 134): 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 5 HV1; else exit 42; fi'
Wed Aug 21 10:59:53 2019 [Z0][InM][I]: /var/tmp/one/im/run_probes: line 34: 18360 Aborted ./$i $ARGUMENTS
Wed Aug 21 10:59:53 2019 [Z0][InM][E]: Error executing collectd-client.rb
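As far as we can tell, exit code 134 is the usual shell convention of 128 plus a signal number, here 128 + 6, which would be SIGABRT and matches the "Aborted" message from run_probes. Below is a minimal bash sketch for decoding such statuses. The helper name describe_exit is hypothetical and not part of OpenNebula, and the signal numbering assumes Linux:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OpenNebula): decode a shell exit status.
# By shell convention, a status above 128 means the process was
# terminated by signal number (status - 128).
describe_exit() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local signum=$((status - 128))
        # The kill -l builtin maps a signal number to its name (e.g. 6 -> ABRT).
        echo "terminated by signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exited normally with status $status"
    fi
}

describe_exit 134   # the status reported in oned.log
describe_exit 42    # the fallback status used when run_probes is not executable
```

If that reading is right, the probe is being aborted on the host rather than returning an ordinary error, so the failure would be in one of the scripts that run_probes invokes at line 34.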
Our setup consists of two hosts and one frontend, with an FC storage backend using cLVM and a GFS2 filesystem.
Apart from the log entries above, there are no other visible errors.
We see UDP traffic going back and forth between the frontend and the host, so they are talking to each other, but some probe script is failing on the host side.
Any suggestions?
Versions of the related components and OS (frontend, hypervisors, VMs):
OpenNebula 5.8.1
Steps to reproduce:
Hard reboot the host.
Current results:
The host is in the ERROR state; sometimes it moves to RETRY and then back to ERROR.
Expected results:
Status=OK