Running into errors when a hook is executed

Hello, I’m getting the following error in oned.log when the host_error hook is executed:

Command execution failed (exit code: 255): /var/lib/one/remotes/datastore/ceph/monitor PERTX0RSSVZF.....some big encoded stuff
[Z0][ImM][E]: Error monitoring datastore 102: LQ==. Decoded info: -

My OpenNebula cluster consists of 4 machines: one is the master (frontend) and the other 3 are the nodes, which also form their own Ceph cluster. Creating VMs on the Ceph datastore works, and live-migrating a VM from one node to another works as well.

But when I power off, for example, Node 2 while it has a VM running on it, the VM does not get redeployed onto another node unless I choose the “Reschedule” option in Sunstone. It also takes some time until Sunstone recognizes that the node is down and the VM’s state switches to UNKNOWN.
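For reference, what I do by hand in that situation is, as far as I can tell, the CLI equivalent of the Sunstone “Reschedule” button (VM ID 2 is my test VM):

onevm resched 2     # flag the VM for rescheduling so the scheduler picks a new host
onevm list          # watch the VM leave the UNKNOWN state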

I used the existing host_error template and only changed the number of monitoring cycles (the -p argument) from 5 to 2; the full template and how I registered it are shown below:

ARGUMENTS = "$TEMPLATE -m -p 2"
COMMAND   = "/var/lib/one/remotes/hooks/ft/host_error.rb"
NAME      = "host_error"
STATE     = "ERROR"
REMOTE    = "no"
RESOURCE  = HOST
TYPE      = state
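For completeness, this is roughly how I registered the hook on the frontend (host_error.tmpl is just the file name I used locally):

onehook create host_error.tmpl     # registered as hook ID 6 in my setup
onehook list                       # verify the hook shows up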

From the onehem.log file:

BYZW9uKFIpIEUtMjE3NkcgQ1BVIEAgMy43MEdIel1dPjwvTU9ERUxOQU1FPjxOQU1FPjwhW0NEQVRBW25vZGUxXV0+PC9OQU1FPjxSRVNFUlZFRF9DUFU+PCFbQ0RBVEFbXV0+PC9SRVNFUlZFRF9DUFU+PFJFU0VSVkVEX01FTT48IVtDREFUQVtdXT48L1JFU0VSVkVEX01FTT48VkVSU0lPTj48IVtDREFUQVs2LjAuMC4yXV0+PC9WRVJTSU9OPjxWTV9NQUQ+PCFbQ0RBVEFba3ZtXV0+PC9WTV9NQUQ+PC9URU1QTEFURT48TU9OSVRPUklORy8+PC9IT1NUPjwvSE9PS19NRVNTQUdFPg==
Tue Sep 21 13:12:06 2021 [I]: Executing hook 6 for HOST/ERROR/
Tue Sep 21 13:14:06 2021 [E]: Failure executing hook 6 for HOST/ERROR/

onehook show provides the following information:

onehook show 6 -e 0
HOOK 6 INFORMATION
ID                : 6
NAME              : host_error
TYPE              : state
LOCK              : None

HOOK EXECUTION RECORD
EXECUTION ID      : 0
TIMESTAMP         : 09/21 13:14:06
COMMAND           : /var/lib/one/remotes/hooks/ft/host_error.rb
ARGUMENTS         : <HOST>
  <ID>0</ID>
  <NAME>node1</NAME>
  <STATE>3</STATE>
  <PREV_STATE>2</PREV_STATE>
  <IM_MAD><![CDATA[kvm]]></IM_MAD>
  <VM_MAD><![CDATA[kvm]]></VM_MAD>
  <CLUSTER_ID>0</CLUSTER_ID>
  <CLUSTER>default</CLUSTER>
  <HOST_SHARE>
    <MEM_USAGE>8388608</MEM_USAGE>
    <CPU_USAGE>100</CPU_USAGE>
    <TOTAL_MEM>65659868</TOTAL_MEM>
    <TOTAL_CPU>1200</TOTAL_CPU>
    <MAX_MEM>65659868</MAX_MEM>
    <MAX_CPU>1200</MAX_CPU>
    <RUNNING_VMS>1</RUNNING_VMS>
    <VMS_THREAD>1</VMS_THREAD>
    <DATASTORES>
      <DISK_USAGE><![CDATA[0]]></DISK_USAGE>
      <FREE_DISK><![CDATA[49113]]></FREE_DISK>
      <MAX_DISK><![CDATA[51175]]></MAX_DISK>
      <USED_DISK><![CDATA[2063]]></USED_DISK>
    </DATASTORES>
    <PCI_DEVICES/>
    <NUMA_NODES>
      <NODE>
        <CORE>
          <CPUS><![CDATA[0:-1,6:-1]]></CPUS>
          <DEDICATED><![CDATA[NO]]></DEDICATED>
          <FREE><![CDATA[2]]></FREE>
          <ID><![CDATA[0]]></ID>
        </CORE>
        <CORE>
          <CPUS><![CDATA[1:-1,7:-1]]></CPUS>
          <DEDICATED><![CDATA[NO]]></DEDICATED>
          <FREE><![CDATA[2]]></FREE>
          <ID><![CDATA[1]]></ID>
        </CORE>
        <CORE>
          <CPUS><![CDATA[2:-1,8:-1]]></CPUS>
          <DEDICATED><![CDATA[NO]]></DEDICATED>
          <FREE><![CDATA[2]]></FREE>
          <ID><![CDATA[2]]></ID>
        </CORE>
        <CORE>
          <CPUS><![CDATA[3:-1,9:-1]]></CPUS>
          <DEDICATED><![CDATA[NO]]></DEDICATED>
          <FREE><![CDATA[2]]></FREE>
          <ID><![CDATA[3]]></ID>
        </CORE>
        <CORE>
          <CPUS><![CDATA[4:-1,10:-1]]></CPUS>
          <DEDICATED><![CDATA[NO]]></DEDICATED>
          <FREE><![CDATA[2]]></FREE>
          <ID><![CDATA[4]]></ID>
        </CORE>
        <CORE>
          <CPUS><![CDATA[5:-1,11:-1]]></CPUS>
          <DEDICATED><![CDATA[NO]]></DEDICATED>
          <FREE><![CDATA[2]]></FREE>
          <ID><![CDATA[5]]></ID>
        </CORE>
        <HUGEPAGE>
          <FREE><![CDATA[0]]></FREE>
          <PAGES><![CDATA[0]]></PAGES>
          <SIZE><![CDATA[1048576]]></SIZE>
          <USAGE><![CDATA[0]]></USAGE>
        </HUGEPAGE>
        <HUGEPAGE>
          <FREE><![CDATA[0]]></FREE>
          <PAGES><![CDATA[0]]></PAGES>
          <SIZE><![CDATA[2048]]></SIZE>
          <USAGE><![CDATA[0]]></USAGE>
        </HUGEPAGE>
        <MEMORY>
          <DISTANCE><![CDATA[0]]></DISTANCE>
          <FREE><![CDATA[0]]></FREE>
          <TOTAL><![CDATA[65659868]]></TOTAL>
          <USAGE><![CDATA[0]]></USAGE>
          <USED><![CDATA[0]]></USED>
        </MEMORY>
        <NODE_ID><![CDATA[0]]></NODE_ID>
      </NODE>
    </NUMA_NODES>
  </HOST_SHARE>
  <VMS>
    <ID>2</ID>
  </VMS>
  <TEMPLATE>
    <ARCH><![CDATA[x86_64]]></ARCH>
    <CLUSTER_ID><![CDATA[0]]></CLUSTER_ID>
    <CPUSPEED><![CDATA[847]]></CPUSPEED>
    <ERROR><![CDATA[Tue Sep 21 13:12:06 2021 : Error monitoring Host node1 (0): ]]></ERROR>
    <HOSTNAME><![CDATA[node1]]></HOSTNAME>
    <HYPERVISOR><![CDATA[kvm]]></HYPERVISOR>
    <IM_MAD><![CDATA[kvm]]></IM_MAD>
    <KVM_CPU_MODEL><![CDATA[Skylake-Client-IBRS]]></KVM_CPU_MODEL>
    <KVM_CPU_MODELS><![CDATA[486 pentium pentium2 pentium3 pentiumpro coreduo n270 core2duo qemu32 kvm32 cpu64-rhel5 cpu64-rhel6 kvm64 qemu64 Conroe Penryn Nehalem Nehalem-IBRS Westmere Westmere-IBRS SandyBridge SandyBridge-IBRS IvyBridge IvyBridge-IBRS Haswell-noTSX Haswell-noTSX-IBRS Haswell Haswell-IBRS Broadwell-noTSX Broadwell-noTSX-IBRS Broadwell Broadwell-IBRS Skylake-Client Skylake-Client-IBRS Skylake-Server Skylake-Server-IBRS Icelake-Client Icelake-Server athlon phenom Opteron_G1 Opteron_G2 Opteron_G3 Opteron_G4 Opteron_G5 EPYC EPYC-IBPB]]></KVM_CPU_MODELS>
    <KVM_MACHINES><![CDATA[pc-i440fx-rhel7.0.0 pc rhel6.0.0 rhel6.1.0 rhel6.2.0 rhel6.3.0 rhel6.4.0 rhel6.5.0 rhel6.6.0]]></KVM_MACHINES>
    <MODELNAME><![CDATA[Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz]]></MODELNAME>
    <NAME><![CDATA[node1]]></NAME>
    <RESERVED_CPU><![CDATA[]]></RESERVED_CPU>
    <RESERVED_MEM><![CDATA[]]></RESERVED_MEM>
    <VERSION><![CDATA[6.0.0.2]]></VERSION>
    <VM_MAD><![CDATA[kvm]]></VM_MAD>
  </TEMPLATE>
  <MONITORING/>
</HOST> -m -p 2
EXIT CODE         : 255

I hope that someone can help me out 🙂

Thanks in advance

I got a bit further…

I added a new error hook (host_error_1) with the following parameters and options:

ARGUMENTS = "$TEMPLATE -m -p 2"
COMMAND   = "ft/host_error.rb"
ARGUMENTS_STDIN = "yes"
NAME      = "host_error_1"
STATE     = ACTIVE
REMOTE    = "no"
RESOURCE  = VM
TYPE      = state
ON        = CUSTOM
LCM_STATE = UNKNOWN
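I registered it the same way as the first one (again, host_error_vm.tmpl is just my local file name):

onehook create host_error_vm.tmpl   # registered as hook ID 7 in my setup
onehook show 7 -e 0                 # inspect the last execution record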

onehem.log:

...JTUU+PEVTVElNRT4wPC9FU1RJTUU+PEVFVElNRT4wPC9FRVRJTUU+PEFDVElPTj4wPC9BQ1RJT04+PFVJRD4tMTwvVUlEPjxHSUQ+LTE8L0dJRD48UkVRVUVTVF9JRD4tMTwvUkVRVUVTVF9JRD48L0hJU1RPUlk+PC9ISVNUT1JZX1JFQ09SRFM+PC9WTT48L0hPT0tfTUVTU0FHRT4=
Wed Sep 22 10:22:49 2021 [I]: Executing hook 7 for VM/ACTIVE/UNKNOWN
Wed Sep 22 10:24:49 2021 [I]: Hook 7 successfully executed for VM/ACTIVE/UNKNOWN

onehook show:

</VM> -m -p 2
EXIT CODE         : 0
EXECUTION STDOUT
EXECUTION STDERR

But the VM is still not being migrated to another host. I powered off the host the VM was running on, and on the OpenNebula frontend the VM’s state remains UNKNOWN.
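This is roughly what I check after powering the node off (VM ID 2 is my test VM):

onehost list    # the powered-off node eventually goes to ERROR
onevm show 2    # the VM just sits in UNKNOWN instead of being rescheduled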

I just want the VM to be migrated automatically as soon as its original host goes down, for whatever reason.