Host failure handling interval

RaveNoX · February 26, 2015, 10:33am

Hello,
I have an OpenNebula setup with this configuration:
3 hosts (one of these nodes also runs oned) with Ubuntu Server 14.04 LTS, KVM as the hypervisor, and Open vSwitch for networking.
Storage is Ceph.

And I have some issues with host failure detection.

My test case: deploy a VM to one of the hosts; after the VM has booted successfully, I log in to that host while it is running and shut it down (by running the poweroff command).

After that, the host changes state from ok to updated, and only after a long time (30-40 minutes) does it go to retry.

How can I shorten the time it takes for a host to go from updated to the retry / error state after it has been shut down?

UPD:

Timers configuration:

MANAGER_TIMER = 5

MONITORING_INTERVAL = 10
MONITORING_THREADS  = 50

HOST_PER_INTERVAL               = 15
HOST_MONITORING_EXPIRATION_TIME = 1800
#HOST_MONITORING_EXPIRATION_TIME = 43200

#VM_INDIVIDUAL_MONITORING      = "no"
VM_PER_INTERVAL               = 30
VM_MONITORING_EXPIRATION_TIME = 1800
#VM_MONITORING_EXPIRATION_TIME = 14400

IM Configuration:

IM_MAD = [
      name       = "collectd",
      executable = "collectd",
      arguments  = "-p 4124 -f 2 -t 50 -i 5" ]

IM_MAD = [
      name       = "kvm",
      executable = "one_im_ssh",
      arguments  = "-r 3 -t 15 kvm" ]

RPC configuration:

MAX_CONN           = 50
MAX_CONN_BACKLOG   = 50
KEEPALIVE_TIMEOUT  = 15
KEEPALIVE_MAX_CONN = 50
TIMEOUT            = 15
RPC_LOG            = NO
#MESSAGE_SIZE       = 1073741824
#LOG_CALL_FORMAT    = "Req:%i UID:%u %m invoked %l"

How can I get a reaction to a host failure within 1-2 minutes?

anandharaj · February 27, 2015, 2:30am

Well, the default interval is 15 minutes, so you can simply change that to 2 minutes. But remember that a network interruption or lag can then trigger a false failure, and the VMs will be re-created or deleted according to the VM hook you have defined.

ruben · March 11, 2015, 9:26am

Hosts are monitored through the collectd probe, which sends information every 5 seconds in your conf (the -i argument in the collectd IM_MAD).

If everything works, your hosts will be monitored and their information updated every 5 seconds. However, if no information is received during a MONITORING_INTERVAL, an active probe is sent (one that executes and restarts collectd).

When this last action fails, the host is moved to the error state. You should check the logs and timestamps to verify this. I guess the last step is taking a long time to time out, although 30 min is way too much…
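
The sequence described above can be written down as a small decision function. This is only a simplified model of the explanation in this post, not OpenNebula's actual code: the function name, the state labels and the probe_ok callback are made up for illustration.

```python
def monitor_step(seconds_since_update, monitoring_interval, probe_ok):
    """Return an illustrative host state after one monitoring pass.

    seconds_since_update: time since collectd last delivered data
    monitoring_interval:  value of MONITORING_INTERVAL in oned.conf
    probe_ok:             callable standing in for the active probe
                          (the one that executes and restarts collectd)
    """
    if seconds_since_update < monitoring_interval:
        # collectd data arrived in time, nothing else to do
        return "MONITORED"
    # No data within the interval: fall back to the active probe
    if probe_ok():
        return "MONITORED"
    # The active probe failed as well
    return "ERROR"
```

With this model, the time to reach the error state is roughly MONITORING_INTERVAL plus however long the active probe takes to fail, which is why a slow timeout in that last step dominates.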

Cheers

RaveNoX · March 16, 2015, 3:12pm

Thank you for the detailed explanation.

The 30-minute delay was fixed by adding these lines to the oneadmin ssh config:

ConnectTimeout 3
ConnectionAttempts 1

It would be great to add this to the official documentation.
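
To check how fast ssh gives up on a powered-off host with these options, something like the following sketch can be run as oneadmin on the front-end. It assumes the OpenSSH client is installed; the helper names and the host name "node1" are placeholders, and BatchMode=yes is an extra option added here only so the check never waits for a password prompt.

```python
import subprocess
import time


def ssh_probe_command(host, connect_timeout=3, attempts=1):
    """Build an ssh command that fails fast when the host is unreachable."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", f"ConnectionAttempts={attempts}",
        "-o", "BatchMode=yes",  # assumption: never prompt for a password
        host,
        "true",  # trivial remote command, only the exit status matters
    ]


def probe(host, **options):
    """Run the probe and return (reachable, elapsed_seconds)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            ssh_probe_command(host, **options),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        reachable = result.returncode == 0
    except OSError:
        reachable = False  # ssh binary missing or not executable
    return reachable, time.monotonic() - start
```

Calling probe("node1") against a host that has been shut down should return False after a few seconds rather than after the system's default TCP timeout.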

ruben · March 16, 2015, 4:57pm

Makes sense, the issue is here:

http://dev.opennebula.org/issues/3685

Thanks

Ruben
