Host failure handling interval

Hello,
I have an OpenNebula setup with this configuration:
3 hosts (one of these nodes runs oned) with Ubuntu Server 14.04 LTS, KVM as the hypervisor, and Open vSwitch for networking.
Storage is Ceph.

And I have some issues with host failure detection.

My test case: I deploy a VM to one of the hosts and, after the VM has successfully booted, I log in to that host while it is running and simply shut it down (by running the poweroff command).

After that, the host changes state from ok to updated, and only after a long time (30-40 minutes) does it go to retry.

How can I shorten the time it takes for a host to go from updated to the retry / error state after it has been shut down?

UPD:

Timers configuration:

MANAGER_TIMER = 5

MONITORING_INTERVAL = 10
MONITORING_THREADS  = 50

HOST_PER_INTERVAL               = 15
HOST_MONITORING_EXPIRATION_TIME = 1800
#HOST_MONITORING_EXPIRATION_TIME = 43200

#VM_INDIVIDUAL_MONITORING      = "no"
VM_PER_INTERVAL               = 30
VM_MONITORING_EXPIRATION_TIME = 1800
#VM_MONITORING_EXPIRATION_TIME = 14400

IM Configuration:

IM_MAD = [
      name       = "collectd",
      executable = "collectd",
      arguments  = "-p 4124 -f 2 -t 50 -i 5" ]

IM_MAD = [
      name       = "kvm",
      executable = "one_im_ssh",
      arguments  = "-r 3 -t 15 kvm" ]

RPC configuration:

MAX_CONN           = 50
MAX_CONN_BACKLOG   = 50
KEEPALIVE_TIMEOUT  = 15
KEEPALIVE_MAX_CONN = 50
TIMEOUT            = 15
RPC_LOG            = NO
#MESSAGE_SIZE       = 1073741824
#LOG_CALL_FORMAT    = "Req:%i UID:%u %m invoked %l"

How can I get a reaction to a host failure within 1-2 minutes?

Well, the default interval is 15 minutes, so you can simply change that to 2 minutes. But keep in mind that a network interruption or lag can then be reported as a false failure, and the VMs will be re-created or deleted according to the hook you define.

Hosts are monitored through the collectd probe, which sends information every 5 seconds in your conf (-i argument in collectd IM_MAD).

If everything works, your hosts will be monitored and their information updated every 5 seconds. However, if no information is received during a MONITORING_INTERVAL, an active probe is sent (one that executes and restarts collectd).

When this last action fails, the host is moved to the error state. You should check the logs and timestamps to verify this. I guess that the last step is taking a long time to time out, although 30 min is way too much…
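
To make the sequence above concrete, here is a minimal Python sketch of the decision logic as described in this reply. It is not OpenNebula's actual code: the function name `next_monitor_action` and the returned labels are made up for illustration, and the default of 10 seconds is simply the MONITORING_INTERVAL from the configuration posted above.

```python
def next_monitor_action(seconds_since_update, monitoring_interval=10,
                        probe_succeeded=None):
    """Illustrative only: mirrors the monitoring flow described above.

    seconds_since_update: time since collectd last pushed data for the host.
    monitoring_interval:  MONITORING_INTERVAL from oned.conf (10 in this thread).
    probe_succeeded:      None if no active probe has finished yet,
                          otherwise True or False.
    """
    if seconds_since_update < monitoring_interval:
        # collectd data is still fresh, nothing to do
        return "ok"
    if probe_succeeded is None:
        # no update within the interval: fall back to the active probe
        return "send_active_probe"
    if probe_succeeded:
        # the probe ran and restarted collectd on the host
        return "ok"
    # the active probe failed as well
    return "error"


print(next_monitor_action(3))
print(next_monitor_action(12))
print(next_monitor_action(12, probe_succeeded=False))
```

In this simplified picture, a powered-off host only reaches the last branch once the active probe has actually given up, so whatever makes that probe slow to fail delays the error state by the same amount.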

Cheers

Thank you for detailed explanation.

The 30-minute interval was fixed by adding these lines to the oneadmin ssh config:

ConnectTimeout 3
ConnectionAttempts 1
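
For anyone who wants to try the effect of these two options before editing the config, here is a small Python sketch that builds an ssh command line passing the same settings with `-o`. `ConnectTimeout`, `ConnectionAttempts` and `BatchMode` are standard OpenSSH client options; the helper names and the host name `node2` are placeholders I made up for illustration.

```python
import shlex
import subprocess
import time


def build_ssh_probe_command(host, connect_timeout=3, attempts=1):
    """Build an ssh command line that gives up quickly on an unreachable host."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", f"ConnectionAttempts={attempts}",
        "-o", "BatchMode=yes",  # never wait for a password prompt
        host,
        "true",                 # trivial remote command
    ]


def time_ssh_probe(host):
    """Run the probe and return (exit_code, elapsed_seconds).

    Needs an ssh client installed on the machine running this script.
    """
    start = time.monotonic()
    result = subprocess.run(build_ssh_probe_command(host), capture_output=True)
    return result.returncode, time.monotonic() - start


# "node2" is a placeholder host name; this only prints the command.
print(shlex.join(build_ssh_probe_command("node2")))
```

Calling `time_ssh_probe` as oneadmin against a powered-off host, with and without the two options, should show how long a single connection attempt takes to fail in each case.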

It would be great to add this to the official documentation.

Makes sense, I have opened an issue here:

http://dev.opennebula.org/issues/3685

Thanks

Ruben