After a few weeks of testing with no major issues, we moved our OpenNebula setup into production this weekend.
We have 36 VMs on 3 hosts.
Today, we have a problem:
Each host was detected by OpenNebula as down, but the OS was running correctly and still accessible via SSH.
We have these errors in syslog:
Aug 22 22:13:58 adnpvirt07 libvirtd[2117]: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Aug 22 22:13:58 adnpvirt07 libvirtd[2117]: End of file while reading data: Erreur d’entrée/sortie (Input/output error)
oneadmin is a member of the libvirt group.
This issue caused corruption of the running virtual disks and caused some problems with our production.
In addition, the hook on host error is enabled to activate HA for our hosts.
After rebooting each host (and applying an apt-get upgrade), everything seems fine again, but I want to understand where the problem is so I can fix it.
The image corruption seems to happen because the OpenNebula monitoring system detects the host as down while the VMs are still running on it. OpenNebula then restarts the VMs on other hosts, but the disk images are already in use… I think a fencing method must be used on host monitoring failure to avoid this kind of failure.
Then, on a host failure event, and if there are VMs on it, host_error.rb will call the configured script, providing the hostname in the FT_HOSTNAME environment variable. Your /usr/sbin/myfencing-script.sh should be something like:
#!/bin/bash
# Called by host_error.rb with FT_HOSTNAME set to the failed host.
PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH
# Power off the failed host via IPMI (ipmitool options elided).
ipmitool ... -H "$FT_HOSTNAME".ipmi.fqdn chassis power off
# Log the fencing attempt and its exit status.
logger -t "${0##*/}" "fence $FT_HOSTNAME $?"
Keep in mind that it will be called as the oneadmin user…
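For a quick manual check, you can run it the same way the hook would, as oneadmin and with FT_HOSTNAME set. This is just a suggestion (adnpvirt07 is only the example host from the logs above), and remember the script really powers the target off, so point it at a test host or replace the ipmitool call with an echo first:
sudo -u oneadmin env FT_HOSTNAME=adnpvirt07 /usr/sbin/myfencing-script.sh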
Could you explain how to apply this patch?
In addition, $FT_HOSTNAME is the name of the host that failed, is that correct?
So I can use it (with a case statement) to handle a different fencing method per host, as in the sketch below.
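Something along these lines, as a rough sketch only (the hostnames and the second fencing method are placeholders to replace with your own):
#!/bin/bash
PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH
# Choose the fencing method depending on which host failed.
case "$FT_HOSTNAME" in
    adnpvirt07)
        ipmitool ... -H "$FT_HOSTNAME".ipmi.fqdn chassis power off
        ;;
    host-with-other-bmc)
        # a different out-of-band controller or fence agent goes here
        ;;
    *)
        logger -t "${0##*/}" "no fencing method defined for $FT_HOSTNAME"
        exit 1
        ;;
esac
logger -t "${0##*/}" "fence $FT_HOSTNAME $?"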
I think an additional patch is needed to make sure the host hook does not miss a failed node:
# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3
change to
# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5
where host.state = 5 is MONITORING_ERROR (Monitoring the host from error state)
When a host fails, its state changes to ERROR and the host hook is triggered. But if you enable --pause, the host will alternate between ERROR and MONITORING_ERROR. Without the above change, if the host hook waits some time and then checks the host, the host state may be MONITORING_ERROR, and the hook will decide (IMO wrongly) that the host is back, so no action is taken.
I cannot test this on my production servers, so we must wait until we add new servers to the cluster (next month).
I am thinking of writing a bash script with (see the sketch below):
an array with the server hostname as key and the fencing command as value;
a lookup in the array to execute the correct command;
some log messages and checks (such as verifying that the executable or script invoked exists).
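A minimal sketch of that idea, assuming bash 4 associative arrays; the hostnames and commands are placeholders:
#!/bin/bash
PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH
# Map each hypervisor hostname to its fencing command (placeholder entries).
declare -A FENCE_CMD=(
    [adnpvirt07]="ipmitool ... -H adnpvirt07.ipmi.fqdn chassis power off"
    [adnpvirt08]="ipmitool ... -H adnpvirt08.ipmi.fqdn chassis power off"
)

cmd=${FENCE_CMD[$FT_HOSTNAME]}
if [ -z "$cmd" ]; then
    logger -t "${0##*/}" "no fencing command defined for $FT_HOSTNAME"
    exit 1
fi

# Check that the executable to be invoked actually exists.
exe=${cmd%% *}
if ! command -v "$exe" >/dev/null; then
    logger -t "${0##*/}" "fencing executable $exe not found"
    exit 1
fi

# Simple word splitting is enough for plain commands like these.
$cmd
logger -t "${0##*/}" "fence $FT_HOSTNAME with '$cmd' returned $?"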
This error is back.
It's a monitoring problem. I have disabled the host error hook, so the VMs are still running on the host and they work fine.
The change made in /etc/libvirt/libvirt.conf on my Debian 8 did not solve the issue.