Major issue: error from service: CheckAuthorization

Yannick_MOLINET · August 22, 2016, 9:58pm

Hi all,

After some weeks of testing with no major issues, we moved our OpenNebula into production this weekend.
We have 36 VMs on 3 hosts.
Today we have problems:
Each host was detected by OpenNebula as down, although the OS is running correctly and is accessible over ssh.
We have this error in syslog:

Aug 22 22:13:58 adnpvirt07 libvirtd[2117]: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Aug 22 22:13:58 adnpvirt07 libvirtd[2117]: End of file while reading data: Erreur d’entrée/sortie

oneadmin is a member of the libvirt group.
This issue corrupted the running virtual disks and caused some problems in our production.
In addition, the host error HOOK is enabled to provide HA for our hosts.

After rebooting each host (and applying an apt-get upgrade), things seem fine, but I want to understand where the problem is so I can fix it.

We are using OpenNebula 5.0.2

Thanks for your help,
Yannick

Yannick_MOLINET · August 23, 2016, 6:51am

The image corruption seems to happen because the OpenNebula monitoring system detects the host as down while the VMs are still running on it. OpenNebula then restarts the VMs on another host, but the disk images are already in use… I think a fencing method must be used on host monitoring failure to avoid this kind of failure.

Yannick_MOLINET · August 23, 2016, 7:35am

I try to apply this

Yannick_MOLINET · August 23, 2016, 7:43am

A feature request is already open to allow fencing a device from the host hook: http://dev.opennebula.org/issues/4659

atodorov_storpool · August 24, 2016, 9:00am

Hi Yannick,

I’ve extracted the fencing part of the FT host-hook in our addon to the following patch: host_error.rb.patch (1.4 KB)

It adds an additional argument ‘-s’ to host_error.rb:

host_error.rb <other-args> -s /usr/sbin/myfencing-script.sh

Then, on a host failure event, if there are VMs on the host, host_error.rb will call the given script, providing the hostname in the FT_HOSTNAME environment variable. Your /usr/sbin/myfencing-script.sh should look something like:

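#!/bin/bash

PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH

# Power off the failed host through its IPMI interface ("..." stands for your own ipmitool options)
ipmitool ... -H "$FT_HOSTNAME.ipmi.fqdn" chassis power off

# Log the fencing attempt together with the exit status of ipmitool
logger -t "${0##*/}" "fence $FT_HOSTNAME $?"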

Keep in mind that it will be called as the oneadmin user…

Hope this helps,

Anton Todorov

Yannick_MOLINET · August 24, 2016, 9:05am

Hi,

Could you explain how to apply this patch?
In addition, $FT_HOSTNAME is the name of the host that failed, is that correct?
So I can use it (with a case statement) to handle different fencing methods.

Thanks,
Yannick

atodorov_storpool · August 24, 2016, 9:17am

Hi Yannick,

To apply the patch, use the following (adjust the paths if they differ from the defaults):

cd /var/lib/one/remotes/hooks/ft
cp host_error.rb host_error.rb.orig
patch -p0 < /path/to/host_error.rb.patch

Yes. $FT_HOSTNAME is the host name of the failed node as OpenNebula sees it.

Correct. Feel free to change the script to fit your needs.

Kind Regards,
Anton Todorov

atodorov_storpool · August 24, 2016, 9:28am

I think an additional patch is needed to make sure the host hook does not miss a failed node:

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3

change to

# If the host came back, exit! avoid duplicated VMs
exit 0 if host.state != 3 and host.state != 5

where host.state = 5 is MONITORING_ERROR (Monitoring the host from error state)

When a host fails, its state changes to ERROR and the host hook is triggered. But if you enable --pause, the host will alternate between ERROR and MONITORING_ERROR. Without the above change, if the host hook waits some time and then checks the host, the state may be MONITORING_ERROR, and the hook will decide (IMO wrongly) that the host is back and take no action.
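
The effect of this change can be sketched as a pair of small shell functions. This is only an illustration of the logic: the actual check lives in the Ruby hook, and the function names below are hypothetical. The state numbers 3 (ERROR) and 5 (MONITORING_ERROR) are the ones quoted above.

```shell
# Illustration only: the real check lives in the Ruby hook.
# Both functions return 0 (true) when the host should still be treated as failed.

# Original check: only state 3 (ERROR) counts as failed.
still_failed_original() {
    [ "$1" -eq 3 ]
}

# Patched check: state 5 (MONITORING_ERROR) also counts as failed.
still_failed_patched() {
    [ "$1" -eq 3 ] || [ "$1" -eq 5 ]
}
```

For a host in state 5, still_failed_original returns false (the hook would exit without acting), while still_failed_patched returns true and the hook carries on.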

Kind Regards,
Anton Todorov

Yannick_MOLINET · August 24, 2016, 5:56pm

I cannot test this on my production servers, so we must wait until we add new servers to the cluster (next month).
I am thinking of developing a bash script with:

an array with the server hostname as key and the fencing command as value.
a lookup in the array to execute the correct command.
some log messages and checks (such as verifying that the executable or script invoked exists).
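
The three points above could be sketched roughly as follows. This is a minimal sketch, not a tested fencing setup: the host names node-a and node-b, the placeholder commands, the fence-dispatch log tag and all function names are hypothetical. The hostname lookup is written as a case statement instead of a bash associative array so that it also runs under plain sh; FT_HOSTNAME is the variable the patched hook is described as providing.

```shell
#!/bin/sh
# Sketch of a per-host fencing dispatcher. All host names and commands
# below are hypothetical placeholders, not values from this thread.

# Print the fencing command registered for a host (empty if unknown).
fence_command_for() {
    case "$1" in
        node-a) echo "echo power-off node-a via ipmi" ;;
        node-b) echo "echo power-off node-b via pdu" ;;
        *)      echo "" ;;
    esac
}

# Log to syslog when logger is available, otherwise to stderr.
log_msg() {
    if command -v logger >/dev/null 2>&1; then
        logger -t fence-dispatch "$*" 2>/dev/null || echo "$*" >&2
    else
        echo "$*" >&2
    fi
}

# Look up and run the fencing command for the given host.
# Returns 2 for an unknown host, 3 if the command's executable is missing,
# otherwise the exit status of the fencing command itself.
fence_host() {
    fh_cmd=$(fence_command_for "$1")
    if [ -z "$fh_cmd" ]; then
        log_msg "no fencing command defined for '$1'"
        return 2
    fi
    fh_exe=${fh_cmd%% *}
    if ! command -v "$fh_exe" >/dev/null 2>&1; then
        log_msg "fencing executable '$fh_exe' not found for '$1'"
        return 3
    fi
    sh -c "$fh_cmd"
    fh_rc=$?
    log_msg "fence $1 $fh_rc"
    return "$fh_rc"
}

# Run only when the hook has provided a host name.
if [ -n "${FT_HOSTNAME:-}" ]; then
    fence_host "$FT_HOSTNAME"
fi
```

Replacing the placeholder echo commands with real ones (for example an ipmitool call per host) is left to the operator; an unknown host is logged and rejected instead of being silently ignored.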

Yannick

Yannick_MOLINET · August 26, 2016, 12:47pm

This error is back.
It’s a monitoring problem. I have disabled the host error hook, so the VMs are still running on the host and they work fine.
The change made in /etc/libvirt/libvirt.conf on my Debian 8 did not solve the issue.

Any help to fix this issue is welcome.

Thanks,
Yannick

Yannick_MOLINET · August 26, 2016, 12:55pm

I was wrong.
I did not apply the setting in the correct file.
I applied it in /etc/libvirt/libvirt.conf and not in /etc/libvirt/libvirtd.conf.

denis-ldv · August 27, 2016, 7:25pm

Just to be sure, you can use the libvirt disk locking mechanism:
https://libvirt.org/locking.html
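
As far as the libvirt documentation describes it, the lock manager for the QEMU driver is selected with a lock_manager setting in qemu.conf (for example lock_manager = "lockd" for virtlockd). A minimal sketch for checking what a host currently has configured is shown below; the function name is hypothetical, and /etc/libvirt/qemu.conf in the usage note is assumed to be the default location.

```shell
# Print the lock_manager value from a qemu.conf-style file.
# Prints nothing if the setting is absent or commented out.
configured_lock_manager() {
    sed -n 's/^[[:space:]]*lock_manager[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | tail -n 1
}
```

Running configured_lock_manager /etc/libvirt/qemu.conf on each host shows whether a lock manager is enabled there; empty output means the setting is not active in that file.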

jmelis · September 7, 2016, 10:21am

Hi Anton,

Thanks for your contribution and patch. I improved the hook, so your patch doesn’t apply exactly, but I borrowed the same fundamental idea.

https://github.com/OpenNebula/one/blob/master/share/hooks/host_error.rb
https://github.com/OpenNebula/one/blob/master/share/hooks/fence_host.sh

Closing #4659
