VM stuck in RUNNING/POWEROFF cycle

Hi,

Some VMs get into a continual RUNNING/POWEROFF cycle in ON 4.12 (see excerpt from the logs below). Previously we thought it was only Windows machines, but this is not the case; it happens to VMs of any type.

This is very bad for Sunstone users, as they cannot access VNC while the VM is in the POWEROFF state, even though the VM is actually running and can be reached via ssh.

There seem to be multiple instances of run_probes running on the host. Is this correct?

Any help would be much appreciated.

  Regards.
    Gerry

Fri Sep 30 17:32:39 2016 [Z0][LCM][I]: New VM state is RUNNING
Fri Sep 30 17:33:11 2016 [Z0][DiM][I]: New VM state is POWEROFF
Fri Sep 30 17:37:33 2016 [Z0][VMM][I]: VM found again, state is RUNNING
Fri Sep 30 17:37:33 2016 [Z0][LCM][I]: New VM state is RUNNING
Fri Sep 30 17:38:53 2016 [Z0][DiM][I]: New VM state is POWEROFF

Hi, it looks like a problem with monitoring. Which monitoring type do you use, UDP or SSH? And why are you using the relatively old 4.12 release? Monitoring was improved in newer versions.
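To check which driver is configured, something like this should work (Debian default paths assumed; just a sketch, adjust to your install):

```shell
# Show the active IM driver stanza in oned.conf
# (Debian default path; adjust if your install differs):
grep -A4 '^IM_MAD' /etc/one/oned.conf

# Recent monitoring-related messages from the central daemon log:
grep -i 'monitor' /var/log/one/oned.log | tail -n 20
```

The IM_MAD stanza shows whether the collectd (UDP-push) or ssh-pull driver is in use, and oned.log records the monitoring failures centrally, in addition to the per-VM logs.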

Hi Kristian,

As far as I’m aware we use ssh monitoring. Where can I look, apart from the individual machine log quoted above, to see what issues are being recorded? Are there timeout settings that can be modified somewhere?

We are using 4.12 as we are still running on Debian Wheezy. This never happened on earlier versions.

 Regards,
   Gerry

I’m afraid Kristian is right; there were a couple of issues with the state
transitions (some race conditions between the driver callbacks and the
monitoring). I’d strongly suggest upgrading to 5.0; if that involves too
much work, at least 4.14 addresses most of those issues.

Hello Ruben,

The reason we are stuck at 4.12 at the moment is that I believe this is the highest version that will run on Debian Wheezy. Will 4.14 run on Wheezy?

We plan to migrate to Debian Jessie / ON 5. In the meantime, is there any workaround for this issue, e.g. lengthening timeouts?

 Regards,
   Gerry

Hello, I personally think there is no problem running the latest version on Wheezy too.

Try updating the repo config and then upgrading:

echo "deb http://downloads.opennebula.org/repo/5.0/Debian/8 stable opennebula" > /etc/apt/sources.list.d/opennebula.list

But it would be better to upgrade to Jessie; it is relatively simple and safe.

I have done it on several servers without problems.

Hello Ruben,

Below is an example of a subprocess of “ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 kvm”. Am I correct in thinking that we are running in “UDP-push” mode? We have 120+ nodes, so I think we should be running in this mode.

Are there any parameters we can tweak to avoid the race condition you mentioned until we get the opportunity to upgrade to 4.14 or 5? I know we can’t mix hosts running Debian Wheezy and Jessie as the kvm/libvirt versions differ, but is it possible to run 4.14 on Wheezy?

This issue is causing real problems for users and we would like to put in a temporary fix.

    Regards,
      Gerry

oneadmin 11897 12078 0 10:29 pts/1 00:00:00 sh -c ssh -n host128.X.Y.Z 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 140 host128.X.Y.Z; else exit 42; fi' ; echo ExitCode: $? 1>&2
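To see whether several probe runs are overlapping on a host, a plain ps/grep check like this can be used (nothing OpenNebula-specific, just a sketch):

```shell
# List any running monitoring probe processes with their elapsed time;
# several long-lived run_probes instances suggest probe runs overlap
# (each run takes longer than the monitoring interval).
ps -eo pid,etime,args | grep '[r]un_probes' || echo "no run_probes processes"
```

The `[r]un_probes` bracket trick stops grep from matching its own command line, so only real probe processes are listed.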

Try increasing the monitoring interval.

In the IM_MAD section of /etc/one/oned.conf, increase the -i value (the push interval, in seconds) to a few minutes, e.g. 180:

IM_MAD = [
      NAME       = "collectd",
      EXECUTABLE = "collectd",
      ARGUMENTS  = "-p 4124 -f 5 -t 50 -i 20" ]

And also:

MONITORING_INTERVAL = 240
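Putting both changes together, the relevant part of /etc/one/oned.conf would look something like this (the values are examples to tune for your site, not recommended settings):

```
IM_MAD = [
      NAME       = "collectd",
      EXECUTABLE = "collectd",
      ARGUMENTS  = "-p 4124 -f 5 -t 50 -i 180" ]

MONITORING_INTERVAL = 240
```

Restart oned afterwards so the new values take effect.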

This may help in some cases…

Cheers