OpenNebula scheduler keeps dying, ONE 4.8

I have an OpenNebula 4.8 installation that has been stable since June 2015.
Over the last few months I have seen that /usr/bin/mm_sched tends to hang. The symptoms
are that after a certain time no new entries appear in sched.log, and newly submitted VMs
stay stuck in the pending state. There are no core dumps and the mm_sched daemon keeps running; it
just does nothing until I restart the opennebula service, which restarts both oned and mm_sched.

Has anyone seen this behavior before? I thought this could be due to large numbers of VMs
being launched through the econe-server interface, but I have also seen it happen when there was no
big load on the system. Also, it tends to happen overnight.

Probably you are being hit by this one:

http://dev.opennebula.org/issues/3390

which also has its “overnight” version http://dev.opennebula.org/issues/4284
:wink:

It seems to be related to an issue with the xmlrpc client. We have rewritten the
client logic to use the advanced functionality to prevent this bug. This is
not an easy backport, so I am not sure whether it will be in the 4.x branch in the
short term.

Meanwhile, note that the scheduler is totally stateless and you can restart
the process in a cron-like job.
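
As a rough illustration of the cron idea, here is a minimal Python sketch of a watchdog that restarts the service only when sched.log has gone quiet. Everything specific in it is an assumption, not something confirmed in this thread: the log path `/var/log/one/sched.log`, the 10 minute threshold, the `service opennebula restart` command, and the premise that a healthy scheduler writes to its log at least that often. Adjust all of these to your installation.

```python
import os
import subprocess
import time

# All three values below are assumptions for illustration; adjust to your site.
SCHED_LOG = "/var/log/one/sched.log"                 # assumed log location
MAX_AGE_SECONDS = 600                                # assumed "too quiet" threshold
RESTART_CMD = ["service", "opennebula", "restart"]   # assumed restart command


def is_stale(last_write, now, max_age):
    """Return True when the last log write is more than max_age seconds old."""
    return (now - last_write) > max_age


def check_and_restart(log_path=SCHED_LOG, max_age=MAX_AGE_SECONDS,
                      restart_cmd=RESTART_CMD, now=None):
    """Run restart_cmd if log_path has not been modified recently.

    Returns True when a restart was attempted, False otherwise.
    """
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # Log missing or unreadable: do not restart blindly.
        return False
    if is_stale(last_write, now, max_age):
        subprocess.call(restart_cmd)
        return True
    return False


if __name__ == "__main__":
    check_and_restart()
```

Run from cron every few minutes, this only acts when the log is stale. Note that restarting the whole service also restarts oned; since the scheduler is stateless, restarting only mm_sched should be enough if your init setup allows it.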

Hi Ruben
Yes, I checked my oned logs and I see the same key message

Tue May 17 03:30:55 2016 [Z0][InM][E]: Information driver crashed, recovering…

at about the time mm_sched stopped functioning, so I think it is fair to say that we are dealing with exactly the same issue here. Thanks for the explanation. When do we expect the 5.0 beta to be ready?

Steve Timm

hopefully tomorrow

One other thing on this: I back-checked my logs for "Information driver crashed, recovering"
and found that in almost all cases it happened at the bottom of the hour, at the time when
my hourly MySQL database backup runs, and it is not uncommon for various operations to take a little while during that time.
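
For anyone who wants to repeat that check, here is a small Python sketch that tallies the crash messages by minute of the hour. It assumes each oned.log line starts with a timestamp in the same format as the line quoted above (e.g. `Tue May 17 03:30:55 2016`) followed by ` [`, and that the process runs with an English/C locale so the day and month names parse. The function name `crash_minutes` is just made up for this example.

```python
from collections import Counter
from datetime import datetime

MARKER = "Information driver crashed"


def crash_minutes(lines):
    """Count crash messages per minute-of-hour (0-59) in oned.log lines."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        # Timestamp is everything before the first " [" (assumed log layout).
        stamp = line.split(" [", 1)[0]
        try:
            when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        except ValueError:
            # Skip lines whose prefix is not a timestamp in the assumed format.
            continue
        counts[when.minute] += 1
    return counts
```

Passing it an open file handle for your oned.log (commonly under /var/log/one/, but that path is an assumption) and printing `counts.most_common()` shows whether the crashes cluster around a particular minute, such as the one when a backup starts.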