We had this problem with 5.10.x version, but this week we updated to 5.12.0.3. The problem is that after a few days, the hooks stop working. A restart of opennebula-hem remedies the situation.
What’s going on there?
onehem.log doesn’t say much. Just stops suddenly:
Tue Dec 15 11:41:34 2020 [I]: Executing hook 4 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Executing hook 2 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Executing hook 1 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 2 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 1 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 4 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:36 2020 [I]: Executing hook 4 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:36 2020 [I]: Hook 4 successfully executed for VM/DONE/LCM_INIT
oned.log just says that it crashed… so at least it is detected:
Tue Dec 15 11:40:59 2020 [Z0][SQL][W]: Slow query (0.53s) detected: SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time;
Tue Dec 15 11:41:32 2020 [Z0][HKM][E]: Hook driver crashed, recovering…
Tue Dec 15 11:41:33 2020 [Z0][SQL][W]: Slow query (0.53s) detected: SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time;
Lately we’ve been working on improving the performance of the hook driver (one_hm) which is the one publishing the events that HEM consume. It would be nice if you can apply the patched version and check if the error keep appearing in your environment.
All between a few times a day to it being up a week. I just went and checked, and our simple event correlator has restarted opennebula-hem during the last week at some point. Sadly I haven’t done any timestamp logging there, so can’t say when that happened. I’ll add timestamping to it and inform you the next time this comes up
Seems we’ve had 2 crashes since last time (the 16th of March). We had a crash on the 19th of March and one on the 1st of April. So not that bad anymore, but still, my hook deserves its place
Hi @tosaraja, nice to hear that at least it have been improved. It would be useful if you can provide more information regarding the crashes so we can try to keep improve the hook subsystem.
For example it would be useful to know:
what crashed one-hem service, one_hm driver or both?
do you have some information regarding the load of the system at the crash moment?
Any other information you have and think that might be useful will be very welcome too.
Hello!
Same problem.
Installed in the test env opennebula 6.4.0 CE.
Hooks were created and tested to work.
Over some time to returned to the project, I’m trying to migrate the VM and the hooks do not work,
there are no entries in the logs at all about this.
After restarting opennebula-hem.service, I repeat the migration and everything works again.
The server has been running for 95 days with no errors such as out of memory.