Hooks stop working and opennebula-hem needs a restart

We already had this problem on 5.10.x, and this week we updated to 5.12.0.3, but it persists: after a few days the hooks stop working. A restart of opennebula-hem remedies the situation.
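
(For reference, the restart I mean is a plain systemd one; service name as on our install:)

systemctl restart opennebula-hem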

What’s going on there?

onehem.log doesn’t say much. Just stops suddenly:
Tue Dec 15 11:41:34 2020 [I]: Executing hook 4 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Executing hook 2 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Executing hook 1 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 2 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 1 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 4 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:36 2020 [I]: Executing hook 4 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:36 2020 [I]: Hook 4 successfully executed for VM/DONE/LCM_INIT

oned.log just says that it crashed… so at least it is detected:
Tue Dec 15 11:40:59 2020 [Z0][SQL][W]: Slow query (0.53s) detected: SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time;
Tue Dec 15 11:41:32 2020 [Z0][HKM][E]: Hook driver crashed, recovering…
Tue Dec 15 11:41:33 2020 [Z0][SQL][W]: Slow query (0.53s) detected: SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time;
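
(Side note: the slow query itself can be inspected directly against the backend; this is just a sketch assuming the MySQL/MariaDB backend with the default opennebula database name, so adjust the credentials and DB name to your deployment:)

# Hypothetical check of the slow query reported above; assumes a MySQL/MariaDB
# backend and the default opennebula database name.
mysql -u oneadmin -p opennebula -e "EXPLAIN SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time"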

I’m not running with debug enabled though…
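
(If more verbose logs would help, my understanding is that the log level is raised roughly like this; the config keys below are my reading of the stock files, so please double-check them:)

# Assumed knobs for more verbose logging; verify against your config files.
# In /etc/one/oned.conf:          DEBUG_LEVEL = 3     (0=ERROR .. 3=DEBUG)
# In /etc/one/onehem-server.conf: :debug_level: 3     (HEM's own verbosity, if present)
# Then restart the services so the change takes effect:
systemctl restart opennebula opennebula-hem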

Hello @tosaraja,

Lately we’ve been working on improving the performance of the hook driver (one_hm), which is the component publishing the events that HEM consumes. It would be great if you could apply the patched version and check whether the error keeps appearing in your environment.

Sure thing. I’ll grab it and apply it soonish (depends on when I want to cause a small break in the CI).

And it’s running. I initially created the patched driver as a new file, which then had to be given +x rights; I forgot that at first.
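
(Roughly what I did, in case someone else wants to try the patch; the path and the double restart are assumptions based on a default package install, so adjust if yours differs:)

# Rough recap of applying the patched driver; path and service names assumed
# from a standard package install where drivers live under /usr/lib/one/mads.
cp one_hm.rb /usr/lib/one/mads/one_hm.rb       # drop in the patched driver
chmod +x /usr/lib/one/mads/one_hm.rb           # the +x bit I forgot at first
systemctl restart opennebula opennebula-hem    # pick up the new driver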

Nice! Thanks for your collaboration. Let us know if the improvements solve the issue.

So far so good. Hasn’t crashed yet.

Thanks for the feedback! Just out of curiosity, do you know how frequently the error appeared before applying the improved version?

Anything from a few times a day to staying up for a week. I just went and checked, and our simple event correlator has restarted opennebula-hem at some point during the last week. Sadly I haven’t done any timestamp logging there, so I can’t say when that happened. I’ll add timestamping to it and let you know the next time this comes up :slight_smile:
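
(Something like this is what I have in mind for the restart action, just a sketch with a made-up log file name:)

# Hypothetical restart action for the event correlator: write a timestamped
# line before restarting so the next crash can be dated.
echo "$(date -Is) restarting opennebula-hem after HEM stopped handling hooks" >> /var/log/one/hem-restarts.log
systemctl restart opennebula-hem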

It seems we’ve had two crashes since last time (the 16th of March): one on the 19th of March and one on the 1st of April. So not that bad anymore, but still, my hook deserves its place :wink:

Hi @tosaraja, nice to hear that it has at least improved. It would be useful if you could provide more information about the crashes so we can keep improving the hook subsystem.

For example, it would be useful to know:

  • What crashed: the onehem service, the one_hm driver, or both?
  • Do you have any information about the system load at the moment of the crash?

Any other information you have and think might be useful is very welcome too.
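
For example, capturing something like this at the moment your watchdog restarts the service would already help (just a sketch; adapt paths and log locations to your install):

# Sketch of a crash-time data collection snippet; log paths assume a default
# install with logs under /var/log/one.
{
  date -Is
  uptime                                        # load averages at crash time
  systemctl status opennebula-hem --no-pager    # did the onehem service die?
  pgrep -af one_hm.rb                           # is the one_hm driver still running?
  tail -n 20 /var/log/one/onehem.log
  grep 'Hook driver crashed' /var/log/one/oned.log | tail -n 5
} >> /tmp/hem-crash-report.txt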

Hello!
Same problem here.
I installed OpenNebula 6.4.0 CE in a test environment.
Hooks were created and tested to confirm they worked.
Some time later I returned to the project, tried to migrate a VM, and the hooks did not work; there are no log entries about this at all.
After restarting opennebula-hem.service I repeated the migration and everything worked again.
The server has been running for 95 days with no errors such as out-of-memory.

Update, 01.05.2023:

#ss -antp|grep :21                                                                                                                                                                                                                             
LISTEN    0      100        127.0.0.1:2101        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=39))                                                                                                                                           
LISTEN    0      100        127.0.0.1:2102        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=41))                                                                                                                                           
ESTAB     0      0          127.0.0.1:59804     127.0.0.1:2101  users:(("ruby",pid=1322,fd=15))                                                                                                                                               
ESTAB     0      0          127.0.0.1:59816     127.0.0.1:2101  users:(("ruby",pid=1327,fd=19))                                                                                                                                               
ESTAB     0      0          127.0.0.1:58300     127.0.0.1:2102  users:(("ruby",pid=1322,fd=16))                                                                                                                                               
ESTAB     0      0          127.0.0.1:2101      127.0.0.1:59804 users:(("one_hm.rb",pid=953,fd=42))                                                                                                                                           
ESTAB     0      0          127.0.0.1:2102      127.0.0.1:58300 users:(("one_hm.rb",pid=953,fd=43))                                                                                                                                           
ESTAB     0      0          127.0.0.1:2101      127.0.0.1:59816 users:(("one_hm.rb",pid=953,fd=44))                                                                                                                                           

After 10 days without stopping or starting any VM, on 11.05.2023:

#ss -antp|grep :21                                                                                                                                                                                                                             
LISTEN    0      100        127.0.0.1:2101        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=39))                                                                                                                                           
LISTEN    0      100        127.0.0.1:2102        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=41))                                                                                                                                           
ESTAB     0      0          127.0.0.1:59804     127.0.0.1:2101  users:(("ruby",pid=1322,fd=15))                                                                                                                                               
ESTAB     0      0          127.0.0.1:59816     127.0.0.1:2101  users:(("ruby",pid=1327,fd=19))                                                                                                                                               
ESTAB     0      0          127.0.0.1:58300     127.0.0.1:2102  users:(("ruby",pid=1322,fd=16))                                                                                                                                               
ESTAB     0      0          127.0.0.1:2102      127.0.0.1:58300 users:(("one_hm.rb",pid=953,fd=43))                                                                                                                                           
ESTAB     0      0          127.0.0.1:2101      127.0.0.1:59816 users:(("one_hm.rb",pid=953,fd=44))                                                                                                                                           

Comparing the two outputs, one of the established connections is missing and HEM no longer works.
The OOM killer was not triggered.
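
(A crude way to automate that comparison, based on the outputs above; the threshold of 3 established connections simply matches the healthy output from 01.05.2023, so it is an assumption for this particular setup:)

# Hypothetical liveness check: count the ESTABLISHED connections held by
# one_hm.rb and restart the HEM service when one of them has disappeared.
ESTAB=$(ss -antp | grep one_hm.rb | grep -c ESTAB)
if [ "$ESTAB" -lt 3 ]; then
    echo "$(date -Is) only $ESTAB one_hm.rb connections left, restarting opennebula-hem"
    systemctl restart opennebula-hem
fi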