Hooks stop working and opennebula-hem needs a restart

We already had this problem on 5.10.x, and this week we updated to 5.12.0.3, but it persists: after a few days the hooks stop working. A restart of opennebula-hem remedies the situation.
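
(For reference, the restart I mean is a plain systemd one; service name as on our install:)

systemctl restart opennebula-hem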

What’s going on there?

onehem.log doesn’t say much. Just stops suddenly:
Tue Dec 15 11:41:34 2020 [I]: Executing hook 4 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Executing hook 2 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Executing hook 1 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 2 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 1 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:34 2020 [I]: Hook 4 successfully executed for VM/DONE/LCM_INIT
Tue Dec 15 11:41:36 2020 [I]: Executing hook 4 for VM/DONE/LCM_INIT
Tue Dec 15 11:41:36 2020 [I]: Hook 4 successfully executed for VM/DONE/LCM_INIT

oned.log just says that it crashed… so at least it is detected:
Tue Dec 15 11:40:59 2020 [Z0][SQL][W]: Slow query (0.53s) detected: SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time;
Tue Dec 15 11:41:32 2020 [Z0][HKM][E]: Hook driver crashed, recovering…
Tue Dec 15 11:41:33 2020 [Z0][SQL][W]: Slow query (0.53s) detected: SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time;
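
(Side note: the slow query itself can be inspected directly against the backend; this is just a sketch assuming the MySQL/MariaDB backend with the default opennebula database name, so adjust the credentials and DB name to your deployment:)

# Hypothetical check of the slow query reported above; assumes a MySQL/MariaDB
# backend and the default opennebula database name.
mysql -u oneadmin -p opennebula -e "EXPLAIN SELECT host_monitoring.body FROM host_monitoring INNER JOIN host_pool ON hid = oid WHERE (host_monitoring.last_mon_time = (SELECT MAX(last_mon_time) FROM host_monitoring AS t WHERE t.hid = host_monitoring.hid)) ORDER BY hid, host_monitoring.last_mon_time"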

I’m not running with debug enabled though…
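
(If more verbose logs would help, my understanding is that the log level is raised roughly like this; the config keys below are my reading of the stock files, so please double-check them:)

# Assumed knobs for more verbose logging; verify against your config files.
# In /etc/one/oned.conf:          DEBUG_LEVEL = 3     (0=ERROR .. 3=DEBUG)
# In /etc/one/onehem-server.conf: :debug_level: 3     (HEM's own verbosity, if present)
# Then restart the services so the change takes effect:
systemctl restart opennebula opennebula-hem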

Hello @tosaraja,

Lately we’ve been working on improving the performance of the hook driver (one_hm), which is the component publishing the events that HEM consumes. It would be great if you could apply the patched version and check whether the error keeps appearing in your environment.

Sure thing. I’ll grab it and apply it soonish (depends on when I want to cause a small break in the CI).

And it’s running. I initially created the patched driver as a new file, which then had to be given +x rights; I forgot that at first.
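
(Roughly what I did, in case someone else wants to try the patch; the path and the double restart are assumptions based on a default package install, so adjust if yours differs:)

# Rough recap of applying the patched driver; path and service names assumed
# from a standard package install where drivers live under /usr/lib/one/mads.
cp one_hm.rb /usr/lib/one/mads/one_hm.rb       # drop in the patched driver
chmod +x /usr/lib/one/mads/one_hm.rb           # the +x bit I forgot at first
systemctl restart opennebula opennebula-hem    # pick up the new driver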

Nice! Thanks for your collaboration. Let us know if the improvements solve the issue.

So far so good. Hasn’t crashed yet.

Thanks for the feedback! Just out of curiosity, do you know how frequently the error appeared before applying the improved version?

Anything from a few times a day to staying up for a week. I just went and checked, and our simple event correlator has restarted opennebula-hem at some point during the last week. Sadly I haven’t done any timestamp logging there, so I can’t say when that happened. I’ll add timestamping to it and let you know the next time this comes up :slight_smile:
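
(Something like this is what I have in mind for the restart action, just a sketch with a made-up log file name:)

# Hypothetical restart action for the event correlator: write a timestamped
# line before restarting so the next crash can be dated.
echo "$(date -Is) restarting opennebula-hem after HEM stopped handling hooks" >> /var/log/one/hem-restarts.log
systemctl restart opennebula-hem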

It seems we’ve had two crashes since last time (the 16th of March): one on the 19th of March and one on the 1st of April. So not that bad anymore, but still, my hook deserves its place :wink:

Hi @tosaraja, nice to hear that it has at least improved. It would be useful if you could provide more information about the crashes so we can keep improving the hook subsystem.

For example, it would be useful to know:

  • What crashed: the onehem service, the one_hm driver, or both?
  • Do you have any information about the system load at the moment of the crash?

Any other information you have and think might be useful is very welcome too.
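
For example, capturing something like this at the moment your watchdog restarts the service would already help (just a sketch; adapt paths and log locations to your install):

# Sketch of a crash-time data collection snippet; log paths assume a default
# install with logs under /var/log/one.
{
  date -Is
  uptime                                        # load averages at crash time
  systemctl status opennebula-hem --no-pager    # did the onehem service die?
  pgrep -af one_hm.rb                           # is the one_hm driver still running?
  tail -n 20 /var/log/one/onehem.log
  grep 'Hook driver crashed' /var/log/one/oned.log | tail -n 5
} >> /tmp/hem-crash-report.txt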

Hello!
Same problem here.
I installed OpenNebula 6.4.0 CE in a test environment.
Hooks were created and tested to confirm they worked.
Some time later I returned to the project, tried to migrate a VM, and the hooks did not work; there are no log entries about this at all.
After restarting opennebula-hem.service I repeated the migration and everything worked again.
The server has been running for 95 days with no errors such as out-of-memory.

Update, 01.05.2023:

#ss -antp|grep :21                                                                                                                                                                                                                             
LISTEN    0      100        127.0.0.1:2101        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=39))                                                                                                                                           
LISTEN    0      100        127.0.0.1:2102        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=41))                                                                                                                                           
ESTAB     0      0          127.0.0.1:59804     127.0.0.1:2101  users:(("ruby",pid=1322,fd=15))                                                                                                                                               
ESTAB     0      0          127.0.0.1:59816     127.0.0.1:2101  users:(("ruby",pid=1327,fd=19))                                                                                                                                               
ESTAB     0      0          127.0.0.1:58300     127.0.0.1:2102  users:(("ruby",pid=1322,fd=16))                                                                                                                                               
ESTAB     0      0          127.0.0.1:2101      127.0.0.1:59804 users:(("one_hm.rb",pid=953,fd=42))                                                                                                                                           
ESTAB     0      0          127.0.0.1:2102      127.0.0.1:58300 users:(("one_hm.rb",pid=953,fd=43))                                                                                                                                           
ESTAB     0      0          127.0.0.1:2101      127.0.0.1:59816 users:(("one_hm.rb",pid=953,fd=44))                                                                                                                                           

After 10 days without stopping or starting any VM, on 11.05.2023:

#ss -antp|grep :21                                                                                                                                                                                                                             
LISTEN    0      100        127.0.0.1:2101        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=39))                                                                                                                                           
LISTEN    0      100        127.0.0.1:2102        0.0.0.0:*     users:(("one_hm.rb",pid=953,fd=41))                                                                                                                                           
ESTAB     0      0          127.0.0.1:59804     127.0.0.1:2101  users:(("ruby",pid=1322,fd=15))                                                                                                                                               
ESTAB     0      0          127.0.0.1:59816     127.0.0.1:2101  users:(("ruby",pid=1327,fd=19))                                                                                                                                               
ESTAB     0      0          127.0.0.1:58300     127.0.0.1:2102  users:(("ruby",pid=1322,fd=16))                                                                                                                                               
ESTAB     0      0          127.0.0.1:2102      127.0.0.1:58300 users:(("one_hm.rb",pid=953,fd=43))                                                                                                                                           
ESTAB     0      0          127.0.0.1:2101      127.0.0.1:59816 users:(("one_hm.rb",pid=953,fd=44))                                                                                                                                           

Comparing the two outputs, one of the established connections is missing and HEM no longer works.
The OOM killer was not triggered.
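
(A crude way to automate that comparison, based on the outputs above; the threshold of 3 established connections simply matches the healthy output from 01.05.2023, so it is an assumption for this particular setup:)

# Hypothetical liveness check: count the ESTABLISHED connections held by
# one_hm.rb and restart the HEM service when one of them has disappeared.
ESTAB=$(ss -antp | grep one_hm.rb | grep -c ESTAB)
if [ "$ESTAB" -lt 3 ]; then
    echo "$(date -Is) only $ESTAB one_hm.rb connections left, restarting opennebula-hem"
    systemctl restart opennebula-hem
fi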