Since upgrading from 5.6.2 to 5.8.0, we have been experiencing high CPU usage on oned threads. After approximately 24 hours, 2 threads are stuck at 100%, and after 48 hours even more threads are stuck at 100%.
When I run strace on those threads, I can see them connecting to what I guess is the XML-RPC port: there are some HTTP headers, followed by a lot of connection timeouts.
If we restart OpenNebula, we are fine for approximately 24 hours.
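For anyone who wants to look at the same thing, here is a rough sketch of how the busy threads can be picked out before attaching strace to them. It assumes a Linux host with the procps `ps`; the function names and the 50% threshold are only illustrative. Note that `ps` reports %CPU averaged over the lifetime of the thread, so a thread that only recently got stuck shows less than 100% here (`top -H -p <PID>` gives the instantaneous view).

```python
import subprocess


def parse_ps_threads(ps_output):
    """Parse lines of 'TID %CPU COMMAND' into (tid, cpu_percent, name) tuples."""
    threads = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        try:
            tid = int(parts[0])
            cpu = float(parts[1])
        except ValueError:
            continue  # skip a header or any other non-numeric line
        name = parts[2].strip() if len(parts) > 2 else ""
        threads.append((tid, cpu, name))
    return threads


def busy_threads(ps_output, threshold=50.0):
    """Return threads at or above the threshold, busiest first."""
    hot = [t for t in parse_ps_threads(ps_output) if t[1] >= threshold]
    return sorted(hot, key=lambda t: t[1], reverse=True)


def list_busy_threads(pid, threshold=50.0):
    """Run ps for one process (e.g. the oned PID) and list its busy threads."""
    # Separate -o options are used so that each empty header is unambiguous.
    result = subprocess.run(
        ["ps", "-L", "-o", "tid=", "-o", "pcpu=", "-o", "comm=", "-p", str(pid)],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return busy_threads(result.stdout, threshold)
```

Each thread ID returned by `list_busy_threads(<oned PID>)` can then be passed to `strace -p <TID>` to trace just that thread.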
I’ve changed two timeout-related keys in oned.conf, KEEPALIVE_TIMEOUT and TIMEOUT, so we now have:
MAX_CONN = 240
MAX_CONN_BACKLOG = 480
#KEEPALIVE_TIMEOUT = 15
KEEPALIVE_TIMEOUT = 30
#KEEPALIVE_MAX_CONN = 30
#TIMEOUT = 15
TIMEOUT = 30
I don’t know yet whether this helps. Is it a good idea?
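To double-check which values are actually active once the commented-out lines are ignored, a small sketch like the following can be used. It only understands flat `KEY = VALUE` lines like the ones above, not the bracketed multi-line sections that oned.conf also contains, and the function name is hypothetical.

```python
def effective_settings(conf_text):
    """Return the active flat KEY = VALUE pairs, ignoring '#' comments."""
    settings = {}
    for raw in conf_text.splitlines():
        # Drop everything after a '#', which also removes fully commented lines.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key and value:
            settings[key] = value  # a later assignment overrides an earlier one
    return settings


snippet = """\
MAX_CONN = 240
MAX_CONN_BACKLOG = 480
#KEEPALIVE_TIMEOUT = 15
KEEPALIVE_TIMEOUT = 30
#KEEPALIVE_MAX_CONN = 30
#TIMEOUT = 15
TIMEOUT = 30
"""

for key, value in effective_settings(snippet).items():
    print(key, "=", value)
```

With the snippet from this post, KEEPALIVE_MAX_CONN does not appear in the result because its only line is commented out, so oned would fall back to whatever its built-in default is.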
Next, I will try an onedb purge to remove old VMs in the done state and to clean up long histories.
In oned.log, I only see some slow queries being detected, mostly ones replacing values in vm_pool. I don’t know if that is related. There is nothing about connection timeouts, though.
Are there any other leads I could follow?
Best regards,
Edouard
Versions of the related components and OS (frontend, hypervisors, VMs):
OpenNebula 5.8.0 on CentOS 7
1681 VMs
Steps to reproduce:
It was fine with 5.6.2. The problem appeared after upgrading to 5.8.0.
Current results:
oned threads stuck at 100% CPU
Expected results:
No threads stuck at 100%