Running OpenNebula 5.4 on 5 combined hosts: all act as both front-end and hypervisor nodes. The hosts are managed by FreeIPA. Raft is configured and functioning properly.
DB backend is PostgreSQL
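For what it’s worth, this is roughly how the Raft/leader state can be checked on each host (a sketch; the default zone ID 0 is assumed):

```
# show the HA/Raft status of the zone; the SERVERS section lists each
# front-end with its state (leader/follower), term and log index
onezone show 0
```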
Running any XML-RPC request, either via the CLI ‘one*’ commands or in the UI, results in Net::ReadTimeout.
Resuming a VM results in a timeout, but the ‘one.vm.info’ method completes successfully some time after the timeout, according to /var/log/one/oned.log.
Sometimes the VM is resumed after all, but mostly requests just time out.
Sometimes rebooting the leader node results in the VM starting automatically right after the new leader is elected.
upd: restarting the opennebula service results in the node functioning properly. What could be causing oned to execute requests this slowly?
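For reference, the timings above come from comparing request timestamps in oned.log, roughly like this (a sketch; ‘one.vm.info’ is just the method from the example above):

```
# oned logs each XML-RPC call when it is invoked and again when it returns
# (with the default request-log settings), so the gap between the two
# timestamps for the same request shows how long the call actually took
# server-side, independent of the client-side Net::ReadTimeout
grep 'one.vm.info' /var/log/one/oned.log | tail -n 20
```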
I’m sorry, but we are not supporting 5.4 anymore (at least from the community side). My suggestion is to update your OpenNebula version (we are currently at 6.10) and check whether the problem persists.
There is also a comprehensive article about XML-RPC in our documentation.
I’d be glad to update, but unfortunately I can’t do that.
I have no /var/log/one/monitor.log on any of the hosts
The XML-RPC timeout is set to 0 in /etc/one/oned.conf; shouldn’t that actually mean no timeout?
Increasing the timeout to about 3 minutes would help, according to the logs, but Sunstone would still fail to execute the commands.
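A rough way to verify that figure, assuming the CLI honors the ONE_XMLRPC_TIMEOUT environment variable:

```
# raise the client-side timeout for a single read-only call and time it;
# if oned really needs ~3 minutes, this should succeed where the default
# timeout makes the CLI fail with Net::ReadTimeout
time ONE_XMLRPC_TIMEOUT=300 onevm list
```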
Could that be caused by a corrupted DB on some or all of the hosts? I can’t use onedb fsck or anything else, since the cluster is running on PostgreSQL and onedb doesn’t seem to support it.
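The closest substitute for onedb I can think of is inspecting the tables directly with psql (a rough sketch; the database name, the user, and the logdb table name for the Raft log are assumptions, adjust to your schema):

```
# list all OpenNebula tables with their on-disk sizes; a very large Raft
# log table can be one reason for a sluggish oned in an HA setup
psql -U oneadmin -d opennebula -c '\dt+'

# row count of the Raft log table (name assumed), to see whether log
# purging is keeping up
psql -U oneadmin -d opennebula -c 'SELECT COUNT(*) FROM logdb;'
```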
upd:
Observing weird behavior:
With all 5 hosts up, any request results in a timeout.
Stopping opennebula and unicorn-opennebula (the latter is probably unnecessary) on the leader node and switching to a new leader makes requests complete right away. Some time later, all requests time out again. Starting the previously stopped services doesn’t change anything; repeating the same process on the new leader solves the issue for another minute.
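For completeness, the exact sequence used to force the leader switch (assuming opennebula and unicorn-opennebula are systemd units, as on these hosts):

```
# on the current leader: stop oned (and the unicorn-backed Sunstone) so
# the remaining front-ends elect a new leader
sudo systemctl stop opennebula unicorn-opennebula

# on any other front-end: confirm that a new leader has been elected
onezone show 0

# requests then complete right away, for roughly a minute, after which
# everything times out again
onevm list
```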
No, ONE_XMLRPC_TIMEOUT=0 means an immediate timeout, i.e. a timeout after 0 seconds. The default value for that parameter is 30 seconds.
So could you please try setting it to e.g. 30 on your leader FE, restart the OpenNebula service (systemctl restart opennebula.service), and check whether it helps to solve the issue?
If it helps then apply the same changes on all your other HA FE nodes.
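A minimal sketch of those steps on the leader FE (the parameter may appear under a different name depending on your configuration; it is whichever value is currently set to 0, be it in /etc/one/oned.conf or ONE_XMLRPC_TIMEOUT for the CLI):

```
# 1. change the XML-RPC timeout from 0 to 30 seconds
sudo vi /etc/one/oned.conf

# 2. restart OpenNebula on the leader front-end
sudo systemctl restart opennebula.service

# 3. check whether CLI requests now complete instead of timing out
onevm list

# 4. if it helps, repeat the same change on the remaining HA front-ends
```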