XML-RPC requests time out

Running OpenNebula 5.4 on 5 combined hosts: all act as both front-end and hypervisor nodes. Managed by FreeIPA. RAFT is configured and functioning properly.
DB backend is PostgreSQL.

Running any XML-RPC request, either via the CLI ‘one*’ commands or in the UI, results in Net::ReadTimeout.

Resuming a VM results in a timeout, but the ‘one.vm.info’ method completes successfully some time after the timeout, according to /var/log/one/oned.log.
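To compare the time of the client-side timeout with the time the call actually finished, it can help to pull only the relevant lines out of the log. A minimal sketch, assuming the method name appears literally in the log lines as described above; `show_method_log` is a hypothetical helper name, not an OpenNebula command:

```shell
# Hypothetical helper: show the most recent log lines that mention a method.
# Arguments: method name, log path (default /var/log/one/oned.log), line count.
show_method_log() {
  method=$1
  logfile=${2:-/var/log/one/oned.log}
  count=${3:-20}
  # -F treats the method name as a fixed string, so the dots are literal.
  grep -F -- "$method" "$logfile" | tail -n "$count"
}
```

For example, `show_method_log one.vm.info` prints the last 20 matching lines, whose timestamps can then be compared with the moment the CLI reported Net::ReadTimeout.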

Sometimes the VM is resumed after all, but mostly the requests just time out.
Sometimes rebooting the leader node results in the VM starting automatically right after the new leader is elected.

upd: restarting the opennebula service results in the node functioning properly. What could be causing oned to execute requests this slowly?

Hello @Vitaly_Varyvdin,

I’m sorry, but we are not supporting 5.4 anymore (at least from the community side). My suggestion is to update your OpenNebula version (we are currently at 6.10) and check whether the problem persists.

Also, there’s a comprehensive article about XML-RPC in our documentation.

Cheers,

Do you have any messages about slow DB queries in /var/log/one/monitor.log, like the ones described in another thread on this forum?

Have you tried executing the same commands with an increased value for the ONE_XMLRPC_TIMEOUT environment variable?
For example:

ONE_XMLRPC_TIMEOUT=60 onevm list

But as @FrancJP suggested, the first thing to do is to upgrade your OpenNebula.
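To see roughly which client-side timeout is enough, the same command can be tried with several values in a row. A rough sketch, assuming only that the CLI reads ONE_XMLRPC_TIMEOUT from the environment as in the example above; `try_timeouts` is a made-up helper name:

```shell
# Hypothetical helper: run a command once per timeout value and report
# whether it succeeded and how long it took.
# Usage: try_timeouts "30 60 180" onevm list
try_timeouts() {
  timeouts=$1
  shift
  # $timeouts is left unquoted on purpose so it splits into separate values.
  for t in $timeouts; do
    start=$(date +%s)
    if ONE_XMLRPC_TIMEOUT=$t "$@" >/dev/null 2>&1; then
      status=ok
    else
      status=failed
    fi
    end=$(date +%s)
    echo "timeout=${t}s status=$status elapsed=$((end - start))s"
  done
}
```

The command output is discarded; only the exit status and the elapsed time are reported, which is enough to tell whether a longer timeout makes the request go through.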

I’d be glad to update, but unfortunately I can’t do that.
I have no /var/log/one/monitor.log on any of the hosts.

The XML-RPC timeout is set to 0 in /etc/one/oned.conf; shouldn’t that actually mean no timeout?
Increasing the timeout to about 3 minutes would help, according to the logs, but Sunstone would still fail to execute the commands.

Could that be due to a corrupted DB on some or all of the hosts? I can’t use onedb fsck or anything else, since the cluster is running on PostgreSQL and onedb doesn’t seem to support it.

upd:
Observing weird behavior:
With all 5 hosts on, any request results in a timeout.
After stopping opennebula and unicorn-opennebula (the latter is probably unnecessary) on the leader node and switching to a new leader, requests complete right away. Some time later, all requests time out again. Starting the previously stopped services doesn’t change anything; repeating the same process on the new leader solves the issue for another minute.

No, ONE_XMLRPC_TIMEOUT=0 means an immediate timeout, i.e. after 0 seconds. The default value for that parameter is 30 seconds.
So could you please try setting it to e.g. 30 on your leader FE, restart the OpenNebula service (systemctl restart opennebula.service) and check whether that solves the issue?
If it helps, apply the same change on all your other HA FE nodes.
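Before and after changing anything, it may be worth confirming what each FE node currently has configured. A small sketch, assuming the server-side setting is the TIMEOUT attribute in /etc/one/oned.conf (the attribute name is an assumption here, so check it against your own file); `show_xmlrpc_timeout` is a hypothetical helper name:

```shell
# Hypothetical helper: print TIMEOUT lines (active or commented out)
# from an oned.conf-style file, with their line numbers.
show_xmlrpc_timeout() {
  conf=${1:-/etc/one/oned.conf}
  # Anchored at line start so names such as KEEPALIVE_TIMEOUT do not match.
  grep -nE '^[[:space:]]*#?[[:space:]]*TIMEOUT[[:space:]]*=' "$conf"
}
```

Running `show_xmlrpc_timeout` on every node makes it easy to spot a node whose value differs from the others.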

If you meant there are technical reasons preventing the upgrade, what are they?