For rollback's sake I prepared a brand new ONE FE 6.0, restored the production DB into it, and upgraded it step by step to 6.4 without problems. After that I am no longer able to see VM logs: they are empty, and monitor.log is full of:
[Z0][MDP][W]: Failed to monitor VM state for host X: Error executing state.rb: database is locked.
All nodes (KVM) are updated to 6.4 and all hosts are force-synced. The FE is in stand-alone mode, OS Ubuntu 20.04.
“virsh …” can be called by oneadmin.
There is an SQLite database on the virtualization node; the monitoring information is cached there. There might be an issue with the DB file itself. It should look like this:
oneadmin@ubuntu2004-kvm-qcow2-6-4-rpZvf-2:~$ sqlite3 /var/tmp/one/im/status_kvm_1.db 'select * from states'
3320236b-56c8-4481-bd98-10602c2806bb|62|one-62|3320236b-56c8-4481-bd98-10602c2806bb|1654197356|0|RUNNING|kvm
having one row per VM. You can delete this database on the host and the monitoring service should recreate it on its own.
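A minimal reset sketch, assuming the same file name as above (the status_kvm_1.db name may differ on your host), run as oneadmin on the hypervisor node:

# Inspect the cached VM states; one row per VM is expected
sqlite3 /var/tmp/one/im/status_kvm_1.db 'select * from states'

# Remove the cache; the monitoring probes should recreate it on their own
rm -f /var/tmp/one/im/status_kvm_1.db

# A few seconds later the file should exist again
ls -l /var/tmp/one/im/status_kvm_1.db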
I can confirm that there is such a database: it can be read, contains one VM per row, and when I delete it, monitoring re-creates it immediately.
Unfortunately the result stays the same: still the same error message, with one change. Right after deleting the DB file, I got:
Failed to monitor VM state for host X: Error executing state.rb: attempt to write a readonly database
for a few moments, and then the “database is locked” error again.
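Both errors pointed at some other process holding the DB file open, or at wrong ownership. A quick check one can run on the hypervisor node (assuming lsof is installed; fuser -v works as well):

# Which processes currently have the DB file open?
lsof /var/tmp/one/im/status_kvm_1.db

# The probes run as oneadmin, so the file and its directory
# must be owned by and writable for oneadmin
ls -l /var/tmp/one/im/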
There were overlooked monitoring processes on the hosts accessing the same DB, but with a bad ONE FE address (a leftover from the old FE HA cluster). After killing all their process groups, the error messages stopped coming.
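For anyone hitting the same issue, this is roughly how the stale processes can be hunted down (the pgrep pattern is an assumption; adjust it to whatever your probe processes look like):

# List everything running out of the probe directory; stale
# monitor processes pointing at the old FE show up here too
pgrep -af '/var/tmp/one'

# Kill a whole process group so the respawning children die with
# the parent: look up the PGID of a stale process, then signal
# the negative PGID
PID=12345   # example: a stale probe PID found above
PGID=$(ps -o pgid= -p "$PID" | tr -d ' ')
kill -- "-$PGID"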
What steps have you performed? As a summary of this thread I recommend the following:
Delete the SQLite database located on the Hypervisor node in /var/tmp/one (you can delete the entire folder if you wish; this folder will be recreated at sync time).
Make sure to kill all existing monitor processes on the Hypervisor node. Multiple processes running at the same time may be due to a recent upgrade or an improper restart of the service. A combined sketch of both steps follows below.
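Roughly, and assuming default paths (the pkill pattern is an assumption; double-check the onehost invocation against your version's CLI):

# On the hypervisor node: stop any running probe processes
pkill -f '/var/tmp/one'

# Remove the probe directory, including the SQLite cache;
# it is recreated at sync time
rm -rf /var/tmp/one

# On the Front-end: push the probes back to the host
# (replace 0 with your host ID or name from `onehost list`)
onehost sync 0 --force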
Hi, indeed, right after deleting the sqlite3 file it gets recreated.
I tried restarting OpenNebula and the running processes many times; the result stayed the same. I actually pinpointed that the state.rb probe is the one causing the errors in oned.log, but I don’t know why.
By “multiple running processes at the same time”, do you mean I get the “database is locked” error because my processes are trying to read from the database simultaneously?
How can I kill the state.rb process, when something starts the state.rb script as a new process every 3-4 seconds?