VMs in UNKNOWN state after host error

Hello everyone,

I have built an OpenNebula cluster in an HA configuration with three KVM hosts, and it was working well. However, I recently encountered some errors: when I checked, all the VMs on the third KVM host were in the ‘Unknown’ state. The logs showed the following lines:

On the leader frontend, /var/log/one/monitor.log:

Fri Oct 20 15:34:17 2023 [Z0][HMM][I]: Successfully monitored VM: 20
Fri Oct 20 15:34:17 2023 [Z0][HMM][I]: Successfully monitored VM: 22
Fri Oct 20 15:34:18 2023 [Z0][HMM][D]: Monitoring host kvmhost03(2)
Fri Oct 20 15:34:18 2023 [Z0][MDP][I]: Command execution failed (exit code: 24): exec 2>/dev/null; rsync -Laz --delete '/var/lib/one/remotes'/ 'kvmhost03':'/var/tmp/one'/
Fri Oct 20 15:34:18 2023 [Z0][MDP][W]: Start monitor failed for host 2: Could not update remotes
Fri Oct 20 15:34:18 2023 [Z0][HMM][E]: Unable to monitor host id: 2
Fri Oct 20 15:34:18 2023 [Z0][HMM][D]: Updated Host 2, state ERROR
Fri Oct 20 15:34:28 2023 [Z0][HMM][I]: Successfully monitored VM: 24
Fri Oct 20 15:34:28 2023 [Z0][HMM][I]: Successfully monitored VM: 25
Fri Oct 20 15:34:28 2023 [Z0][HMM][I]: Successfully monitored VM: 32
Fri Oct 20 15:34:28 2023 [Z0][HMM][I]: Successfully monitored VM: 33
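
Since the probe command discards stderr (exec 2>/dev/null), the failing rsync can be re-run by hand from the leader to see the real error (a sketch, run as oneadmin; rsync exit code 24 normally means ‘partial transfer due to vanished source files’, so it may be masking an SSH or permission problem):

# As oneadmin on the leader frontend: test SSH first, then re-run the
# probe's rsync without the 2>/dev/null so the actual error is visible
ssh kvmhost03 true && echo "ssh ok"
rsync -Laz --delete '/var/lib/one/remotes'/ 'kvmhost03':'/var/tmp/one'/
echo "rsync exit code: $?"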

On the leader frontend, /var/log/one/oned.log:

Fri Oct 20 15:34:15 2023 [Z0][ReM][D]: Req:3312 UID:0 IP:127.0.0.1 one.grouppool.info invoked
Fri Oct 20 15:34:15 2023 [Z0][ReM][D]: Req:3312 UID:0 one.grouppool.info result SUCCESS, "<GROUP_POOL><GROUP><..."
Fri Oct 20 15:34:16 2023 [Z0][ReM][D]: Req:8992 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Fri Oct 20 15:34:16 2023 [Z0][ReM][D]: Req:8992 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>3</..."
Fri Oct 20 15:34:18 2023 [Z0][ReM][D]: Req:8752 UID:0 IP:127.0.0.1 one.hostpool.info invoked
Fri Oct 20 15:34:18 2023 [Z0][ReM][D]: Req:8752 UID:0 one.hostpool.info result SUCCESS, "<HOST_POOL><HOST><ID..."
Fri Oct 20 15:34:18 2023 [Z0][ONE][E]: Error monitoring Host kvmhost03 (2):
Fri Oct 20 15:34:19 2023 [Z0][ReM][D]: Req:8384 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Fri Oct 20 15:34:19 2023 [Z0][ReM][D]: Req:8384 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>3</..."
Fri Oct 20 15:34:19 2023 [Z0][ReM][D]: Req:3456 UID:0 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1
Fri Oct 20 15:34:19 2023 [Z0][ReM][D]: Req:3456 UID:0 one.vmpool.infoextended result SUCCESS, "<VM_POOL><VM><ID>261..."
Fri Oct 20 15:34:19 2023 [Z0][ReM][D]: Req:3760 UID:0 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1
Fri Oct 20 15:34:19 2023 [Z0][ReM][D]: Req:3760 UID:0 one.vmpool.infoextended result SUCCESS, "<VM_POOL><VM><ID>261..."

On the follower server-2, /var/log/one/monitor.log:

Fri Oct 20 14:43:51 2023 [Z0][MDP][W]: Failed to monitor VM state for host 2: Error executing state.rb: database is locked
Fri Oct 20 14:44:56 2023 [Z0][MDP][W]: Failed to monitor VM state for host 2: Error executing state.rb: database is locked
Fri Oct 20 14:53:40 2023 [Z0][HMM][I]: --Mark--
Fri Oct 20 15:03:40 2023 [Z0][HMM][I]: --Mark--
Fri Oct 20 15:13:40 2023 [Z0][HMM][I]: --Mark--
Fri Oct 20 15:21:14 2023 [Z0][MDP][W]: Failed to monitor VM state for host 2: Error executing state.rb: attempt to write a readonly database
Fri Oct 20 15:21:14 2023 [Z0][MDP][W]: Failed to monitor VM state for host 0: Error executing state.rb: attempt to write a readonly database
Fri Oct 20 15:23:40 2023 [Z0][HMM][I]: --Mark--
Fri Oct 20 15:33:40 2023 [Z0][HMM][I]: --Mark--
Fri Oct 20 15:34:21 2023 [Z0][HMM][D]: Updated Host 2, state ERROR
Fri Oct 20 15:37:34 2023 [Z0][HMM][D]: Updated Host 2, state MONITORED
Fri Oct 20 15:43:40 2023 [Z0][HMM][I]: --Mark--
Fri Oct 20 15:53:40 2023 [Z0][HMM][I]: --Mark--

On the follower server-2, /var/log/one/oned.log:

Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 74 not in ACTIVE state
Fri Oct 20 15:34:09 2023 [Z0][InM][D]: VM_STATE update from host: 0. VM id: 89, state: POWEROFF
Fri Oct 20 15:34:09 2023 [Z0][InM][D]: VM_STATE update from host: 0. VM id: 8, state: POWEROFF
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][InM][D]: VM_STATE update from host: 0. VM id: 4, state: POWEROFF
Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 80 not in ACTIVE state
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 83 not in ACTIVE state
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 85 not in ACTIVE state
Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 89 not in ACTIVE state
Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 8 not in ACTIVE state
Fri Oct 20 15:34:09 2023 [Z0][DiM][E]: poweroff_success action received but VM 14 not in ACTIVE state
Fri Oct 20 15:34:18 2023 [Z0][ReM][D]: Req:4608 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Fri Oct 20 15:34:18 2023 [Z0][ReM][D]: Req:4608 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Fri Oct 20 15:34:18 2023 [Z0][ReM][D]: Req:9664 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Fri Oct 20 15:34:18 2023 [Z0][ReM][D]: Req:9664 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Fri Oct 20 15:34:21 2023 [Z0][InM][D]: VM_STATE update from host: 2. VM id: 36, state: RUNNING
Fri Oct 20 15:34:21 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:21 2023 [Z0][InM][D]: VM_STATE update from host: 2. VM id: 46, state: RUNNING
Fri Oct 20 15:34:21 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:21 2023 [Z0][DBM][E]: Tried to modify DB being a follower
Fri Oct 20 15:34:21 2023 [Z0][InM][D]: VM_STATE update from host: 2. VM id: 239, state: RUNNING
Fri Oct 20 15:34:21 2023 [Z0][DBM][E]: Tried to modify DB being a follower

Zone info (onezone show):

ZONE 0 INFORMATION
ID                : 0
NAME              : opennebula.zone
STATE             : ENABLED

ZONE SERVERS
ID NAME            ENDPOINT
 2 server-2        http://192.168.0.1:2633/RPC2
 3 server-0        http://192.168.0.2:2633/RPC2
 4 server-1        http://192.168.0.3:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME            STATE      TERM       INDEX      COMMIT     VOTE  FED_INDEX 
 2 server-2        follower   1575       660243     660243     3     -1
 3 server-0        leader     1575       660243     660243     3     -1
 4 server-1        follower   1575       660243     660243     -1    -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"

Host list:

ID NAME      CLUSTER    TVM      ALLOCATED_CPU      ALLOCATED_MEM    STAT
 2 kvmhost03 default      8      120 / 480 (25%)     10G / 32G (31%) on  
 1 kvmhost02 default     10      480 / 720 (66%)     16G / 32G (50%) on  
 0 kvmhost01 default     10      520 / 720 (72%)     15G / 32G (46%) on  

"Note: The frontend server-2 is running on the same server as kvmhost03.

I have a few questions:

  1. Why was kvmhost03 marked as ‘ERROR’? I would like more detail on the cause (for example, whether it was a network timeout).
  2. Why can’t the VMs automatically recover their state once the host returns to the ‘ON’ state?
  3. Why are there log entries like ‘poweroff_success action received but VM 83 not in ACTIVE state’, even though VM 83 is running on kvmhost01 and no ‘poweroff’ action was issued?
  4. Occasionally I see the log entry ‘Failed to monitor VM state for host 2: Error executing state.rb: attempt to write a readonly database’, but everything seems to work normally, so I haven’t paid much attention to it. Is it really harmless? (A permission check I could run is sketched after this list.)
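
For question 4, here is the check I have in mind for the affected node (a sketch; I'm assuming the probes' local cache, including whatever SQLite file state.rb writes to, lives under /var/tmp/one, since that is the directory the frontend syncs to):

# On server-2 / kvmhost03: check ownership and permissions of the
# monitor probes' working directory and any SQLite files inside it
# (assumed location: the probes' state cache under /var/tmp/one)
ls -ld /var/tmp/one
find /var/tmp/one -name '*.db' -exec ls -l {} +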

I’ve read this thread: Host ERROR state and VM's UNKNOWN.
However, I’m using Ubuntu and don’t want to restart libvirtd. Is there a safer alternative? For now, using the ‘recover’ option in Sunstone still works.
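
For reference, the same recover operation is also available from the CLI (a sketch; I'm assuming --retry here, which retries the last failed action — check onevm recover --help on 6.6 for the exact modes):

# As oneadmin on the frontend: recover a VM stuck in UNKNOWN
# (83 is a hypothetical VM ID; --retry is one of several recover modes)
onevm recover --retry 83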

Thanks for your help.

Versions of the related components and OS (frontend, hypervisors, VMs): server OS: Ubuntu 22.04; frontend: OpenNebula CE 6.6.3; hypervisors: KVM.

Hi @Khang_Nguyen_Phuc

For this error:

Fri Oct 20 15:21:14 2023 [Z0][MDP][W]: Failed to monitor VM state for host 2: Error executing state.rb: attempt to write a readonly database

Could you try deleting the /var/tmp/one folder on your hypervisor node? Then, from your frontend node, run the onehost sync --force command as the oneadmin user to recreate your remotes.
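
Something like this (a sketch; I'm assuming the default oneadmin setup and that kvmhost03 is the affected hypervisor):

# 1. On the hypervisor node (kvmhost03), as oneadmin (or root if the
#    ownership got mixed up): remove the stale remotes copy
rm -rf /var/tmp/one

# 2. On the frontend (the current leader), as oneadmin: push a fresh
#    copy of the remotes to the hosts
onehost sync --force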

Best,
Víctor.

Thank you very much for your feedback!
Are running VMs affected by deleting the remotes folder? Also, is it possible for the “onehost sync -f” command to encounter errors during execution? If it does and the remotes folder isn’t recreated, what should I do? And can I run “onehost sync” without deleting the remotes folder first?
Thank you and best regards