OneFlow stuck in Warning

Hi,

we have an OpenNebula 5.12.0.3 on top of Ubuntu physical nodes.
Last week, a OneFlow service with four different roles went into WARNING because one of its roles (SGE_ubuntu_standard) was in the WARNING state. There are no running VMs in the service.

=> How can I check why the service / role is in WARNING? The only weird thing I see in the logs is SGE_ubuntu_standard's cardinality being set to -1.

I am unsure whether the 6.0 docs apply to our 5.12, but in the service life-cycle I don’t see a way out of that state. I was hoping I could recover the service, but that’s not the case:

oneflow recover 18
Service cannot be recovered in state: WARNING

=> How can I clear the WARNING state for that service?

Yours,
Steffen

oneadmin@stratus:~$ oneflow show 18 
SERVICE 18 INFORMATION                                                          
ID                  : 18                  
NAME                : SGE                 
USER                : oneadmin            
GROUP               : oneadmin            
STRATEGY            : none                
SERVICE STATE       : WARNING             

PERMISSIONS                                                                     
OWNER               : um-                 
GROUP               : ---                 
OTHER               : ---                 

ROLE SGE_ubuntu_small
ROLE STATE          : RUNNING             
PARENTS             : SGE_master          
VM TEMPLATE         : 24                  
CARDINALITY         : 0                   
MIN VMS             : 0                   
MAX VMS             : 40                  

NODES INFORMATION
 VM_ID NAME                     USER            GROUP          

ROLE SGE_ubuntu_standard
ROLE STATE          : WARNING             
PARENTS             : SGE_master          
VM TEMPLATE         : 25                  
CARDINALITY         : 0                   
MIN VMS             : 0                   
MAX VMS             : 12                  

NODES INFORMATION
 VM_ID NAME                     USER            GROUP          

ROLE SGE_suse_small
ROLE STATE          : RUNNING             
PARENTS             : SGE_master          
VM TEMPLATE         : 26                  
CARDINALITY         : 0                   
MIN VMS             : 0                   
MAX VMS             : 40                  

NODES INFORMATION
 VM_ID NAME                     USER            GROUP          

ROLE SGE_suse_standard
ROLE STATE          : RUNNING             
PARENTS             : SGE_master          
VM TEMPLATE         : 27                  
CARDINALITY         : 0                   
MIN VMS             : 0                   
MAX VMS             : 12                  

NODES INFORMATION
 VM_ID NAME                     USER            GROUP          

ROLE SGE_master
ROLE STATE          : RUNNING             
VM TEMPLATE         : 42                  
CARDINALITY         : 0                   

NODES INFORMATION
 VM_ID NAME                     USER            GROUP          

LOG MESSAGES                                                                    
04/23/21 09:56 [I] Role SGE_ubuntu_standard scaling up from 0 to 1 nodes
04/23/21 09:56 [I] New state: SCALING
04/23/21 09:57 [I] New state: COOLDOWN
04/23/21 09:57 [I] New state: RUNNING
04/23/21 15:55 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes
04/25/21 12:14 [I] Role SGE_ubuntu_standard scaling up from 0 to 1 nodes
04/25/21 12:14 [I] New state: SCALING
04/25/21 12:14 [I] New state: COOLDOWN
04/25/21 12:14 [I] New state: RUNNING
04/25/21 14:16 [I] Role SGE_ubuntu_standard scaling up from 1 to 2 nodes
04/25/21 14:16 [I] New state: SCALING
04/25/21 14:17 [I] New state: COOLDOWN
04/25/21 14:17 [I] New state: RUNNING
04/25/21 15:55 [I] Role SGE_ubuntu_standard scaling down from 2 to 1 nodes
04/25/21 15:56 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes
04/25/21 16:58 [I] Role SGE_ubuntu_standard scaling up from 0 to 1 nodes
04/25/21 16:58 [I] New state: SCALING
04/25/21 16:58 [I] New state: COOLDOWN
04/25/21 16:58 [I] New state: RUNNING
04/25/21 16:59 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes
04/25/21 17:09 [I] Role SGE_ubuntu_standard scaling up from 0 to 10 nodes
04/25/21 17:09 [I] New state: SCALING
04/25/21 17:10 [I] New state: COOLDOWN
04/25/21 17:10 [I] New state: RUNNING
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 10 to 9 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 9 to 8 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 8 to 7 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 7 to 6 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 6 to 5 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 5 to 4 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 4 to 3 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 3 to 2 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 2 to 1 nodes
04/26/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes
04/27/21 08:35 [I] Role SGE_ubuntu_standard scaling up from 0 to 1 nodes
04/27/21 08:35 [I] New state: SCALING
04/27/21 08:36 [I] New state: COOLDOWN
04/27/21 08:36 [I] New state: RUNNING
04/27/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes
04/27/21 10:43 [I] Role SGE_ubuntu_standard scaling up from 0 to 1 nodes
04/27/21 10:43 [I] New state: SCALING
04/27/21 10:44 [I] New state: COOLDOWN
04/27/21 10:44 [I] New state: RUNNING
04/28/21 09:56 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes
04/29/21 09:42 [I] Role SGE_ubuntu_standard scaling up from 0 to 1 nodes
04/29/21 09:42 [I] New state: SCALING
04/29/21 09:42 [I] New state: COOLDOWN
04/29/21 09:42 [I] New state: RUNNING
04/29/21 10:46 [I] New state: WARNING
04/29/21 21:56 [I] Role SGE_ubuntu_standard scaling down from 1 to 0 nodes

/var/log/one/oneflow.log:

Sun Apr 25 06:26:40 2021 [I]: [AE] Checking policies for service: 18
Sun Apr 25 06:26:57 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is DONE
Sun Apr 25 06:26:57 2021 [I]: [WD] Update 18:SGE_ubuntu_standard cardinality to -1
...
Sun Apr 25 15:15:08 2021 [I]: [AE] Checking policies for service: 18
Sun Apr 25 15:15:26 2021 [I]: [WD] Running 18: SGE_ubuntu_standard is ACTIVE
Sun Apr 25 15:15:56 2021 [I]: [WD] Running 18: SGE_ubuntu_standard is ACTIVE
Sun Apr 25 15:16:26 2021 [I]: [WD] Running 18: SGE_ubuntu_standard is ACTIVE
...
Thu Apr 29 10:45:44 2021 [I]: [AE] Checking policies for service: 18
Thu Apr 29 10:45:48 2021 [I]: [WD] Running 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:46:18 2021 [I]: [WD] Running 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:46:28 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:46:58 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:47:28 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:47:58 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:48:28 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:48:58 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:49:28 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is ACTIVE
Thu Apr 29 10:49:50 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is POWEROFF
Thu Apr 29 10:50:20 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is POWEROFF
Thu Apr 29 10:50:50 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is POWEROFF
...
Thu Apr 29 21:55:20 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is POWEROFF
Thu Apr 29 21:55:50 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is POWEROFF
Thu Apr 29 21:56:20 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is POWEROFF
Thu Apr 29 21:56:51 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is DONE
Thu Apr 29 21:56:51 2021 [I]: [WD] Update 18:SGE_ubuntu_standard cardinality to 0
Thu Apr 29 21:57:21 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is DONE
Thu Apr 29 21:57:21 2021 [I]: [WD] Update 18:SGE_ubuntu_standard cardinality to -1
Thu Apr 29 21:57:51 2021 [I]: [WD] Warning 18: SGE_ubuntu_standard is DONE
Thu Apr 29 21:57:51 2021 [I]: [WD] Update 18:SGE_ubuntu_standard cardinality to -1


...

Hello @sneumann,

The WARNING state is a special state indicating that something in the service is not healthy; for example, a VM may be in the UNKNOWN state. You need to check the service's VMs in order to find the problem and resolve it. After that, the service should be back in the RUNNING state.
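A minimal sketch of that check from the front-end, assuming the standard `onevm list` column layout (ID NAME USER GROUP STAT ...); the VM IDs and names below are made up for illustration, and the sample is a captured heredoc so the filter can be shown without a live cluster:

```shell
# On a real front-end you would filter live output, e.g.:
#   onevm list --no-header | awk '$5 != "runn"'
# Here we apply the same awk filter to a captured sample
# (states use onevm's short codes: runn, poff, unkn, ...):
onevm_sample='  101 sge-worker-0  oneadmin oneadmin runn
  102 sge-worker-1  oneadmin oneadmin poff
  103 sge-worker-2  oneadmin oneadmin unkn'
echo "$onevm_sample" | awk '$5 != "runn" {print $1, $5}'
# → 102 poff
# → 103 unkn
```

Any VM this surfaces (POWEROFF, UNKNOWN, etc.) is a candidate cause of the role's WARNING; inspect it with `onevm show <ID>` and resolve or terminate it.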

Best,
Álex.

Hi Álex, thanks for the reply. Actually, there were no service VMs remaining; oneflow show showed all roles at cardinality 0. So with no VMs there was nothing to check for UNKNOWN et al. IIRC I did check onevm list and saw nothing unusual.
We have experienced this weird state a few times, but couldn’t yet reproduce it to open a proper GitHub issue. The only solution in our case is to undeploy and redeploy the service.
Anyway, we’re good now, and will keep an eye open for a way to reproduce it, or whether it disappears for good after our next upgrade.
Yours, Steffen


Hello.

I hit the same issue this morning:

  • a VM had an issue and went into POWEROFF
  • one of our hooks brought it back to RUNNING by resuming it

This ended with the role being in WARNING, and therefore the service being in WARNING.

I solved the VM issue by simply terminating the VM, but I don’t know how to ask OpenNebula to recalculate the service state.

I can’t scale the role to get the VM back; luckily I had several VMs for that role, but now I don’t know what to do 🤷

Hello @DaD,

There is an open issue for this in our GitHub tracker. It will be fixed in an upcoming release.

In the meantime, you can restart the OneFlow server to clear the WARNING state.
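A sketch of that workaround, assuming a systemd-based front-end where the OneFlow daemon runs as the `opennebula-flow` unit (the unit name may differ on older or packaged-from-source installs); the command is echoed rather than executed here:

```shell
# Restart the OneFlow server on the front-end (run with root privileges).
# On SysV-style installs the equivalent is: service opennebula-flow restart
cmd="systemctl restart opennebula-flow"
echo "would run: $cmd"   # on the real front-end, run: sudo $cmd
```

After the restart, the watchdog re-evaluates the roles, and a service whose VMs are all healthy should drop back to RUNNING.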

Best,
Álex.
