New VMs stuck in PENDING - System datastore has no size

Hi all,

I’ve been running a home cluster on OpenNebula 4.10.2 for a few months, using Ceph as the datastore for the main images. Everything had been working well for many weeks since the last reboot, when all VMs were restarted (only because I moved house!).

I went to launch a new VM and it’s stuck in the PENDING state. Every 30 seconds, /var/log/one/sched.log shows this:

Mon Jul  6 21:55:28 2015 [Z0][VM][D]: Pending/rescheduling VM and capacity requirements:
      VM  CPU      Memory   System DS  Image DS
------------------------------------------------------------
      44  100      524288           0  DS 105: 0 
Mon Jul  6 21:55:28 2015 [Z0][HOST][D]: Discovered Hosts (enabled):
 0 1
Mon Jul  6 21:55:28 2015 [Z0][SCHED][I]: Scheduling Results:
Virtual Machine: 44

    PRI    ID - HOSTS
    ------------------------
    -1    1
    -1    0

    PRI    ID - DATASTORES
    ------------------------
    0    0


Mon Jul  6 21:55:28 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 1 filtered out. Not enough capacity.
Mon Jul  6 21:55:28 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 1. Filtering out host.
Mon Jul  6 21:55:28 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 0 filtered out. Not enough capacity.
Mon Jul  6 21:55:28 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 0. Filtering out host.

Sure enough, the scheduler is on to something there: DS 0 has an undefined capacity.

[root@userver1 ~]# onedatastore list
  ID NAME                SIZE AVAIL CLUSTER      IMAGES TYPE DS       TM      
   0 system                 - -     -                 0 sys  -        ssh
   1 default             7.8G 51%   -                 0 img  fs       shared
   2 files               7.8G 51%   -                 0 fil  fs       ssh
 105 cephstore             4T 11%   -                19 img  ceph     ceph

As you can see, I’m using the ssh TM. SSH keys for passwordless login as oneadmin are working fine - look :smile:

[root@userver1 ~]# su - oneadmin
Last login: Sun Jul  5 17:27:00 BST 2015 from userver2 on pts/1
[oneadmin@userver1 ~]$ ssh userver2
Warning: Permanently added 'userver2,10.0.0.62' (ECDSA) to the list of known hosts.
Last login: Sun Jul  5 17:26:56 2015 from userver1
[oneadmin@userver2 ~]$ ssh userver1
Warning: Permanently added 'userver1,10.0.0.61' (ECDSA) to the list of known hosts.
Last login: Mon Jul  6 22:00:12 2015
[oneadmin@userver1 ~]$ exit
logout
Connection to userver1 closed.
[oneadmin@userver2 ~]$ exit
logout
Connection to userver2 closed.
[oneadmin@userver1 ~]$ 

How can I ‘bump’ DS 0 and get it to show a size so that I can launch VMs again?

Cheers,
Gavin.

Hi,

The ssh datastore does not show a capacity because the datastore is not global; the output of ‘onehost show’ for each host will contain the datastore capacity for that host’s local disk.
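
For example, you can grep the host monitoring attributes for the local datastore capacity. The DS_LOCATION_*_MB names below are what I recall the 4.x monitoring probes reporting, so treat them as an assumption and check against your version:

onehost show 0 | grep DS_LOCATION
onehost show 1 | grep DS_LOCATION

If those attributes are missing or zero, the probe is not reporting the capacity of the hosts’ local datastore directory at all.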

Hi Carlos,

Thanks for replying - I take your point about ssh not being a shared filesystem (if that’s what you mean by ‘not global’?), but I’m currently uncertain how that moves me closer to being able to launch VMs again.

There’s plenty of local disk space on both nodes:

[root@userver1 ~]# df
Filesystem                  1K-blocks       Used Available Use% Mounted on
/dev/mapper/centos-root       8181760    4006624   4175136  49% /
devtmpfs                      8078556          0   8078556   0% /dev
tmpfs                         8118444          4   8118440   1% /dev/shm
tmpfs                         8118444     785416   7333028  10% /run
tmpfs                         8118444          0   8118444   0% /sys/fs/cgroup
/dev/mapper/vg0-vol_2tb_01 2146435072 1919362588 227072484  90% /mnt/2tb_01
/dev/sdc1                      508588     106172    402416  21% /boot
[root@userver2 ~]# df
Filesystem                        1K-blocks       Used Available Use% Mounted on
/dev/mapper/centos_userver2-root    8181760    3258112   4923648  40% /
devtmpfs                            4979464          0   4979464   0% /dev
tmpfs                               5021988          4   5021984   1% /dev/shm
tmpfs                               5021988     508588   4513400  11% /run
tmpfs                               5021988          0   5021988   0% /sys/fs/cgroup
/dev/mapper/vg0-vol_2tb_01       2146435072 1919371684 227063388  90% /mnt/2tb_01
/dev/sda1                            508588     150100    358488  30% /boot

And the output from onehost show for both nodes is as follows:

[root@userver1 ~]# onehost show 0
HOST 0 INFORMATION                                                              
ID                    : 0                   
NAME                  : userver1            
CLUSTER               : -                   
STATE                 : MONITORED           
IM_MAD                : kvm                 
VM_MAD                : kvm                 
VN_MAD                : dummy               
LAST MONITORING TIME  : 07/08 10:57:39      

HOST SHARES                                                                     
TOTAL MEM             : 15.3G               
USED MEM (REAL)       : 0K                  
USED MEM (ALLOCATED)  : 4.4G                
TOTAL CPU             : 2200                
USED CPU (REAL)       : 0                   
USED CPU (ALLOCATED)  : 400                 
RUNNING VMS           : 3                   

MONITORING INFORMATION                                                          
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver1.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-16000000"
VERSION="4.10.2"

VIRTUAL MACHINES

    ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
    27 oneadmin oneadmin squeezeserver   runn    1    1.2G userver1   100d 17h50
    32 oneadmin oneadmin Windows7_janie  runn    0    3.6G userver1    97d 12h14
    41 oneadmin oneadmin OpenMediaVault  runn    2  675.5M userver1    44d 09h44
[root@userver1 ~]# onehost show 1
HOST 1 INFORMATION                                                              
ID                    : 1                   
NAME                  : userver2            
CLUSTER               : -                   
STATE                 : MONITORED           
IM_MAD                : kvm                 
VM_MAD                : kvm                 
VN_MAD                : dummy               
LAST MONITORING TIME  : 07/08 10:57:39      

HOST SHARES                                                                     
TOTAL MEM             : 9.5G                
USED MEM (REAL)       : 0K                  
USED MEM (ALLOCATED)  : 3.6G                
TOTAL CPU             : 2200                
USED CPU (REAL)       : 0                   
USED CPU (ALLOCATED)  : 400                 
RUNNING VMS           : 3                   

MONITORING INFORMATION                                                          
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver2.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-10000000"
VERSION="4.10.2"

VIRTUAL MACHINES

    ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
    17 oneadmin oneadmin Windows7_gdh    runn    0      2G userver2   101d 09h07
    23 oneadmin oneadmin eddie           runn    3    1.3G userver2   101d 06h57
    40 oneadmin oneadmin pfSense         runn    0    384M userver2    44d 09h50

Any ideas (even temporary workarounds) would be really welcome! :smile:

Cheers,
Gavin.

Maybe you need to tune the DATASTORE_LOCATION in oned.conf. This is the base path for the ssh datastore, and it needs to point to an existing path on the hosts. This can also be set per cluster.
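
For example (default path shown; adjust it to wherever the datastores actually live on your hosts):

# in /etc/one/oned.conf on the front-end, then restart oned
DATASTORE_LOCATION = /var/lib/one/datastores

# or per cluster, by adding the same attribute to the cluster template
onecluster update <cluster_id>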

Hi @ruben - thanks for the reply!

That’s a great suggestion and I had high hopes - that section of oned.conf was commented out:

#DATASTORE_LOCATION  = /var/lib/one/datastores

which I expect means it’s taking the default value shown. I removed the # since /var/lib/one/datastores is where the other DSs live:

[root@userver1 ~]# ls -l /var/lib/one/datastores/
total 0
drwxr-x--- 12 oneadmin oneadmin 91 May 25 01:14 0
drwxr-x---  2 oneadmin oneadmin  6 Jan 15 16:26 1
drwxr-xr-x  2 oneadmin oneadmin  6 Mar 21 21:34 2

After that I issued /etc/init.d/opennebula restart and recreated the VM - however the outcome is exactly the same:

Thu Jul  9 19:34:01 2015 [Z0][VM][D]: Pending/rescheduling VM and capacity requirements:
      VM  CPU      Memory   System DS  Image DS
------------------------------------------------------------
      44  100      524288           0  DS 105: 0 
Thu Jul  9 19:34:01 2015 [Z0][HOST][D]: Discovered Hosts (enabled):
 0 1
Thu Jul  9 19:34:01 2015 [Z0][SCHED][I]: Scheduling Results:
Virtual Machine: 44

	PRI	ID - HOSTS
	------------------------
	-1	1
	-1	0

	PRI	ID - DATASTORES
	------------------------
	0	0


Thu Jul  9 19:34:01 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 1 filtered out. Not enough capacity.
Thu Jul  9 19:34:01 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 1. Filtering out host.
Thu Jul  9 19:34:01 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 0 filtered out. Not enough capacity.
Thu Jul  9 19:34:01 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 0. Filtering out host.

What else could be wrong?

Cheers,
Gavin.

Somehow the information is not getting to the probe or to oned. Can you check with the onehost sync --force command? You need to see the available space in the onehost show output. Note that DATASTORE_LOCATION is for the hosts (userver1 and userver2).

Ruben,

[oneadmin@userver1 root]$ onehost sync --force
* Adding userver1 to upgrade
* Adding userver2 to upgrade
[========================================] 2/2 userver2                         
All hosts updated successfully.

So, no problem from that point… let’s look again at the onehost show output:

HOST 0 INFORMATION                                                              
ID                    : 0                   
NAME                  : userver1            
CLUSTER               : -                   
STATE                 : MONITORED           
IM_MAD                : kvm                 
VM_MAD                : kvm                 
VN_MAD                : dummy               
LAST MONITORING TIME  : 07/14 20:12:03      

HOST SHARES                                                                     
TOTAL MEM             : 15.3G               
USED MEM (REAL)       : 0K                  
USED MEM (ALLOCATED)  : 4.4G                
TOTAL CPU             : 2200                
USED CPU (REAL)       : 0                   
USED CPU (ALLOCATED)  : 400                 
RUNNING VMS           : 3                   

MONITORING INFORMATION                                                          
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver1.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-16000000"
VERSION="4.10.2"

VIRTUAL MACHINES

    ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
    27 oneadmin oneadmin squeezeserver   runn    1    1.2G userver1   107d 03h05
    32 oneadmin oneadmin Windows7_janie  runn    0    3.6G userver1   103d 21h29
    41 oneadmin oneadmin OpenMediaVault  runn    2  675.5M userver1    50d 18h58

and on the second host…

[oneadmin@userver1 root]$ onehost show 1
HOST 1 INFORMATION                                                              
ID                    : 1                   
NAME                  : userver2            
CLUSTER               : -                   
STATE                 : MONITORED           
IM_MAD                : kvm                 
VM_MAD                : kvm                 
VN_MAD                : dummy               
LAST MONITORING TIME  : 07/14 20:12:03      

HOST SHARES                                                                     
TOTAL MEM             : 9.5G                
USED MEM (REAL)       : 0K                  
USED MEM (ALLOCATED)  : 3.6G                
TOTAL CPU             : 2200                
USED CPU (REAL)       : 0                   
USED CPU (ALLOCATED)  : 400                 
RUNNING VMS           : 3                   

MONITORING INFORMATION                                                          
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver2.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-10000000"

VIRTUAL MACHINES

    ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
    17 oneadmin oneadmin Windows7_gdh    runn    0      2G userver2   107d 18h22
    23 oneadmin oneadmin eddie           runn    3    1.3G userver2   107d 16h12
    40 oneadmin oneadmin pfSense         runn    0    384M userver2    50d 19h05

Doesn’t seem to be any change… but wait!

[oneadmin@userver1 root]$ onedatastore list
  ID NAME                SIZE AVAIL CLUSTER      IMAGES TYPE DS       TM      
   0 system              7.8G 50%   -                 0 sys  -        shared
   1 default             7.8G 50%   -                 0 img  fs       shared
   2 files               7.8G 50%   -                 0 fil  fs       ssh
 105 cephstore             4T 11%   -                19 img  ceph     ceph

The careful reader will notice that I’m now using the shared TM rather than ssh - I changed that last night and didn’t notice any difference in the datastore size (still blank) even after restarting the opennebula system service.
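
(For anyone hitting this later: I switched the TM by editing the system datastore template, roughly like this. The SHARED attribute is my understanding of how the ssh and shared system DS templates differ in 4.x, so double-check it for your version.)

[oneadmin@userver1 ~]$ onedatastore update 0
# in the editor, for the shared TM:
TM_MAD="shared"
SHARED="YES"
# and to go back to ssh:
TM_MAD="ssh"
SHARED="NO"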

I can now deploy VMs again with the system datastore as long as I do it on userver :smile:

I did go the final step and changed back to the ssh TM, then ran onehost sync --force once again.

Great news - it all works - thank you so much for your help - onehost sync --force as the oneadmin user was the magic potion!

Cheers,
Gavin.

Just had to add that, five years later, I had the same problem on this old system and onehost sync --force fixed it again :smiley: