I’ve been running a home OpenNebula 4.10.2 cluster for a few months, using Ceph as the datastore for the main images. Everything had been working well for many weeks since the last reboot, when all VMs were restarted (only because I moved house!).
I tried to launch a new VM, but it is stuck in the PENDING state. Every 30 seconds, /var/log/one/sched.log shows this:
Mon Jul 6 21:55:28 2015 [Z0][VM][D]: Pending/rescheduling VM and capacity requirements:
VM   CPU  Memory  System DS  Image DS
------------------------------------------------------------
44   100  524288  0          DS 105: 0
Mon Jul 6 21:55:28 2015 [Z0][HOST][D]: Discovered Hosts (enabled):
0 1
Mon Jul 6 21:55:28 2015 [Z0][SCHED][I]: Scheduling Results:
Virtual Machine: 44
PRI ID - HOSTS
------------------------
-1 1
-1 0
PRI ID - DATASTORES
------------------------
0 0
Mon Jul 6 21:55:28 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 1 filtered out. Not enough capacity.
Mon Jul 6 21:55:28 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 1. Filtering out host.
Mon Jul 6 21:55:28 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 0 filtered out. Not enough capacity.
Mon Jul 6 21:55:28 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 0. Filtering out host.
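In a long sched.log it can help to pull out just the hosts the scheduler rejected and why. The sketch below is a minimal, hypothetical helper (`filtered_hosts` is not part of OpenNebula); it assumes the message wording shown in the excerpt above, which may differ in other versions.

```python
import re
from collections import defaultdict

# Message formats copied from the sched.log excerpt above (assumed stable).
DS_FILTER = re.compile(
    r"VM (\d+): Local Datastore (\d+) in Host (\d+) filtered out\. (.+)$"
)
HOST_FILTER = re.compile(r"VM (\d+): No suitable System DS found for Host: (\d+)\.")


def filtered_hosts(log_lines):
    """Map VM id -> {host id: reason} for every host the scheduler dropped."""
    result = defaultdict(dict)
    for line in log_lines:
        m = DS_FILTER.search(line)
        if m:
            vm, ds, host, reason = m.groups()
            result[int(vm)][int(host)] = f"system DS {ds}: {reason}"
            continue
        m = HOST_FILTER.search(line)
        if m:
            vm, host = (int(g) for g in m.groups())
            # Keep the more specific datastore reason if it was already seen.
            result[vm].setdefault(host, "no suitable system DS")
    return dict(result)


log = [
    "Mon Jul 6 21:55:28 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 1 filtered out. Not enough capacity.",
    "Mon Jul 6 21:55:28 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 1. Filtering out host.",
    "Mon Jul 6 21:55:28 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 0 filtered out. Not enough capacity.",
    "Mon Jul 6 21:55:28 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 0. Filtering out host.",
]
print(filtered_hosts(log))
```

Here both hosts come back with the same reason, which points at the system datastore rather than at CPU or memory.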
Sure enough, it’s on to something there. DS 0 has an undefined capacity:
[root@userver1 ~]# onedatastore list
ID   NAME       SIZE  AVAIL  CLUSTER  IMAGES  TYPE  DS    TM
0    system     -     -      -        0       sys   -     ssh
1    default    7.8G  51%    -        0       img   fs    shared
2    files      7.8G  51%    -        0       fil   fs    ssh
105  cephstore  4T    11%    -        19      img   ceph  ceph
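The symptom in this listing is a SIZE of "-". A small sketch like the one below could flag such datastores automatically; `datastores_without_capacity` is a hypothetical helper, and it assumes whitespace-separated columns with no spaces inside any field, as in the output above.

```python
def datastores_without_capacity(listing):
    """Return (id, name, tm) for every datastore whose SIZE column is '-'."""
    lines = [line for line in listing.strip().splitlines() if line.strip()]
    header = lines[0].split()
    missing = []
    for line in lines[1:]:
        # Assumes no field contains spaces, which holds for the listing above.
        row = dict(zip(header, line.split()))
        if row.get("SIZE") == "-":
            missing.append((int(row["ID"]), row["NAME"], row["TM"]))
    return missing


listing = """
ID NAME SIZE AVAIL CLUSTER IMAGES TYPE DS TM
0 system - - - 0 sys - ssh
1 default 7.8G 51% - 0 img fs shared
2 files 7.8G 51% - 0 fil fs ssh
105 cephstore 4T 11% - 19 img ceph ceph
"""
print(datastores_without_capacity(listing))  # [(0, 'system', 'ssh')]
```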
As you can see, I’m using the ssh TM. SSH keys for passwordless login as oneadmin are working fine. Look:
[root@userver1 ~]# su - oneadmin
Last login: Sun Jul 5 17:27:00 BST 2015 from userver2 on pts/1
[oneadmin@userver1 ~]$ ssh userver2
Warning: Permanently added 'userver2,10.0.0.62' (ECDSA) to the list of known hosts.
Last login: Sun Jul 5 17:26:56 2015 from userver1
[oneadmin@userver2 ~]$ ssh userver1
Warning: Permanently added 'userver1,10.0.0.61' (ECDSA) to the list of known hosts.
Last login: Mon Jul 6 22:00:12 2015
[oneadmin@userver1 ~]$ exit
logout
Connection to userver1 closed.
[oneadmin@userver2 ~]$ exit
logout
Connection to userver2 closed.
[oneadmin@userver1 ~]$
How can I ‘bump’ DS 0 and get it to show a size so that I can launch VMs again?
The ssh datastore does not have a single capacity because the datastore is not global. Instead, the output of each ‘onehost show’ will contain the datastore capacity for that host’s local disk.
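One way to picture the consequence: for a non-shared system datastore, the scheduler has to rely on the free space each host reports for it. The sketch below is a simplified, hypothetical model of that check, not OpenNebula’s scheduler code; the dictionary layout and the rule that an unreported datastore makes the host unsuitable are assumptions chosen to match the "filtered out. Not enough capacity." messages in the log above.

```python
def hosts_for_vm(system_ds_id, required_mb, host_ds_free_mb):
    """Return the host ids that could take a VM on a non-shared system DS.

    host_ds_free_mb maps host id -> {datastore id: free MB reported by that
    host's monitoring}. A host reporting nothing for the datastore is
    filtered out, like "Local Datastore 0 in Host N filtered out".
    """
    suitable = []
    for host_id, ds_free in sorted(host_ds_free_mb.items()):
        free = ds_free.get(system_ds_id)
        if free is None:
            continue  # unknown capacity: treat the host as unsuitable
        if free >= required_mb:
            suitable.append(host_id)
    return suitable


# Neither host reports anything for system DS 0: every host is rejected,
# even though the VM needs 0 MB on the system datastore.
print(hosts_for_vm(0, 0, {0: {}, 1: {}}))  # []
# Hypothetical figures once the monitoring probes report the local disk.
print(hosts_for_vm(0, 0, {0: {0: 4000}, 1: {0: 4000}}))  # [0, 1]
```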
Thanks for replying. I take your point about ssh not being a shared filesystem (if that’s what you mean by ‘not global’?); however, I’m not sure how that moves me closer to being able to launch VMs again.
And the output from onehost show for both nodes is as follows:
[root@userver1 ~]# onehost show 0
HOST 0 INFORMATION
ID : 0
NAME : userver1
CLUSTER : -
STATE : MONITORED
IM_MAD : kvm
VM_MAD : kvm
VN_MAD : dummy
LAST MONITORING TIME : 07/08 10:57:39
HOST SHARES
TOTAL MEM : 15.3G
USED MEM (REAL) : 0K
USED MEM (ALLOCATED) : 4.4G
TOTAL CPU : 2200
USED CPU (REAL) : 0
USED CPU (ALLOCATED) : 400
RUNNING VMS : 3
MONITORING INFORMATION
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver1.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-16000000"
VERSION="4.10.2"
VIRTUAL MACHINES
ID USER GROUP NAME STAT UCPU UMEM HOST TIME
27 oneadmin oneadmin squeezeserver runn 1 1.2G userver1 100d 17h50
32 oneadmin oneadmin Windows7_janie runn 0 3.6G userver1 97d 12h14
41 oneadmin oneadmin OpenMediaVault runn 2 675.5M userver1 44d 09h44
[root@userver1 ~]# onehost show 1
HOST 1 INFORMATION
ID : 1
NAME : userver2
CLUSTER : -
STATE : MONITORED
IM_MAD : kvm
VM_MAD : kvm
VN_MAD : dummy
LAST MONITORING TIME : 07/08 10:57:39
HOST SHARES
TOTAL MEM : 9.5G
USED MEM (REAL) : 0K
USED MEM (ALLOCATED) : 3.6G
TOTAL CPU : 2200
USED CPU (REAL) : 0
USED CPU (ALLOCATED) : 400
RUNNING VMS : 3
MONITORING INFORMATION
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver2.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-10000000"
VERSION="4.10.2"
VIRTUAL MACHINES
ID USER GROUP NAME STAT UCPU UMEM HOST TIME
17 oneadmin oneadmin Windows7_gdh runn 0 2G userver2 101d 09h07
23 oneadmin oneadmin eddie runn 3 1.3G userver2 101d 06h57
40 oneadmin oneadmin pfSense runn 0 384M userver2 44d 09h50
Any ideas (even temporary workarounds) would be really welcome!
Maybe you need to tune the DATASTORE_LOCATION in oned.conf. This is the base path for the ssh datastore, and it needs to point to an existing path on the hosts. It can also be set per cluster.
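On each host, a system datastore lives in a directory named after its ID under DATASTORE_LOCATION, with one subdirectory per VM. The sketch below only builds those paths and checks which are missing on the machine where it runs; the helper names are hypothetical, and /var/lib/one/datastores is simply the value from the commented-out line in oned.conf.

```python
import os
from pathlib import PurePosixPath


def system_ds_path(datastore_location, ds_id, vm_id=None):
    """Per-host directory of a system datastore, or of one VM inside it."""
    base = PurePosixPath(datastore_location) / str(ds_id)
    return base if vm_id is None else base / str(vm_id)


def missing_dirs(paths):
    """Return the paths that are not directories on the machine running this."""
    return [str(p) for p in paths if not os.path.isdir(p)]


print(system_ds_path("/var/lib/one/datastores", 0))      # /var/lib/one/datastores/0
print(system_ds_path("/var/lib/one/datastores", 0, 44))  # /var/lib/one/datastores/0/44
```

Running `missing_dirs` on each host (userver1 and userver2) would show whether the base path exists there, which is the condition described above.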
That’s a great suggestion, and I had high hopes. That section of oned.conf was commented out:
#DATASTORE_LOCATION = /var/lib/one/datastores
which I expect means it’s using the default value shown. I removed the # anyway, since /var/lib/one/datastores is where the other DSs live:
[root@userver1 ~]# ls -l /var/lib/one/datastores/
total 0
drwxr-x--- 12 oneadmin oneadmin 91 May 25 01:14 0
drwxr-x--- 2 oneadmin oneadmin 6 Jan 15 16:26 1
drwxr-xr-x 2 oneadmin oneadmin 6 Mar 21 21:34 2
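The reasoning that a commented-out line leaves the default in effect can be sketched as a tiny parser. This is a hypothetical helper, not how oned itself reads its configuration; treating /var/lib/one/datastores as the default is an assumption based on the value shown in the commented line, and when the setting appears more than once this sketch keeps the last one.

```python
import re

# Assumed default: the value shown in the commented-out oned.conf line.
DEFAULT_LOCATION = "/var/lib/one/datastores"
SETTING = re.compile(r'^\s*DATASTORE_LOCATION\s*=\s*"?([^"#\s]+)"?')


def datastore_location(conf_text, default=DEFAULT_LOCATION):
    """Return the active DATASTORE_LOCATION, ignoring commented-out lines."""
    value = default
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented out: has no effect
        m = SETTING.match(line)
        if m:
            value = m.group(1)  # this sketch lets the last setting win
    return value


print(datastore_location("#DATASTORE_LOCATION = /var/lib/one/datastores"))
print(datastore_location("DATASTORE_LOCATION = /var/lib/one/datastores"))
```

Both calls print the same path, which is consistent with uncommenting the line making no difference here.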
After that I ran /etc/init.d/opennebula restart and recreated the VM; however, the outcome is exactly the same:
Thu Jul 9 19:34:01 2015 [Z0][VM][D]: Pending/rescheduling VM and capacity requirements:
VM   CPU  Memory  System DS  Image DS
------------------------------------------------------------
44   100  524288  0          DS 105: 0
Thu Jul 9 19:34:01 2015 [Z0][HOST][D]: Discovered Hosts (enabled):
0 1
Thu Jul 9 19:34:01 2015 [Z0][SCHED][I]: Scheduling Results:
Virtual Machine: 44
PRI ID - HOSTS
------------------------
-1 1
-1 0
PRI ID - DATASTORES
------------------------
0 0
Thu Jul 9 19:34:01 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 1 filtered out. Not enough capacity.
Thu Jul 9 19:34:01 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 1. Filtering out host.
Thu Jul 9 19:34:01 2015 [Z0][SCHED][D]: VM 44: Local Datastore 0 in Host 0 filtered out. Not enough capacity.
Thu Jul 9 19:34:01 2015 [Z0][SCHED][I]: VM 44: No suitable System DS found for Host: 0. Filtering out host.
Somehow the information is not getting to the probe or oned. Can you check with the onehost sync --force command? You should see the available space in the onehost show output. Note that DATASTORE_LOCATION refers to paths on the hosts (userver1 and userver2).
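To check the onehost show output for the available space, one could scan the MONITORING INFORMATION section for datastore-related attributes. This is a hypothetical sketch: the assumption that the 4.x probes report single-line keys starting with DS_LOCATION (for example DS_LOCATION_FREE_MB) is mine, so compare against the output of a host that is known to work. Multi-line vector attributes are not parsed.

```python
import re

ATTR = re.compile(r'^([A-Z_]+)="(.*)"$')


def monitoring_attributes(show_output):
    """Collect KEY="VALUE" pairs from the MONITORING INFORMATION section."""
    attrs = {}
    in_section = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line == "MONITORING INFORMATION":
            in_section = True
            continue
        if line == "VIRTUAL MACHINES":
            break  # end of the monitoring section
        m = ATTR.match(line) if in_section else None
        if m:
            attrs[m.group(1)] = m.group(2)
    return attrs


def reports_local_ds_space(attrs):
    # Assumption: the datastore probe reports keys such as DS_LOCATION_FREE_MB.
    return any(key.startswith("DS_LOCATION") for key in attrs)


excerpt = """MONITORING INFORMATION
ARCH="x86_64"
HYPERVISOR="kvm"
RESERVED_MEM="-16000000"
VERSION="4.10.2"
VIRTUAL MACHINES"""
print(reports_local_ds_space(monitoring_attributes(excerpt)))  # False
```

On the userver1 output above, no such key is present, which fits the missing capacity.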
[oneadmin@userver1 root]$ onehost sync --force
* Adding userver1 to upgrade
* Adding userver2 to upgrade
[========================================] 2/2 userver2
All hosts updated successfully.
So, no problem from that point… let’s look at the onehost show output again:
HOST 0 INFORMATION
ID : 0
NAME : userver1
CLUSTER : -
STATE : MONITORED
IM_MAD : kvm
VM_MAD : kvm
VN_MAD : dummy
LAST MONITORING TIME : 07/14 20:12:03
HOST SHARES
TOTAL MEM : 15.3G
USED MEM (REAL) : 0K
USED MEM (ALLOCATED) : 4.4G
TOTAL CPU : 2200
USED CPU (REAL) : 0
USED CPU (ALLOCATED) : 400
RUNNING VMS : 3
MONITORING INFORMATION
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver1.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-16000000"
VERSION="4.10.2"
VIRTUAL MACHINES
ID USER GROUP NAME STAT UCPU UMEM HOST TIME
27 oneadmin oneadmin squeezeserver runn 1 1.2G userver1 107d 03h05
32 oneadmin oneadmin Windows7_janie runn 0 3.6G userver1 103d 21h29
41 oneadmin oneadmin OpenMediaVault runn 2 675.5M userver1 50d 18h58
and on the second host…
[oneadmin@userver1 root]$ onehost show 1
HOST 1 INFORMATION
ID : 1
NAME : userver2
CLUSTER : -
STATE : MONITORED
IM_MAD : kvm
VM_MAD : kvm
VN_MAD : dummy
LAST MONITORING TIME : 07/14 20:12:03
HOST SHARES
TOTAL MEM : 9.5G
USED MEM (REAL) : 0K
USED MEM (ALLOCATED) : 3.6G
TOTAL CPU : 2200
USED CPU (REAL) : 0
USED CPU (ALLOCATED) : 400
RUNNING VMS : 3
MONITORING INFORMATION
ARCH="x86_64"
CPUSPEED="800"
HOSTNAME="userver2.acentral.co.uk"
HYPERVISOR="kvm"
MODELNAME="AMD Turion(tm) II Neo N54L Dual-Core Processor"
NETRX="0"
NETTX="0"
RESERVED_CPU="-2200"
RESERVED_MEM="-10000000"
VIRTUAL MACHINES
ID USER GROUP NAME STAT UCPU UMEM HOST TIME
17 oneadmin oneadmin Windows7_gdh runn 0 2G userver2 107d 18h22
23 oneadmin oneadmin eddie runn 3 1.3G userver2 107d 16h12
40 oneadmin oneadmin pfSense runn 0 384M userver2 50d 19h05
There doesn’t seem to be any change… but wait!
[oneadmin@userver1 root]$ onedatastore list
ID   NAME       SIZE  AVAIL  CLUSTER  IMAGES  TYPE  DS    TM
0    system     7.8G  50%    -        0       sys   -     shared
1    default    7.8G  50%    -        0       img   fs    shared
2    files      7.8G  50%    -        0       fil   fs    ssh
105  cephstore  4T    11%    -        19      img   ceph  ceph
The careful reader will notice that I’m now using the shared TM rather than ssh. I changed that last night and didn’t notice any difference in the datastore size (still blank), even after restarting the opennebula system service.
I can now deploy VMs again with the system datastore as long as I do it on userver
I then went the final step: I changed back to the ssh TM and ran onehost sync --force once again.
Great news: it all works. Thank you so much for your help; running onehost sync --force as the oneadmin user was the magic potion!