We’re experiencing something weird (again…) with OpenNebula Sunstone.
First, let me show you the command-line equivalent of what I’m doing (just to make sure it’s all clear):
As a first step we list all hosts:
$ onehost list
ID NAME CLUSTER RVM ALLOCATED_CPU ALLOCATED_MEM STAT
9 sf01.**** Cluster 0 0 / 800 (0%) 0K / 31.3G (0%) on
11 sf02.**** Cluster 0 0 / 800 (0%) 0K / 31.3G (0%) on
12 sf03.**** Cluster 0 0 / 800 (0%) 0K / 31.3G (0%) on
13 sf04.**** Cluster 0 0 / 800 (0%) 0K / 31.3G (0%) on
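(If I’m not mistaken, the same listing can also be dumped as raw XML with the --xml flag, which is roughly the data Sunstone consumes through the API:)
$ onehost list -x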
Then we show the information regarding one specific host:
$ onehost show 9
HOST 9 INFORMATION
ID : 9
NAME : sf01.****
CLUSTER : Cluster
STATE : MONITORED
IM_MAD : kvm
VM_MAD : kvm
VN_MAD : dummy
LAST MONITORING TIME : 03/17 12:44:18
(and so on)
Up to this point it all works nicely.
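(The same goes for the raw XML of a single host, which is closer to what Sunstone actually receives; as far as I know it can be dumped with:)
$ onehost show 9 -x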
However, when we use the Sunstone interface the following happens:
The host listing is shown instantly, with up-to-date status and all.
When we then click a host to show its detailed information we get the loading page, and at that point nothing happens… It’s just stuck.
Looking at the logging we can see that the GET request does reach Sunstone, but we don’t get any errors. Monitoring oned.log doesn’t give us anything to work with either.
The funny part, however, is that when we first add a host (using Sunstone) and immediately disable it, we can view the host info (if we’re quick). Hosts that have been monitored can’t be viewed (even after disabling).
At first we thought it was our firewall, but all OpenNebula components are on the same machine, and if we’re quick we can (eventually) get some information. Once a host is properly added (i.e. its status is on), though, it simply fails to display in Sunstone.
Does anyone know where I should start looking? Debug logging is enabled for both Sunstone and oned, but we don’t get any information regarding errors (it seems like the requests disappear into /dev/null or something…).
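For reference, this is roughly how we turned the debug logging up (standard config locations on our install; adjust if yours differ):
# /etc/one/sunstone-server.conf
:debug_level: 3
# /etc/one/oned.conf
DEBUG_LEVEL = 3
$ tail -f /var/log/one/sunstone.log /var/log/one/oned.log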
Any insight on this matter would be greatly appreciated!!!
Maybe the full host information is important; below you’ll find the host along with all its attributes.
$ onehost show 9
HOST 9 INFORMATION
ID : 9
NAME : sf01.****
CLUSTER : Cluster
STATE : MONITORED
IM_MAD : kvm
VM_MAD : kvm
VN_MAD : dummy
LAST MONITORING TIME : 03/17 12:44:18
HOST SHARES
TOTAL MEM : 31.3G
USED MEM (REAL) : 923.4M
USED MEM (ALLOCATED) : 0K
TOTAL CPU : 800
USED CPU (REAL) : 12
USED CPU (ALLOCATED) : 0
RUNNING VMS : 0
MONITORING INFORMATION
ARCH="x86_64"
ARCH="x86_64"
ARCH="x86_64"
ARCH="x86_64"
CPUSPEED="2003"
CPUSPEED="2003"
CPUSPEED="2003"
CPUSPEED="2003"
HOSTNAME="sf01.****"
HOSTNAME="sf01.****"
HOSTNAME="sf01.****"
HOSTNAME="sf01.****"
HYPERVISOR="kvm"
HYPERVISOR="kvm"
HYPERVISOR="kvm"
HYPERVISOR="kvm"
MODELNAME="Intel(R) Xeon(R) CPU X5355 @ 2.66GHz"
MODELNAME="Intel(R) Xeon(R) CPU X5355 @ 2.66GHz"
MODELNAME="Intel(R) Xeon(R) CPU X5355 @ 2.66GHz"
MODELNAME="Intel(R) Xeon(R) CPU X5355 @ 2.66GHz"
NETRX="5079012522"
NETRX="5079041539"
NETRX="5079148700"
NETRX="5079166204"
NETTX="5367373336"
NETTX="5367414144"
NETTX="5367561384"
NETTX="5367583010"
RESERVED_CPU=""
RESERVED_MEM=""
VERSION="4.12.0"
VERSION="4.12.0"
VERSION="4.12.0"
VERSION="4.12.0"
VIRTUAL MACHINES
ID USER GROUP NAME STAT UCPU UMEM HOST TIME
I think the problem is the repeated keys in the monitoring information. Could you check whether there are any duplicated probes in /var/lib/one/remotes/im/kvm-probes.d (front end) or /var/tmp/one/im/kvm-probes.d (nodes)? There should be only one key=value pair per attribute in the host monitoring information.
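A quick way to spot the repeated attributes is something along these lines:
$ onehost show 9 | awk -F= '/=/ {print $1}' | sort | uniq -d
Any key printed there appears more than once in the monitoring data.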
I’ve checked that directory and it turned out there were a bunch of .rpmsave files.
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.2K Mar 10 00:43 architecture.sh
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.2K Jan 15 17:26 architecture.sh.rpmsave
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.4K Mar 10 00:43 collectd-client-shepherd.sh
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.4K Jan 15 17:26 collectd-client-shepherd.sh.rpmsave
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.4K Mar 10 00:43 cpu.sh
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.4K Jan 15 17:26 cpu.sh.rpmsave
3.5K -rwxr-xr-x. 1 oneadmin oneadmin 3.2K Mar 10 00:43 kvm.rb
3.5K -rwxr-xr-x. 1 oneadmin oneadmin 3.2K Jan 15 17:26 kvm.rb.rpmsave
2.5K -rwxr-xr-x. 1 oneadmin oneadmin 2.2K Mar 10 00:43 monitor_ds.sh
2.5K -rwxr-xr-x. 1 oneadmin oneadmin 2.2K Jan 15 17:26 monitor_ds.sh.rpmsave
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.2K Mar 10 00:43 name.sh
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.2K Jan 15 17:26 name.sh.rpmsave
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.2K Mar 10 00:43 poll.sh
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.2K Jan 15 17:26 poll.sh.rpmsave
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.3K Mar 10 00:43 version.sh
1.5K -rwxr-xr-x. 1 oneadmin oneadmin 1.3K Jan 15 17:26 version.sh.rpmsave
So I’ve searched for those and removed all of them from /var/lib/one:
cd /var/lib/one
find . -name '*.rpmsave' | xargs rm
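(In hindsight it’s probably safer to list the matches first and only delete afterwards, e.g. with GNU find:)
find /var/lib/one -name '*.rpmsave' -print
find /var/lib/one -name '*.rpmsave' -delete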
The front end and the KVM nodes should be identical in this respect, since they share the same /var/lib/one (stored on a GlusterFS volume).
After that I restarted OpenNebula and did the test again.
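(On our front end that restart boils down to something like the following, assuming a systemd-based install; with older init scripts the commands differ:)
systemctl restart opennebula opennebula-sunstone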
But sadly still no luck.
I also tested removing a host and then adding it again to see if it would register cleanly, but that didn’t seem to make any difference. The error console still gives the same error.
I then cleared all /var/tmp/one directories.
Then I forced a sync, but this gave an error on a few hosts. The hosts that didn’t give an error seemed to work afterwards.
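Concretely, those two steps were roughly the following (commands from memory, so double-check on your setup):
rm -rf /var/tmp/one/*    # on each node
onehost sync --force     # on the front end, as oneadmin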
So as a last step I removed all hosts and added them again; after this step everything was working.
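For the record, per host that was just a delete and a create with the same drivers as before (flags as I recall them for the 4.12 CLI):
onehost delete 9
onehost create sf01.**** --im kvm --vm kvm --net dummy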
So I guess the main problem was the /var/tmp/one directory?