Collectd-client.rb core dumped

Hi Friends

I have a question about adding a new host. When the master node connects to node1 over SSH, it always fails to execute run_probes. While debugging the script, I found that this line in “collectd-client.rb” throws the error:

data = `#{@run_probes_cmd} 2>&1`

A Google search for this issue returns 3 possible causes/solutions:
a) passwordless SSH issue
b) SSH_CLIENT issue
c) try running “onehost sync --force”

I have confirmed that the first 2 issues are resolved on my system, and the third one doesn’t help. All required files are valid, accessible and executable by oneadmin.

This thread looks similar to my issue. Unfortunately, it contains no valid solution:
http://users.opennebula.narkive.com/SEKiQERb/one-users-opennebula-node-ubuntu-14-04-saying-error-executing-collectd-client-rb-when-creating-a

The OpenNebula version is 5.2.0, on Ubuntu 16.04 with Ruby 2.3.

Checking the core dump file in /var/crash, it looks like the problem is in Ruby itself. I am not sure whether it’s a bug in Ruby 2.3.

I tried copying collectd-client.rb to test.rb and removing all lines except the line in question; it runs fine without error. Below is its content.

#!/usr/bin/env ruby

data = `./../run_probes kvm-probes /var/lib/one/datastores 4124 20 1 node1 2>&1`
code = $?.exitstatus == 0
puts "#{code}"

Does anyone have a solution, or can anyone shed some light on where the issue is? Is it Ruby?

Thanks in advance,
Tor.

Checking further in the core dump file, I found this error (among others):

Jan 10 21:16:59 node1 libvirtd[12096]: internal error: QEMU / QMP failed: Could not access KVM kernel module: No such file or directory

I guess the root cause is that I am running Ubuntu in VirtualBox. Although Ubuntu is KVM-ready, VirtualBox can’t run nested VMs inside the guest OS.

What do you think?

Even if the node doesn’t have virtualization extensions, the monitoring part should work. Please send us the part of /var/log/one/oned.log where monitoring fails.

Also, which process is creating the core? You can find out with:

$ file <core file>

Thanks for your response Javi.

I had the same thought: run_probes just collects statistics via collectd.
Today I set up another system in VMware but still had no luck; the crash file looks different this time.

Unfortunately, I can’t upload log files in this thread. Please find them on my shared drive:
https://drive.google.com/drive/folders/0B0EZTM0AkotHRU1Qem9vNDFaVDQ?usp=sharing

cheers!

The process that is crashing is ruby. It’s strange that it fails. Can you check that you have enough memory and that the system is up to date?

Thanks for your advice. I run 2 guest hosts on my machine; each has 2GB of memory and the system is up to date (using the “updater” tool). I monitored memory while collectd-client.rb crashed, and there was 1GB+ available. I don’t think it’s a computing resource issue.

I tried debugging again in VMware. The /var/tmp/one/im/run_probes kvm script runs all scripts in the kvm.d and kvm-probes.d folders. I ran each file manually and they all work fine. So the root cause would seem to sit in Ruby itself.
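For anyone repeating this check, here is a minimal sketch of running each probe by hand and recording its exit status. The helper name run_each is made up for illustration and is not part of OpenNebula; the arguments in the usage comment are the ones used elsewhere in this thread.

```shell
#!/usr/bin/env bash
# run_each is a hypothetical helper (not an OpenNebula script).
# It runs every executable regular file in a directory, one at a time,
# and prints the file name with its exit status.
run_each() {
    local dir="$1"; shift
    local f status
    for f in "$dir"/*; do
        # Skip directories and anything without the executable bit.
        [ -f "$f" ] && [ -x "$f" ] || continue
        # Run from inside the directory, as the probes expect relative paths.
        ( cd "$dir" && "./$(basename "$f")" "$@" ) >/dev/null 2>&1
        status=$?
        echo "$(basename "$f") exit=$status"
    done
}

# Usage (path and arguments as they appear in this thread):
# run_each /var/tmp/one/im/kvm.d /var/lib/one/datastores 4124 20 1 node1
```

In bash, an exit status above 128 means the process was killed by a signal (for example 139 for a segmentation fault), which is one way to spot which file is the one dumping core.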

I installed the ruby package (sudo apt-get install ruby). Are any additional Ruby packages required?
(The collectd-core package is installed.)

If I bypass collectd-client.rb and call run_probes kvm-probes manually, it works fine.
Why does collectd-client.rb fail to execute the run_probes command shown below? Or does it not support being called recursively?

oneadmin@node1:/var/tmp/one/im/kvm.d$ ls
collectd-client_control.sh collectd-client.rb
oneadmin@node1:/var/tmp/one/im/kvm.d$ ./collectd-client_control.sh /var/lib/one/datastores 4124 20 1 node1 2>&1
oneadmin@node1:/var/tmp/one/im/kvm.d$ ./../run_probes kvm-probes /var/lib/one/datastores 4124 20 1 node1 2>&1
ARCH=x86_64
MODELNAME="Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz"
HYPERVISOR=kvm
TOTALCPU=200
CPUSPEED=2599
TOTALMEMORY=2964420
USEDMEMORY=1117900
FREEMEMORY=1846520
FREECPU=198
USEDCPU=2
NETRX=0
NETTX=0
DS_LOCATION_USED_MB=5060
DS_LOCATION_TOTAL_MB=97814
DS_LOCATION_FREE_MB=87764
HOSTNAME=node1
VM_POLL=YES
VERSION="5.2.0"

I tried changing how collectd-client.rb executes that line, but had no luck:
from
data = `#{@run_probes_cmd} 2>&1`
to
data = `./../run_probes kvm-probes /var/lib/one/datastores 4124 20 1 node1 2>&1`

The issue is now solved after I changed /var/tmp/one/im/run_probes

from
if [ -x "$i" ]; then

to
if [[ (-x "$i") && ("$i" != "collectd-client.rb") ]]; then

The reason is that the run_probes script runs every file in the kvm.d folder. The first file it executes is collectd-client_control.sh, which starts collectd-client.rb as a background process and keeps its PID in /tmp/one-collectd-client.pid.

It looks to me like collectd-client.rb is already running in the background, so we don’t need to let the run_probes script execute collectd-client.rb again.
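To make the effect of the changed condition easier to see, here is a stripped-down sketch of a probe loop with the same test. It is not the real run_probes script, which does more than this; the function name run_probes_filtered is invented for illustration.

```shell
#!/usr/bin/env bash
# run_probes_filtered is a hypothetical stand-in for the loop in run_probes.
# It executes every executable file in the given directory except
# collectd-client.rb, passing any remaining arguments through.
run_probes_filtered() {
    local dir="$1"; shift
    (
        # Subshell so the caller's working directory is left untouched.
        cd "$dir" || exit 1
        for i in *; do
            # Same condition as the amended line: executable, and not the Ruby client.
            if [[ (-x "$i") && ("$i" != "collectd-client.rb") ]]; then
                "./$i" "$@"
            fi
        done
    )
}

# Usage (path and arguments as they appear in this thread):
# run_probes_filtered /var/tmp/one/im/kvm.d /var/lib/one/datastores 4124 20 1 node1
```

With this filter, collectd-client_control.sh is still run and remains the only thing that starts collectd-client.rb.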

I am not 100% sure this solution is the correct one. I will post any follow-on issues this change may cause.

create VM in that host … ok
host statistic … ok
vnc to VM … ok

Looks good!!

Hi Javi - Do you think it’s a bug?

I don’t really understand what could be happening. We’ve been using the same system to start collectd client without problems.

Can you check that kvm.d/collectd-client.rb is not executable? That may be the problem. Here are the files in kvm.d from the CentOS 7 packages.

[root@scw-ceab44 kvm.d]# ls -l
total 12
-rwxr-xr-x 1 oneadmin oneadmin 2901 Oct 17 11:09 collectd-client_control.sh
-rw-r--r-- 1 oneadmin oneadmin 4151 Oct 17 11:09 collectd-client.rb
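A quick way to make the same comparison on another node is to list only the files that carry the executable bit. This is a rough sketch; list_executables is a made-up helper name, and the chmod line in the comment is an untested suggestion rather than something confirmed in this thread.

```shell
#!/usr/bin/env bash
# list_executables is a hypothetical helper (not an OpenNebula tool).
# It prints the name of every regular file in a directory that the
# current user is allowed to execute.
list_executables() {
    local dir="$1" f
    for f in "$dir"/*; do
        if [ -f "$f" ] && [ -x "$f" ]; then
            basename "$f"
        fi
    done
}

# Usage (path as it appears in this thread):
# list_executables /var/tmp/one/im/kvm.d
#
# If collectd-client.rb shows up in the output, clearing the bit would be
# (assumption, not verified here):
# chmod a-x /var/tmp/one/im/kvm.d/collectd-client.rb
```

With the permissions shown in the listing above, only collectd-client_control.sh would be printed.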

Hi Javi

Yes, those files are executable

oneadmin@master:/var/tmp/one/im/kvm.d$ ls -al
total 20
drwxr-xr-x 2 oneadmin oneadmin 4096 Jan 17 21:00 .
drwxr-xr-x 7 oneadmin oneadmin 4096 Jan 17 21:00 ..
-rwxr-xr-x 1 oneadmin oneadmin 2901 Jan 17 21:00 collectd-client_control.sh
-rwxr-xr-x 1 oneadmin oneadmin 4151 Jan 17 21:00 collectd-client.rb

I don’t know the real root cause either. I agree that executing collectd-client.rb from run_probes should work fine, but unfortunately that is not the case for me. Below is a summary of how I fixed this issue:

  1. run_probes lists all files in the kvm.d folder
  2. run_probes executes collectd-client_control.sh
  3. collectd-client_control.sh runs collectd-client.rb as a background process and returns
  4. run_probes executes collectd-client.rb again

The change I made to the run_probes script removes step 4 above, and now my system runs totally fine.
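Since step 4 is now skipped, it may be worth confirming that step 3 really left a client running. A small sketch, assuming the PID file path mentioned earlier in the thread (/tmp/one-collectd-client.pid); the function name collectd_client_running is invented for illustration.

```shell
#!/usr/bin/env bash
# collectd_client_running is a hypothetical helper (not an OpenNebula tool).
# It returns 0 if the PID stored in the PID file belongs to a live process
# that the current user may signal, and non-zero otherwise.
collectd_client_running() {
    local pidfile="${1:-/tmp/one-collectd-client.pid}" pid
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric content.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # kill -0 sends no signal; it only checks that the process can be signalled.
    kill -0 "$pid" 2>/dev/null
}

# Usage (run as oneadmin on the node):
# collectd_client_running && echo "client is running" || echo "client is not running"
```

Run it as the same user that owns the client (oneadmin here), because kill -0 also fails when the process belongs to someone else.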

thanks for your help !!
Tor.