Collectd client connecting to 127.0.0.1:4124

Hello,

Long-time OpenStack user/admin who just converted to OpenNebula here (please be nice!).

I installed a “self-contained” box (Debian 9, ONE 5.4.6), and everything (well, almost) is working like a charm. When adding a second KVM node yesterday I noticed that the port collectd-client.rb is listening on kept changing on that node (it only changes at every restart on the first node), so I went ahead and investigated.

Turns out, if I launch the command manually:

/var/tmp/one/im/run_probes kvm /var/lib/one/datastores 4124 20 2 host2

it exits after a while with “Aborted.”. It looks like the continuous port change happens because the frontend node restarts it every time it decides to poll.
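For reference, this is how I read the arguments of that command based on the probe scripts (my own guess, so please correct me if I’m wrong):

# run_probes <hypervisor probes> <datastores path> <collectd port> <push interval> <host id> <hostname>
# i.e. in my case: kvm, /var/lib/one/datastores, UDP 4124, every 20 seconds, host id 2, host2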

As a next step I started a tcpdump on the first (frontend) node, where collectd is running, and nothing was coming in. Running the same on the second node (id 2), where nothing is listening on 4124, I can see the running instance of collectd-client.rb trying to send something to that port, but to 127.0.0.1.
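(For what it’s worth, the capture I ran on both boxes was roughly this:)

tcpdump -ni any udp port 4124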

I’m not a dev, but I spent some time digging through the code and really can’t figure out where the collectd host is passed as a variable. It might be a bug as well.

Is anybody facing the same issue? Any idea how to debug this?

Ref: https://github.com/OpenNebula/one/blob/master/src/im_mad/remotes/common.d/collectd-client.rb

Hi Giorgio

OpenNebula starts a collectd daemon on the front-end that listens on UDP port 4124. In the first monitoring cycle OpenNebula connects to the host over ssh and starts an agent that executes the probe scripts and sends the collected data to the collectd daemon on the front-end every few seconds. This way the monitoring subsystem doesn’t need to open new ssh connections to receive data. If the agent stops on a specific host, OpenNebula detects that no monitoring data is being received from that host and restarts the probe over ssh.
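If you want to double-check that split (this assumes a standard installation, so adjust the names if needed), something like this should show the listener only on the front-end, and only the push agent on the node:

# on the front-end: the collectd daemon should be bound to UDP 4124
ss -ulpn | grep 4124

# on the compute node: only the probe agent process, and nothing bound to 4124
pgrep -af collectd-client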

There might be two issues here. First, as you said, the collectd daemon might be failing and restarting itself on another port because port 4124 is in use. If that is the case, it will restart the daemon inside the node that sends the monitoring data, passing the new port as a parameter (it is the argument right after /var/lib/one/datastores on the line you ran manually).
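To rule out that first case, you can check on the front-end what is currently holding UDP 4124, for example with lsof (it may need to be installed separately on Debian):

lsof -nP -i UDP:4124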

Second, there might be an issue with ssh, or the daemon inside the node is not executing properly.
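For the ssh part, a quick passwordless-login test from the front-end as the oneadmin user should succeed without any prompt, something like:

su - oneadmin -c 'ssh -o BatchMode=yes host2 true && echo ssh OK'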

Could you attach your oned.log and the output from journalctl --unit=opennebula?

Hey Sergio, thanks for getting back.

I’m not sure about one thing: shouldn’t collectd be listening on port 4124 on the frontend node, with the other nodes connecting to that listening port on the frontend (and not locally)? That was my understanding based on: https://docs.opennebula.org/5.2/deployment/open_cloud_host_setup/monitoring.html

(i.e.: the frontend connects via ssh to host2, gets data and starts the Ruby daemon; from that point onward that Ruby daemon sends metrics to the frontend on UDP 4124)

oned.log shows no failures, but I guess that’s because it connects back to host2 every time it polls:

Mon May 21 23:38:36 2018 [Z0][InM][D]: Monitoring host host2 (2)
Mon May 21 23:38:39 2018 [Z0][InM][D]: Host host2 (2) successfully monitored.
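(pulled with something like the following, and there is nothing else about host2 in there:)

grep host2 /var/log/one/oned.log | tail -n 20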

Exactly, that is what it should do. But the log fragment you just sent makes it seem like everything is OK. If it were trying to connect back to host2 it should fail, for two reasons: first, host2 doesn’t have a collectd server running (and shouldn’t have one); and second, that log belongs to the oned daemon, and if oned failed to receive an answer it shouldn’t write that it did.

I assume the problem is that you can see monitoring information from host1 in your Sunstone but not from host2, right? Could you check the time on your frontend and on host1 and host2?
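A quick way to compare them from the front-end (assuming oneadmin can ssh into both nodes) would be something like:

date -u; for h in host1 host2; do ssh "$h" date -u; done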

Time is OK, UTC and synced via NTP. What concerns me is that host2 is not even trying to connect to collectd on the frontend: tcpdump shows packets going to 127.0.0.1:4124.

This is the main reason I’m treating this as a configuration issue somewhere rather than a broken service (in any case, I double-checked that communication between the nodes is OK and that host2 can reach the frontend on that port).
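(the reachability check was basically firing a UDP packet from host2 while watching on the frontend, along these lines:)

# on the frontend
tcpdump -ni any udp port 4124
# on host2
echo ping | nc -u -w1 frontend 4124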

(a side note: I’m not sure why, for this to work, I need a listening socket on both the receiver and the sender, but this setup is working on host1/frontend, so it’s not the problem here)

Super weird, isn’t it?

Hi Giorgio, actually there is only one listening socket, and it is located on the frontend. Compute nodes don’t have any service listening on UDP port 4124. Only one package should be installed on the compute nodes, opennebula-node-kvm, and there is nothing in it that listens on UDP port 4124. This is weird :slight_smile:

Still, even if you have more services running inside the node, that shouldn’t be a problem. What I think is happening is that some extra component installed on that node is starting the Ruby monitoring agent (the one that sends data to the frontend) before the actual frontend is able to start it, and it is sending data to itself, to its own collectd instance.
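To confirm that, could you check who started the agent on host2 and with which arguments? Something like this should show its full command line (note the hostname and port it was given) together with its parent PID:

ps -ef | grep '[c]ollectd-client'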

Also, could you please check the hostnames? You are not using IP addresses, so you should have either a DNS server or the hostnames configured in /etc/hosts. Could you check this for the frontend and the two nodes?
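For the hostname part, running something like this on the front-end and on both nodes should be enough to see whether every machine resolves the others consistently:

hostname -f
getent hosts frontend host1 host2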

Sorry - I didn’t mean there is something listening on 4124. This is what I see:

frontend:
collectd listening on UDP *:4124
/var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one/datastores 4124 20 0 frontend listening on UDP *:50837 (port changes at every *explicit* restart)

host2:
/var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one/datastores 4124 20 2 host2 listening on UDP *:60832 (it changes every 30 seconds)

So the additional question is: if that .rb is the script that collects data and sends it to collectd, why is it itself listening on another port?

DNS is OK, and there is nothing else installed on host2: the .rb doesn’t start until the frontend connects to it.