Tenured OpenStack user/admin, just converted to OpenNebula here (please be nice!).
I installed a “self-contained” box (Debian 9, ONE 5.4.6), and everything (well, almost) is working like a charm. When adding a second KVM node yesterday I noticed that the port collectd-client.rb is listening on kept changing on this one (it only changes at every restart on the first node), so I went ahead and investigated.
It exits after a while with “Aborted.”. Looks like the continuous port change happens because the frontend node restarts it every time it decides to poll.
As a next step I started a tcpdump on the first (frontend) node, where collectd is running, and nothing was coming in. Running the same on the second node (id 2), where nothing is listening on 4124, I can see the running instance of collectd-client.rb trying to send something to that port, but to 127.0.0.1.
I’m not a dev, but I spent some time digging through the code and really can’t figure out where the collectd host is passed in as a variable. Might be a bug as well.
Is anybody facing the same issue? Any ideas on how to debug this?
OpenNebula starts a collectd daemon on the Front-end that listens for UDP connections on port 4124. In the first monitoring cycle OpenNebula connects to the host over ssh and starts a daemon that executes the probe scripts and sends the collected data to the collectd daemon on the Front-end at a configured interval. This way the monitoring subsystem doesn’t need to open new ssh connections to receive data. If the agent stops on a specific host, OpenNebula will detect that no monitoring data is being received from that host and will restart the probe over SSH.
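To make that flow a bit more concrete, here is a rough sketch of the push mechanism (this is not the real collectd-client.rb, just an illustration with made-up argument names; note how a missing frontend address would silently end up on loopback):

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of a push-style probe, NOT the actual collectd-client.rb.
require 'socket'

collectd_host = ARGV[0] || '127.0.0.1'   # frontend address (illustrative fallback only)
collectd_port = (ARGV[1] || 4124).to_i   # UDP port the frontend collectd listens on
interval      = (ARGV[2] || 20).to_i     # seconds between pushes

socket = UDPSocket.new

loop do
  # In the real client this payload would be the output of the probe scripts.
  payload = "HOSTNAME=#{Socket.gethostname} TIMESTAMP=#{Time.now.to_i}"
  socket.send(payload, 0, collectd_host, collectd_port)
  sleep interval
end
```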
There might be two issues here. First, as you said, the collectd daemon might be failing and restarting itself on another port because port 4124 is in use. If that is the case, it should restart the daemon inside the node (the one that sends the monitoring data) with the new port as a parameter (it is specified right next to /var/lib/one/datastores on the command line you ran manually).
Second, there might be an issue with ssh, or the daemon inside the node is not executing properly.
Could you attach your oned.log and the output from journalctl --unit=opennebula?
(i.e. the frontend connects over ssh to host2, gets the data and starts the ruby daemon; from that point onward that ruby daemon will be sending metrics to the frontend on UDP 4124)
oned.log shows no failures, but I guess that’s because every time it connects back to host2:
Exactly, that is what it should do. But the log fragment you just sent makes it seem like everything is OK. If it were trying to connect back to host2 it should fail, for two reasons: first, host2 doesn’t have a collectd server running (and shouldn’t have); and second, that log belongs to the oned daemon, and if oned failed to receive an answer it shouldn’t write that it did.
I assume the problem is that you can see monitoring information from host1 in your Sunstone but not from host2, right? Could you check the time on your frontend and on host1 and host2?
Time is OK, UTC and synced via NTP. What concerns me is that host2 is not even trying to connect to collectd on the frontend - tcpdump shows the packets going to 127.0.0.1:4124.
This is the main reason I’m treating it as a configuration issue somewhere rather than a broken service (in any case, I double-checked that communication between the nodes is fine and host2 can reach the frontend on that port - see the quick test below).
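For the record, the reachability check was nothing fancy, roughly a one-off UDP datagram like this sent from host2 while running tcpdump on the frontend (“frontend” here is just a placeholder for its actual name/IP):

```ruby
require 'socket'
# One-off test datagram from host2 to the frontend's collectd port;
# if it shows up in tcpdump on the frontend, the network path itself is fine.
UDPSocket.new.send("test from host2", 0, "frontend", 4124)
```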
(A side note: I’m not sure why this needs a listening socket on both the sender and the receiver, but that is how it looks on host1/frontend, so it’s not the problem here.)
Hi giorgio, actually there is only one listening socket, and it is located on the frontend. Compute nodes don’t have any service listening on UDP port 4124. Only one package should be installed on the compute nodes, opennebula-node-kvm, and there is nothing in it that listens on UDP port 4124. This is weird.
Still, even if you have extra services running inside the node, that shouldn’t be a problem by itself. What I think is happening is that whatever extra component is installed on that node is starting the ruby monitoring daemon (the one that should send data to the frontend) before the actual frontend is able to start it, so it ends up sending data to itself, to its own collectd instance.
Also, could you please check the hostnames? You are not using IP addresses, so you should have either a DNS server or the hostnames configured in /etc/hosts. Could you check this on the frontend and on the two nodes?
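For example (addresses here are just placeholders), every machine should resolve all three names consistently, something like this in /etc/hosts:

```
10.0.0.10  frontend
10.0.0.11  host1
10.0.0.12  host2
```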