Cannot persuade the Hook Manager to do anything

We are trying to create a hook that will add or delete IP addresses assigned by Open Nebula to a local authoritative DNS server but I cannot get the Hook Manager to actually show any life.

The Hook manager itself is running:

$ ps -ef|grep -i hm
oneadmin  3232 31092  0 16:41 pts/3    00:00:00 grep --color=auto -i hm
oneadmin 25604 25574  0 15:37 ?        00:00:00 ruby /usr/lib/one/mads/one_hm.rb -p 2101 -l 2102 -b 127.0.0.1
$

There are two hooks defined:

$ onehook list
  ID NAME                                                                                                                                TYPE
  11 hook-vm                                                                                                                             state
  12 add_ip_to_dns                                                                                                                api
$

The “hook-vm” is taken from the example provided in the documentation, the second one is a simple one that uses “/usr/bin/echo” as the command.

I create a VM:

$ onevm create --cpu 1 --memory 2 --name test
ID: 77
$

I wait for the VM to get to a RUNNING state, then I check the hooks, which are supposed to have executed:

$ onehook show 11
HOOK 11 INFORMATION
ID                : 11
NAME              : hook-vm
TYPE              : state
LOCK              : None

HOOK TEMPLATE
ARGUMENTS="$TEMPLATE pending"
COMMAND="vm-pending.rb"
LCM_STATE="LCM_INIT"
REMOTE="NO"
RESOURCE="VM"
STATE="PENDING"

-bash-4.2$ onehook show 12
HOOK 12 INFORMATION
ID                : 12
NAME              : add_ip_to_dns
TYPE              : api
LOCK              : None

HOOK TEMPLATE
ARGUMENTS="$API"
ARGUMENTS_STDIN="yes"
CALL="one.vm.action"
COMMAND="/usr/bin/echo"
REMOTE="NO"

$

As you can see there are no log entries. I looked for any logs, but there are none:

$ onehook log
$ echo $?
0
$

According to the documentation the HEM should log to /var/log/one/onehem.log but that doesn’t exist. Note that the log directory is owned by the oneadmin user:

$ ls -l /var/log/one/onehem.log
ls: cannot access /var/log/one/onehem.log: No such file or directory
$ ls -ald /var/log/one
drwxr-x---. 2 oneadmin oneadmin 12288 Oct 12 16:51 /var/log/one
$

There are lots of other log files in there as well so I’m guessing it’s not a permissions error.

I can see events appearing in the oned.log file:

Tue Oct 12 16:51:23 2021 [Z0][ReM][D]: Req:6080 UID:0 IP:127.0.0.1 one.vm.allocate invoked , "NAME="test"
Tue Oct 12 16:51:23 2021 [Z0][ReM][D]: Req:6080 UID:0 one.vm.allocate result SUCCESS, 77
Tue Oct 12 16:51:25 2021 [Z0][ReM][D]: Req:9264 UID:0 IP:127.0.0.1 one.vm.deploy invoked , 77, 6, false, 0, ""
Tue Oct 12 16:51:25 2021 [Z0][ReM][D]: Req:9264 UID:0 one.vm.deploy result SUCCESS, 77

So, I am at a loss to diagnose this, I would appreciate your thoughts.


Versions
Open Nebula 6.0.0.2 running on CentOS 7

Steps to reproduce:
Simply create a hook template and then create a VM, which should create a log entry.

Current results:
The hook is not executed.

Expected results:
Hook 11 should produce an error because the script does not exist.
Hook 12 should echo the arguments passed to it.

According to the documentation the HEM should log to /var/log/one/onehem.log but that doesn’t exist.

I think that the first step should be to figure out does the hem service working?

Hi Anton,

I concur and I can see that there is something listening on the port that is configured in the /etc/one/onehem-server.conf file

# grep -i endpoint /etc/one/onehem-server.conf
#   - Filesystem: ~/.one/one_endpoint and ~/.one/one_auth
# :subscriber_endpoint to subscribe for OpenNebula events must match those in
:subscriber_endpoint: 'tcp://localhost:2101'
# :replier_endpoint to send to oned hook execution results (reply to events) must
:replier_endpoint: 'tcp://localhost:2102'
# ss -ln|egrep '210[12]'
tcp    LISTEN     0      100    127.0.0.1:2101                  *:*
tcp    LISTEN     0      100    127.0.0.1:2102                  *:*
#

I’ll checked who was listening on that port, it’s:

oneadmin 25604 25574  0 Oct12 ?        00:00:08 ruby /usr/lib/one/mads/one_hm.rb -p 2101 -l 2102 -b 127.0.0.1

I looked at what the process had open but there doesn’t seem to be any log file so if it’s logging then it doesn’t hold it open all the time, or is logging in a different manner. There is no reference to HEM in /var/log messages so log messages are not going there.

One thing that is not clear to me is if one_hm is the Hook Event Manager as well as the Hook Manager. I looked in the /usr/lib/one/mads subdir but there is nothing with ‘hem’ in the name, only the one_hm.rb one so for now my assumption is that one_hm is also the event manager, but I have no proof of that.

If there is a separate process for the HEM then I can find no log errors saying that it’s failing to start.

Strange. Could you check is there an onehem-server.rb process running?

There definitely is not, which is a part of my confusion. I cannot see what starts it nor where any errors complaining about failing to start it might go.

I’ve checked the service status and there’s no errors list, nor is there any process with “hem” in the name:

# systemctl status opennebula
● opennebula.service - OpenNebula Cloud Controller Daemon
   Loaded: loaded (/etc/systemd/system/opennebula.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-10-12 15:38:00 BST; 20h ago
  Process: 25565 ExecStopPost=/usr/share/one/follower_cleanup (code=exited, status=0/SUCCESS)
  Process: 25569 ExecStartPre=/usr/sbin/logrotate -f /etc/logrotate.d/opennebula -s /var/lib/one/.logrotate.status (code=exited, status=0/SUCCESS)
 Main PID: 25574 (oned)
    Tasks: 112
   CGroup: /system.slice/opennebula.service
           ├─25574 /usr/bin/oned -f
           ├─25604 ruby /usr/lib/one/mads/one_hm.rb -p 2101 -l 2102 -b 127.0.0.1
           ├─25629 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 15 -r 0 firecracker
           ├─25648 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 15 -r 0 kvm
           ├─25666 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 15 -r 0 lxc
           ├─25684 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 15 -r 0 lxd
           ├─25702 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 15 -r 0 kvm
           ├─25721 ruby /usr/lib/one/mads/one_vmm_exec.rb -l deploy,shutdown,reboot,cancel,save,restore,migrate,poll,pre,post,clean,update_sg,sna...
           ├─25741 ruby /usr/lib/one/mads/one_tm.rb -t 15 -d dummy,lvm,shared,fs_lvm,qcow2,ssh,ceph,dev,vcenter,iscsi_libvirt
           ├─25765 ruby /usr/lib/one/mads/one_auth_mad.rb --authn ssh,x509,ldap,server_cipher,server_x509
           ├─25781 ruby /usr/lib/one/mads/one_datastore.rb -t 15 -d dummy,fs,lvm,ceph,dev,iscsi_libvirt,vcenter -s shared,ssh,ceph,fs_lvm,qcow2,v...
           ├─25800 ruby /usr/lib/one/mads/one_market.rb -t 15 -m http,s3,one,linuxcontainers,turnkeylinux,dockerhub
           ├─25818 ruby /usr/lib/one/mads/one_ipam.rb -t 1 -i dummy,aws,packet
           ├─25834 /usr/lib/one/mads/onemonitord -c monitord.conf
           ├─25853 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 -w 90 firecracker
           ├─25867 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 -w 90 kvm
           ├─25881 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 -w 90 lxc
           ├─25895 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 -w 90 lxd
           ├─25909 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 15 -w 90 qemu
           ├─25924 ruby /usr/lib/one/mads/one_im_exec.rb -l -c -t 15 -r 0 vcenter
           └─25933 ruby /var/lib/one/remotes/im/lib/vcenter_monitor.rb

Oct 12 15:37:49 on1.paris.lab systemd[1]: Starting OpenNebula Cloud Controller Daemon...
Oct 12 15:38:00 on1.paris.lab systemd[1]: Started OpenNebula Cloud Controller Daemon.
#

I even checked the journalctl output but there’s no errors listed there at all >:-(

OK, I found the daemon and got it started, however I still don’t understand why it wasn’t started on boot-up.

I found the daemon listed in the /var/lib/one/remotes/hooks/raft/vip.sh script so ran it:

systemctl start opennebula-hem

and it now seems to be working, which is great news, but I am concerned that it wasn’t set to start up on boot. I will set it to do so now but don’t have any idea what else that might break.

Could you check what is “wanted” by the opennebula service?

systemctl cat opennebula.service
#also
systemctl cat opennebula-hem.service
# systemctl cat opennebula.service
# /etc/systemd/system/opennebula.service
[Unit]
Description=OpenNebula Cloud Controller Daemon
After=syslog.target network.target remote-fs.target
After=mariadb.service mysql.service
Wants=opennebula-scheduler.service

[Service]
Type=notify
Group=oneadmin
User=oneadmin
#ExecStartPre=-/usr/bin/scl enable rh-ruby25 -- /usr/sbin/logrotate -f /etc/logrotate.d/opennebula -s /var/lib/one/.logrotate.status
#ExecStart=/usr/bin/scl enable rh-ruby25 -- /usr/bin/oned -f
#ExecStopPost=/usr/bin/scl enable rh-ruby25 -- /usr/share/one/follower_cleanup
ExecStartPre=-/usr/sbin/logrotate -f /etc/logrotate.d/opennebula -s /var/lib/one/.logrotate.status
ExecStart=/usr/bin/oned -f
ExecStopPost=/usr/share/one/follower_cleanup
PIDFile=/var/lock/one/one
StartLimitInterval=300
StartLimitBurst=3
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
#

and…

# systemctl cat opennebula-hem.service
# /usr/lib/systemd/system/opennebula-hem.service
[Unit]
Description=OpenNebula Hook Execution Service
After=syslog.target network.target
After=opennebula.service

[Service]
Type=simple
Group=oneadmin
User=oneadmin
EnvironmentFile=-/var/run/one/ssh-agent.env
ExecStartPre=-/usr/sbin/logrotate -f /etc/logrotate.d/opennebula-hem -s /var/lib/one/.logrotate.status
ExecStart=/usr/bin/ruby /usr/lib/one/onehem/onehem-server.rb
StartLimitInterval=60
StartLimitBurst=3
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
#

the opennebula-scheduler doesn’t have any ‘wants’ so the opennebula-hem service is effectively standalone, it just runs after the main service is started.

What are the chances for these service files to be from another opennebula version?

You could compare them against one/share/pkgs/services/systemd at one-6.0 · OpenNebula/one · GitHub

My first thought was “Definitely not!” given that it was supposed to be a fresh install, however I did compare them and the ones I have installed just to be sure and the two opennebula-hem.service instances are identical.

But opennebula.service definitely differ. In the repo:

[Unit]
Description=OpenNebula Cloud Controller Daemon
After=syslog.target network.target remote-fs.target
After=mariadb.service mysql.service
After=opennebula-ssh-agent.service
Wants=opennebula-scheduler.service opennebula-hem.service
Wants=opennebula-ssh-agent.service
Wants=opennebula-ssh-socks-cleaner.timer
Wants=opennebula-showback.timer

And you pasted:

# systemctl cat opennebula.service
# /etc/systemd/system/opennebula.service
[Unit]
Description=OpenNebula Cloud Controller Daemon
After=syslog.target network.target remote-fs.target
After=mariadb.service mysql.service
Wants=opennebula-scheduler.service

Fast forward a few hours and I have rebooted the server multiple times and the onehem-server.rb daemon starts up just fine.

The oned daemon had an issue that turned out to be DNS timing out because it couldn’t resolve anything. That was annoying since oned did not report any kind of error in is log file.

Anyway, it’s up and running now.

I cannot explain the differences between the two opennebula.service files, though oned is starting up so whatever I have is working.

Thanks for your help on this [atodorov_storpool], it has been much appreciated, now I need to work out the best way of translating the API arguments since I am not a ruby expert, nor even a newbie :slight_smile: