onemonitord segfault on 6.0, Debian 10

Hello,

I recently upgraded ONE from 5.10 to 6.0 and moved the whole ONE installation to a separate virtual machine.

Everything worked smoothly until today’s restart, when OpenNebula refused to start without giving any useful information. The last two lines below are everything in the logs related to the problem:

Sun Feb 13 22:42:02 2022 [Z0][IPM][I]: Starting IPAM Manager...
Sun Feb 13 22:42:02 2022 [Z0][Lis][I]: IPAM Manager started.
Sun Feb 13 22:42:05 2022 [Z0][InM][I]: Starting Information Manager...
Sun Feb 13 22:42:05 2022 [Z0][DrM][E]: Unable to start driver 'monitord': Driver initialization failed

Sun Feb 13 22:42:05 2022 [Z0][InM][E]: Error starting Information Manager: Driver initialization failed

When I ran ONE with oned -f as the oneadmin user, I got a segfault shortly after those lines appeared in the log.

After a while and many attempts to dig deeper, I found that the problematic component is onemonitord, which segfaults on start.

Here are the last lines from strace /usr/lib/one/mads/onemonitord --config /etc/one/monitord.conf --oned-config /etc/one/oned.conf:

getrandom("\x3e", 1, GRND_NONBLOCK)     = 1
stat("/etc/gnutls/default-priorities", 0x7ffc957b0240) = -1 ENOENT (No such file or directory)
futex(0x7f621d87007c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7f621d870088, FUTEX_WAKE_PRIVATE, 2147483647) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV (core dumped) +++
[1]    15215 segmentation fault (core dumped)  strace /usr/lib/one/mads/onemonitord --config /etc/one/monitord.conf

And here is the output from coredumpctl, probably not useful without debugging symbols:

❯ coredumpctl info
           PID: 15450 (onemonitord)
           UID: 9869 (oneadmin)
           GID: 9869 (oneadmin)
        Signal: 11 (SEGV)
     Timestamp: Sun 2022-02-13 23:38:32 CET (4s ago)
  Command Line: /usr/lib/one/mads/onemonitord --config /etc/one/monitord.conf --oned-config /etc/one/oned.conf
    Executable: /usr/lib/one/mads/onemonitord
 Control Group: /system.slice/ssh.service
          Unit: ssh.service
         Slice: system.slice
       Boot ID: cade14f3afe640dfa31ffca7d45344b1
    Machine ID: 70c7672c5ce241f483e52d720c3f6158
      Hostname: urc-a
       Storage: /var/lib/systemd/coredump/core.onemonitord.9869.cade14f3afe640dfa31ffca7d45344b1.15450.1644791912000000.lz4
       Message: Process 15450 (onemonitord) of user 9869 dumped core.

                Stack trace of thread 15450:
                #0  0x00007f544f6e0206 n/a (libc.so.6)
                #1  0x00007f544f941e44 _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE6assignEPKc (libstdc++.so.6)
                #2  0x0000557ead4dba68 n/a (onemonitord)
                #3  0x00007f544f66c09b __libc_start_main (libc.so.6)
                #4  0x0000557ead4de80a n/a (onemonitord)

After this discovery, I commented lines regarding monitord IM_MAD in oned.conf and ONE now at least started, but without monitord of course.

Any idea is welcomed, I’m kind of stuck and don’t know what else to try. I even tried to reinstall all opennebula packages from distro, but no improvement.

Thanks

It seems there is something wrong in the monitord.conf or oned.conf file. Can you please share both files? Make sure to delete the DB user and password from oned.conf before sending.

Alternatively, you can install opennebula-dbgsym (on Debian/Ubuntu) or opennebula-debuginfo (on AlmaLinux/RHEL) and reproduce the segfault; coredumpctl should then give more useful info.

It has been a long time since I debugged anything. Here is the bt full output; let me know if I can provide anything more.

Reading symbols from /usr/lib/one/mads/onemonitord...Reading symbols from /usr/lib/debug/.build-id/ff/fa8bf482b581e17b67f88534ebabbb1dcfdb31.debug...done.
done.
[New LWP 4787]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/one/mads/onemonitord --config /etc/one/monitord.conf --oned-config /et'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
120     ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
#1  0x00007f106e028e44 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00005573ee7f4a68 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__s=<optimized out>, this=0x7ffd26701d90) at /usr/include/c++/8/bits/basic_string.h:703
#3  main (argc=<optimized out>, argv=<optimized out>) at src/monitor/src/monitor/onemonitord.cc:124
(gdb) bt full
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
No locals.
#1  0x00007f106e028e44 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#2  0x00005573ee7f4a68 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__s=<optimized out>, this=0x7ffd26701d90) at /usr/include/c++/8/bits/basic_string.h:703
No locals.
#3  main (argc=<optimized out>, argv=<optimized out>) at src/monitor/src/monitor/onemonitord.cc:124
        argv_1 = "--config"
        _argv = std::vector of length 1, capacity 1 = {"--config"}
        _argv_c = 0x5573f021a030
        opt = <optimized out>
        _argc = 2
        long_options = {{name = 0x5573ee85cd39 "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0x5573ee858230 "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x5573ee85823a "config", has_arg = 0, flag = 0x0, val = 99}, {name = 0x5573ee858235 "oned-config", has_arg = 0, flag = 0x0, val = 111}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        long_index = 2
        config = "monitord.conf"
        oned_config = "oned.conf"

Configs attached

oned.conf (54.9 KB)
monitord.conf (8.2 KB)

There is a bug in the parsing of onemonitord arguments, but it should appear only if you start onemonitord manually from the command line. You are using the default arguments, so you can simply run onemonitord without any arguments.

Then you should see in /var/log/one/monitor.log a reasonable error message.

I haven’t seen anything odd in your config files, and I was able to start OpenNebula with them.
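
The argument-parsing bug mentioned above matches what the backtrace shows: every entry in long_options has has_arg = 0, so getopt_long leaves optarg as NULL for --config, and assigning that NULL pointer to a std::string crashes inside strlen. Below is a minimal sketch of that mechanism, not the actual onemonitord source. The helper name parse_config and the empty short-option string are assumptions made for illustration, and the unsafe assignment is replaced by a NULL check so the sketch can run without crashing.

```cpp
#include <cassert>
#include <getopt.h>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not OpenNebula code): scans args for --config and
// returns {whether getopt_long supplied a value, resulting config path}.
// has_arg is no_argument (0) or required_argument (1).
std::pair<bool, std::string> parse_config(std::vector<std::string> args, int has_arg)
{
    assert(!args.empty()); // args[0] is the program name

    // Build a mutable, NULL-terminated argv for getopt_long.
    std::vector<char*> argv;
    for (std::string& a : args)
    {
        argv.push_back(&a[0]);
    }
    argv.push_back(nullptr);

    const struct option long_options[] = {
        {"config", has_arg, nullptr, 'c'},
        {nullptr, 0, nullptr, 0}
    };

    std::string config = "monitord.conf"; // default value seen in the backtrace
    bool got_value = false;

    optind = 0; // restart the scan so the helper can be called repeatedly (glibc/musl)

    int opt;
    while ((opt = getopt_long(static_cast<int>(args.size()), argv.data(),
                              "", long_options, nullptr)) != -1)
    {
        if (opt == 'c' && optarg != nullptr)
        {
            // Safe only because optarg was checked. With has_arg == 0,
            // optarg is NULL and "config = optarg" would call strlen(NULL).
            config = optarg;
            got_value = true;
        }
    }

    return {got_value, config};
}
```

Called with {"onemonitord", "--config", "/etc/one/monitord.conf"} and required_argument, the helper returns {true, "/etc/one/monitord.conf"}. With no_argument, as in the backtrace, optarg stays NULL, the path is treated as a plain non-option argument, and the default is kept.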

Thanks for the fast reply, the immediate problem is resolved.

 /usr/lib/one/mads/onemonitord
Could not open connect to database server: User oneadmin already has more than 'max_user_connections' active connections

This is expected: I didn’t realize that monitord uses its own connections, separate from oned’s, and I had wanted to give oned more resources just because I could… I lowered the connections in oned, uncommented the monitord section, and everything is working fine.
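
For anyone sizing this, the reasoning is simple arithmetic: both daemons connect as the same DB user, so their connection pools add up against max_user_connections. A rough sketch follows; the function name and the numbers in the usage note are hypothetical, and it ignores any other clients that use the same DB user.

```cpp
// Hypothetical check: do the oned and monitord connection pools together
// fit under the per-user limit of the database server?
// In MySQL/MariaDB a max_user_connections value of 0 means "no limit".
bool fits_user_limit(unsigned oned_conns, unsigned monitord_conns,
                     unsigned max_user_connections)
{
    if (max_user_connections == 0)
    {
        return true; // unlimited
    }

    return oned_conns + monitord_conns <= max_user_connections;
}
```

For example, with a limit of 50, giving oned all 50 connections leaves nothing for monitord, so fits_user_limit(50, 15, 50) is false, while fits_user_limit(30, 15, 50) is true.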

Btw, I added the arguments because it wasn’t working without them on my first attempts, probably because oned wasn’t running then.

"Then you should see in /var/log/one/monitor.log a reasonable error message."

Unfortunately, that remains a problem: at least on 6.0, there is no message in monitor.log at all. It is completely empty in this case, and oned segfaults at start as described in the original post.

For me, this is resolved, but it would be great to look into why it isn’t logging, as that could save others from hitting similar problems.

Thanks again for your fast help!