opennebula.service unexpectedly dies

Hey,

I have configured OpenNebula 5.4 as a three-node HA cluster with the new Raft hook. It works perfectly! The problem now is that opennebula.service unexpectedly dies after about 24 hours on all three nodes.

Jul 26 22:09:22 sun01 systemd: opennebula.service: main process exited, code=killed, status=6/ABRT
Jul 26 22:09:22 sun01 kill: Usage:
Jul 26 22:09:22 sun01 kill: kill [options] <pid|name> [...]
Jul 26 22:09:22 sun01 kill: Options:
Jul 26 22:09:22 sun01 kill: -a, --all              do not restrict the name-to-pid conversion to processes
Jul 26 22:09:22 sun01 kill: with the same uid as the present process
Jul 26 22:09:22 sun01 kill: -s, --signal <sig>     send specified signal
Jul 26 22:09:22 sun01 kill: -q, --queue <sig>      use sigqueue(2) rather than kill(2)
Jul 26 22:09:22 sun01 kill: -p, --pid              print pids without signaling them
Jul 26 22:09:22 sun01 kill: -l, --list [=<signal>] list signal names, or convert one to a name
Jul 26 22:09:22 sun01 kill: -L, --table            list signal names and numbers
Jul 26 22:09:22 sun01 kill: -h, --help     display this help and exit
Jul 26 22:09:22 sun01 kill: -V, --version  output version information and exit
Jul 26 22:09:22 sun01 kill: For more details see kill(1).
Jul 26 22:09:22 sun01 systemd: opennebula.service: control process exited, code=exited status=1
Jul 26 22:09:22 sun01 systemd: Stopping OpenNebula Cloud Scheduler Daemon...
Jul 26 22:09:22 sun01 systemd: Stopped OpenNebula Cloud Scheduler Daemon.
Jul 26 22:09:22 sun01 systemd: Stopped OpenNebula Cloud Controller Daemon.
Jul 26 22:09:22 sun01 systemd: Unit opennebula.service entered failed state.

I am using the latest CentOS 7.

I have checked whether some other process might be causing this, but I cannot find anything. There are not even other users who might kill the process.

Is anyone else experiencing this problem? Do you have any advice on how to debug it?

Thank you.

It would be great if you could enable core dump output in systemd. Maybe you already have (e.g. check with coredumpctl list oned).
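
Before collecting dumps it can help to see where the kernel currently sends them. The sketch below only reads the current settings. The function name show_core_config is made up for illustration, and the LimitCORE=infinity drop-in mentioned in the comments is an assumption about what the unit might need, not something confirmed in this thread.

```shell
#!/bin/sh
# Print the settings that decide whether and where a core file is written.
show_core_config() {
    # Kernel pattern: a path template, or "|program" when dumps are piped
    # to a handler such as systemd-coredump or abrt.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        printf 'core_pattern: %s\n' "$(cat /proc/sys/kernel/core_pattern)"
    else
        printf 'core_pattern: unavailable\n'
    fi
    # Soft core size limit of this shell; 0 disables core files.
    printf 'ulimit -c: %s\n' "$(ulimit -c)"
}

# Assumption: a service started by systemd takes its limit from the unit,
# so a drop-in containing
#   [Service]
#   LimitCORE=infinity
# may be needed for opennebula.service.
```

Calling show_core_config in a root shell on each node gives a quick view of whether the three nodes are configured the same way.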

Thanks ruben. I am on it. Will get back to you with the dumps…

Here you have the first dump:

http://mirror.23media.de/dumps/core-oned-sig6-user9869-group9869-pid7964-time1501153443

and here comes another one from another node:

http://mirror.23media.de/dumps/core-oned-sig6-user9869-group9869-pid2789-time1501160538

Yesterday at 18:47 the leader dumped core, and shortly afterwards so did the follower with the most votes (the new leader).

18:47: http://mirror.23media.de/dumps/core-oned-sig6-user9869-group9869-pid2790-time1501174060
19:05: http://mirror.23media.de/dumps/core-oned-sig6-user9869-group9869-pid23206-time1501175102

If you need more information, just get back to me. I will work around the problem with Restart=on-abnormal in the systemd unit until we have a fix.
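
As a stopgap, the restart policy can go into a drop-in rather than into the packaged unit file. This is a minimal sketch: the drop-in file name restart.conf and the RestartSec value are assumptions, and print_restart_dropin is a hypothetical helper that only prints the fragment.

```shell
#!/bin/sh
# Print a systemd drop-in for the restart policy. With on-abnormal the
# service is restarted after termination by a signal (including a core
# dump), an operation timeout or a watchdog timeout; a plain exit code
# does not trigger a restart.
print_restart_dropin() {
    cat <<'EOF'
[Service]
Restart=on-abnormal
RestartSec=5
EOF
}

# Possible use, as root (paths are assumptions):
#   mkdir -p /etc/systemd/system/opennebula.service.d
#   print_restart_dropin > /etc/systemd/system/opennebula.service.d/restart.conf
#   systemctl daemon-reload
```

A drop-in survives package updates, so the workaround is easy to remove again once a fixed version is installed.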

The program stops because of a SIGABRT here:

Thread 1 (Thread 0x7ff349ffb700 (LWP 19242)):
#0  0x00007ff3a76df1d7 in sigismember () from /lib64/libc.so.6
#1  0x00007ff3a76e08c8 in _quicksort () from /lib64/libc.so.6
#2  0x0000000000000000 in ?? ()

I’ll try to find out the reason for this; currently I have no clue. Could you open the core on your machine:

$ gdb `which oned` <path_to_core>
> thread apply all bt 

just to check if you have more information, as the libc library in my install differs from yours.

Sure. Here you are. Latest coredump from today 11:04:

[root@sun02 tmp]# gdb `which oned` /tmp/core-oned-sig6-user9869-group9869-pid9739-time1501232690
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/oned...Reading symbols from /usr/bin/oned...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 28842]
[New LWP 28843]
[New LWP 9789]
[New LWP 9756]
[New LWP 9784]
[New LWP 9777]
[New LWP 9775]
[New LWP 9790]
[New LWP 9781]
[New LWP 9783]
[New LWP 9755]
[New LWP 9739]
[New LWP 9782]
[New LWP 9778]
[New LWP 9788]
[New LWP 9774]
[New LWP 9785]
[New LWP 9779]
[New LWP 9791]
[New LWP 9793]
[New LWP 9787]
[New LWP 9786]
[New LWP 28841]
[New LWP 28840]
[New LWP 9994]
[New LWP 9993]
[New LWP 9792]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/oned -f'.
Program terminated with signal 6, Aborted.
#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install opennebula-server-5.4.0-1.x86_64
(gdb) thread apply all bt

Thread 27 (Thread 0x7f5b09ffb700 (LWP 9792)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7f5b08ff9700 (LWP 9993)):
#0  0x00007f5b301836d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x0000000000462f27 in rm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f5ad3fff700 (LWP 9994)):
#0  0x00007f5b2f682e2d in poll () from /lib64/libc.so.6
#1  0x000000000060326d in chanSwitchAccept ()
#2  0x00000000005fa642 in ChanSwitchAccept ()
#3  0x0000000000600bc5 in ServerRun ()
#4  0x00000000005f3b8e in xmlrpc_c::setupSignalsAndRunAbyss(TServer*) ()
#5  0x00000000005f3bc8 in xmlrpc_c::serverAbyss_impl::run() ()
#6  0x00000000005f3f5b in xmlrpc_c::serverAbyss::run() ()
#7  0x00000000004630c9 in rm_xml_server_loop ()
#8  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7f5ad37fe700 (LWP 28840)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005e354d in ReplicaThread::do_replication() ()
#2  0x00000000005e343d in replication_thread ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7f5ad2ffd700 (LWP 28841)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005e354d in ReplicaThread::do_replication() ()
#2  0x00000000005e343d in replication_thread ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7f5b11ffb700 (LWP 9786)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7f5b117fa700 (LWP 9787)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x0000000000500c4c in authm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7f5b097fa700 (LWP 9793)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x00000000005cf879 in ipamm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7f5b0a7fc700 (LWP 9791)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x00000000005cc059 in marketplace_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7f5b28c07700 (LWP 9779)):
#0  0x00007f5b301836d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x0000000000450e5e in lcm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f5b127fc700 (LWP 9785)):
#0  0x00007f5b301836d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000004ccd38 in dm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f5b2ac0b700 (LWP 9774)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000005de14f in ActionManager::loop(timespec&) ()
#3  0x00000000005da2d8 in raft_manager_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f5b10ff9700 (LWP 9788)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f5b29408700 (LWP 9778)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x000000000042f613 in vmm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f5b13fff700 (LWP 9782)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x000000000045f2f5 in im_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f5b31d91840 (LWP 9739)):
#0  0x00007f5b30187101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000414187 in Nebula::start(bool) ()
#2  0x000000000040c9f2 in oned_main() ()
#3  0x000000000040cd18 in main ()

Thread 11 (Thread 0x7f5b2be20700 (LWP 9755)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f5b137fe700 (LWP 9783)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f5b0bfff700 (LWP 9781)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f5b0affd700 (LWP 9790)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f5b2a40a700 (LWP 9775)):
#0  0x00007f5b301836d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000005e46b9 in frm_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f5b29c09700 (LWP 9777)):
#0  0x00007f5b2f684bd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f5b12ffd700 (LWP 9784)):
#0  0x00007f5b301836d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000004d5c4a in tm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f5b2b40c700 (LWP 9756)):
#0  0x00007f5b301836d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000005a1e24 in hm_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5b0b7fe700 (LWP 9789)):
#0  0x00007f5b30183a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x00000000005214dd in image_action_loop ()
#4  0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f5ad1ffb700 (LWP 28843)):
#0  0x00007f5b2f61c870 in strdup () from /lib64/libc.so.6
#1  0x00007f5b30b331b7 in setstropt () from /lib64/libcurl.so.4
#2  0x00007f5b30b33ed5 in Curl_init_userdefined () from /lib64/libcurl.so.4
#3  0x00007f5b30b33fae in Curl_open () from /lib64/libcurl.so.4
#4  0x00007f5b30b43164 in curl_easy_init () from /lib64/libcurl.so.4
#5  0x00000000005ef6cd in create ()
#6  0x00000000005ed1bb in xmlrpc_c::clientXmlTransport_curl::initialize(xmlrpc_c::clientXmlTransport_curl::constrOpt const&) ()
#7  0x00000000005ed2fc in xmlrpc_c::clientXmlTransport_curl::clientXmlTransport_curl(xmlrpc_c::clientXmlTransport_curl::constrOpt const&) ()
#8  0x00000000005b07c6 in Client::call(std::string const&, std::string const&, xmlrpc_c::paramList const&, unsigned int, xmlrpc_c::value*, std::string&) ()
#9  0x00000000005dcb96 in RaftManager::xmlrpc_replicate_log(int, LogDBRecord*, bool&, unsigned int&, std::string&) ()
#10 0x00000000005e3d05 in HeartBeatThread::replicate() ()
#11 0x00000000005e35b0 in ReplicaThread::do_replication() ()
#12 0x00000000005e343d in replication_thread ()
#13 0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f5b2f68d76d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f5ad27fc700 (LWP 28842)):
#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
#1  0x00007f5b2f5cc8c8 in abort () from /lib64/libc.so.6
#2  0x00007f5b2fecf9d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007f5b2fecd946 in ?? () from /lib64/libstdc++.so.6
#4  0x00007f5b2fecd973 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007f5b2fecdb93 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x000000000060948e in (anonymous namespace)::throwIfError ()
#7  0x000000000060c640 in xmlrpc_c::cNewStringWrapper::cNewStringWrapper(std::string, xmlrpc_c::value_string::nlCode) ()
#8  0x000000000060ad68 in xmlrpc_c::value_string::value_string(std::string const&) ()
#9  0x00000000005dc74d in RaftManager::xmlrpc_replicate_log(int, LogDBRecord*, bool&, unsigned int&, std::string&) ()
#10 0x00000000005e3d05 in HeartBeatThread::replicate() ()
#11 0x00000000005e35b0 in ReplicaThread::do_replication() ()
#12 0x00000000005e343d in replication_thread ()
#13 0x00007f5b3017fdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f5b2f68d76d in clone () from /lib64/libc.so.6
(gdb)

OK, that makes a bit more sense. libxmlrpc is raising an error while creating a string argument for a heartbeat. Since nothing catches the exception, std::terminate() aborts the process. The string can be either the oneadmin session string or the command, which is just “”. The session string is always the same, so if it were the problem (for example, because of non-UTF-8 characters) heartbeats should never work, and I don't think that is the case.

So could you check the memory of the process, i.e. monitor whether the RSS increases? Do this on the leader, as it is the only process sending heartbeats.
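
One way to watch the RSS over time is to sample VmRSS from /proc. A minimal sketch, assuming a Linux /proc filesystem; rss_kb is a hypothetical helper name, and the 60-second interval and the log path are arbitrary.

```shell
#!/bin/sh
# Print the resident set size of a process in kB, read from /proc.
# Returns non-zero if the process does not exist.
rss_kb() {
    [ -r "/proc/$1/status" ] || return 1
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Possible use on the leader (pgrep -o picks the oldest matching process):
#   pid=$(pgrep -o -x oned)
#   while sleep 60; do
#       printf '%s %s\n' "$(date +%s)" "$(rss_kb "$pid")"
#   done >> /tmp/oned-rss.log
```

A steadily growing second column in the log over the hours before a crash would point towards a leak in the heartbeat path.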

Also, could you check another core, just to verify that all of them are failing because of:

#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
#1  0x00007f5b2f5cc8c8 in abort () from /lib64/libc.so.6
#2  0x00007f5b2fecf9d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007f5b2fecd946 in ?? () from /lib64/libstdc++.so.6
#4  0x00007f5b2fecd973 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007f5b2fecdb93 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x000000000060948e in (anonymous namespace)::throwIfError ()
#7  0x000000000060c640 in xmlrpc_c::cNewStringWrapper::cNewStringWrapper(std::string, xmlrpc_c::value_string::nlCode) ()
#8  0x000000000060ad68 in xmlrpc_c::value_string::value_string(std::string const&) ()
#9  0x00000000005dc74d in RaftManager::xmlrpc_replicate_log(int, LogDBRecord*, bool&, unsigned int&, std::string&) ()
#10 0x00000000005e3d05 in HeartBeatThread::replicate() ()
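
To run the same check over several cores without an interactive session, gdb can be driven in batch mode and the backtrace filtered for the two frames of interest. A sketch only: is_heartbeat_abort is a hypothetical helper, and matching on the symbol names throwIfError and xmlrpc_replicate_log assumes the backtraces look like the one above.

```shell
#!/bin/sh
# Read a gdb backtrace on stdin; succeed only if it contains both the
# xmlrpc-c throwIfError frame and RaftManager::xmlrpc_replicate_log.
is_heartbeat_abort() {
    bt=$(cat)
    printf '%s\n' "$bt" | grep -q 'throwIfError' &&
        printf '%s\n' "$bt" | grep -q 'RaftManager::xmlrpc_replicate_log'
}

# Possible use (the core file glob follows the dump names in this thread):
#   for core in /tmp/core-oned-sig6-*; do
#       if gdb -batch -ex bt "$(which oned)" "$core" 2>/dev/null | is_heartbeat_abort
#       then echo "$core: same abort"
#       else echo "$core: different backtrace"
#       fi
#   done
```

Here bt prints only the thread gdb selects on load, which for a core file is the thread that received the signal, so idle threads that merely sit in xmlrpc_replicate_log do not produce false matches.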

It could also be the error returned by the follower, in the previous core. Could you please run:

$ gdb `which oned` <path_to_core>
> thread 1
> fr 9
> list
> p values
> p lr

The output for your latest comment:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/oned -f'.
Program terminated with signal 6, Aborted.
#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install opennebula-server-5.4.0-1.x86_64
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f5ad27fc700 (LWP 28842))]
#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
(gdb) fr 9
#9  0x00000000005dc74d in RaftManager::xmlrpc_replicate_log(int, LogDBRecord*, bool&, unsigned int&, std::string&) ()
(gdb) list
No symbol table is loaded.  Use the "file" command.
(gdb) p values
No symbol table is loaded.  Use the "file" command.
(gdb) p lr
No symbol table is loaded.  Use the "file" command.
(gdb)

And the gdb output for another dump on another host:

[root@sun01 ~]# gdb `which oned` /tmp/core-oned-sig6-user9869-group9869-pid
core-oned-sig6-user9869-group9869-pid12790-time1501230458  core-oned-sig6-user9869-group9869-pid23206-time1501175102  core-oned-sig6-user9869-group9869-pid7964-time1501153443
[root@sun01 ~]# gdb `which oned` /tmp/core-oned-sig6-user9869-group9869-pid12790-time1501230458
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/oned...Reading symbols from /usr/bin/oned...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 29550]
[New LWP 29551]
[New LWP 12825]
[New LWP 12831]
[New LWP 12826]
[New LWP 12833]
[New LWP 12824]
[New LWP 12805]
[New LWP 12838]
[New LWP 12828]
[New LWP 12806]
[New LWP 12832]
[New LWP 12790]
[New LWP 12834]
[New LWP 29548]
[New LWP 13042]
[New LWP 12830]
[New LWP 12837]
[New LWP 12835]
[New LWP 12839]
[New LWP 12836]
[New LWP 12841]
[New LWP 12840]
[New LWP 13041]
[New LWP 12843]
[New LWP 12842]
[New LWP 29549]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/oned -f'.
Program terminated with signal 6, Aborted.
#0  0x00007f38dae141d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install opennebula-server-5.4.0-1.x86_64
(gdb) thread apply all bt

Thread 27 (Thread 0x7f387a7fc700 (LWP 29549)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005e354d in ReplicaThread::do_replication() ()
#2  0x00000000005e343d in replication_thread ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7f38ad7fa700 (LWP 12842)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f38acff9700 (LWP 12843)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x00000000005cf879 in ipamm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7f387bfff700 (LWP 13041)):
#0  0x00007f38db9cc6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x0000000000462f27 in rm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7f38ae7fc700 (LWP 12840)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7f38adffb700 (LWP 12841)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x00000000005cc059 in marketplace_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7f38bcff9700 (LWP 12836)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7f38aeffd700 (LWP 12839)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x00000000005214dd in image_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7f38bd7fa700 (LWP 12835)):
#0  0x00007f38db9cc6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000004ccd38 in dm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7f38affff700 (LWP 12837)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x0000000000500c4c in authm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f38bffff700 (LWP 12830)):
#0  0x00007f38db9cc6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x0000000000450e5e in lcm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f387b7fe700 (LWP 13042)):
#0  0x00007f38daecbe2d in poll () from /lib64/libc.so.6
#1  0x000000000060326d in chanSwitchAccept ()
#2  0x00000000005fa642 in ChanSwitchAccept ()
#3  0x0000000000600bc5 in ServerRun ()
#4  0x00000000005f3b8e in xmlrpc_c::setupSignalsAndRunAbyss(TServer*) ()
#5  0x00000000005f3bc8 in xmlrpc_c::serverAbyss_impl::run() ()
#6  0x00000000005f3f5b in xmlrpc_c::serverAbyss::run() ()
#7  0x00000000004630c9 in rm_xml_server_loop ()
#8  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f387affd700 (LWP 29548)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005e354d in ReplicaThread::do_replication() ()
#2  0x00000000005e343d in replication_thread ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f38bdffb700 (LWP 12834)):
#0  0x00007f38db9cc6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000004d5c4a in tm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f38dd5da840 (LWP 12790)):
#0  0x00007f38db9d0101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000414187 in Nebula::start(bool) ()
#2  0x000000000040c9f2 in oned_main() ()
#3  0x000000000040cd18 in main ()

Thread 12 (Thread 0x7f38beffd700 (LWP 12832)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x000000000045f2f5 in im_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f38d6c55700 (LWP 12806)):
#0  0x00007f38db9cc6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000005a1e24 in hm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f38d4c51700 (LWP 12828)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x000000000043aa17 in ActionManager::loop(long) ()
#3  0x000000000042f613 in vmm_action_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f38af7fe700 (LWP 12838)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f38d7669700 (LWP 12805)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f38d6454700 (LWP 12824)):
#0  0x00007f38db9cca82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a30f5 in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000005de14f in ActionManager::loop(timespec&) ()
#3  0x00000000005da2d8 in raft_manager_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f38be7fc700 (LWP 12833)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f38d5452700 (LWP 12826)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f38bf7fe700 (LWP 12831)):
#0  0x00007f38daecdbd3 in select () from /lib64/libc.so.6
#1  0x000000000050f488 in MadManager::listener() ()
#2  0x000000000050eeb0 in mad_manager_listener ()
#3  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f38d5c53700 (LWP 12825)):
#0  0x00007f38db9cc6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005a314d in ActionManager::loop(timespec&, ActionRequest const&) ()
#2  0x00000000004516b1 in ActionManager::loop() ()
#3  0x00000000005e46b9 in frm_loop ()
#4  0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f38797fa700 (LWP 29551)):
#0  0x00007f38dae5c38e in _int_malloc () from /lib64/libc.so.6
#1  0x00007f38dae5efbc in malloc () from /lib64/libc.so.6
#2  0x00007f38dc391a62 in Curl_llist_alloc () from /lib64/libcurl.so.4
#3  0x00007f38dc391e5d in Curl_hash_init () from /lib64/libcurl.so.4
#4  0x00007f38dc391f4f in Curl_hash_alloc () from /lib64/libcurl.so.4
#5  0x00007f38dc392d57 in curl_multi_init () from /lib64/libcurl.so.4
#6  0x00000000005f0a36 in curlMulti_create ()
#7  0x00000000005ef92d in create ()
#8  0x00000000005ed1bb in xmlrpc_c::clientXmlTransport_curl::initialize(xmlrpc_c::clientXmlTransport_curl::constrOpt const&) ()
#9  0x00000000005ed2fc in xmlrpc_c::clientXmlTransport_curl::clientXmlTransport_curl(xmlrpc_c::clientXmlTransport_curl::constrOpt const&) ()
#10 0x00000000005b07c6 in Client::call(std::string const&, std::string const&, xmlrpc_c::paramList const&, unsigned int, xmlrpc_c::value*, std::string&) ()
#11 0x00000000005dcb96 in RaftManager::xmlrpc_replicate_log(int, LogDBRecord*, bool&, unsigned int&, std::string&) ()
#12 0x00000000005e3d05 in HeartBeatThread::replicate() ()
#13 0x00000000005e35b0 in ReplicaThread::do_replication() ()
#14 0x00000000005e343d in replication_thread ()
#15 0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f38daed676d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f3879ffb700 (LWP 29550)):
#0  0x00007f38dae141d7 in raise () from /lib64/libc.so.6
#1  0x00007f38dae158c8 in abort () from /lib64/libc.so.6
#2  0x00007f38db7189d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007f38db716946 in ?? () from /lib64/libstdc++.so.6
#4  0x00007f38db716973 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007f38db716b93 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x000000000060948e in (anonymous namespace)::throwIfError ()
#7  0x000000000060c640 in xmlrpc_c::cNewStringWrapper::cNewStringWrapper(std::string, xmlrpc_c::value_string::nlCode) ()
#8  0x000000000060ad68 in xmlrpc_c::value_string::value_string(std::string const&) ()
#9  0x00000000005dc74d in RaftManager::xmlrpc_replicate_log(int, LogDBRecord*, bool&, unsigned int&, std::string&) ()
#10 0x00000000005e3d05 in HeartBeatThread::replicate() ()
#11 0x00000000005e35b0 in ReplicaThread::do_replication() ()
#12 0x00000000005e343d in replication_thread ()
#13 0x00007f38db9c8dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f38daed676d in clone () from /lib64/libc.so.6
(gdb)

You need to install the debug package in order to get the symbols:

yum install opennebula-debuginfo

Done.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/oned -f'.
Program terminated with signal 6, Aborted.
#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 glibc-2.17-157.el7_3.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libcom_err-1.42.9-9.el7.x86_64 libcurl-7.29.0-35.el7.centos.x86_64 libgcc-4.8.5-11.el7.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-6.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libstdc++-4.8.5-11.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 mariadb-libs-5.5.52-1.el7.x86_64 nspr-4.13.1-1.0.el7_3.x86_64 nss-3.28.4-1.2.el7_3.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 nss-util-3.28.4-1.0.el7_3.x86_64 openldap-2.4.40-13.el7.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sqlite-3.7.17-8.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f5ad27fc700 (LWP 28842))]
#0  0x00007f5b2f5cb1d7 in raise () from /lib64/libc.so.6
(gdb) fr 9
#9  0x00000000005dc74d in RaftManager::xmlrpc_replicate_log (this=0x22d6490, follower_id=0, lr=0x7f5ad27fbb90, success=@0x7f5ad27fbd87: false, fterm=@0x7f5ad27fbd6c: 32603, error="")
    at src/raft/RaftManager.cc:1024
1024	    replica_params.add(xmlrpc_c::value_string(secret));
(gdb) list
1019	    }
1020
1021	    xmlrpc_c::value result;
1022	    xmlrpc_c::paramList replica_params;
1023
1024	    replica_params.add(xmlrpc_c::value_string(secret));
1025	    replica_params.add(xmlrpc_c::value_int(_server_id));
1026	    replica_params.add(xmlrpc_c::value_int(_commit));
1027	    replica_params.add(xmlrpc_c::value_int(_term));
1028	    replica_params.add(xmlrpc_c::value_int(lr->index));
(gdb) p values
No symbol "values" in current context.
(gdb) p lr
$1 = (LogDBRecord *) 0x7f5ad27fbb90
(gdb)
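
A note on why this ends in signal 6: frames #0 to #5 of thread 1 show __cxa_throw going straight into std::terminate() and abort(), which is what the C++ runtime does when an exception is thrown and no handler catches it. Frame #9 shows that the throw originates from building xmlrpc_c::value_string(secret). The sketch below only illustrates the general pattern of catching such a failure. make_param and replicate_guarded are hypothetical stand-ins, not OpenNebula or xmlrpc-c code, and the empty-string check is just a placeholder, since nothing here establishes why the constructor threw.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a constructor like xmlrpc_c::value_string:
// it throws when it cannot wrap the given text. The "empty" condition is
// an assumption used only to trigger the error path in this sketch.
std::string make_param(const std::string& text)
{
    if (text.empty())
    {
        throw std::runtime_error("cannot build parameter from empty text");
    }

    return "<string>" + text + "</string>";
}

// Guarded call: the exception is caught and turned into an error message,
// so it never propagates out of the calling thread. Without the try/catch,
// an exception escaping a thread's start function ends in std::terminate(),
// which by default calls abort() and raises SIGABRT.
bool replicate_guarded(const std::string& secret, std::string& error)
{
    try
    {
        make_param(secret);
        return true;
    }
    catch (const std::exception& e)
    {
        error = e.what();
        return false;
    }
}
```

With a guard like this, a bad value would cost one failed replication attempt instead of the whole daemon.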

OK, so it seems it's the secret. Could you print its value?

p secret

Maybe send me the output in a private message.

Just sent you a message.

Hi guys,
I have the same situation: opennebula.service died unexpectedly with no errors.

The only thing that I found is:

Fri Jul 28 15:32:56 2017 [Z0][DBM][E]: Log record 5813 loaded incorrectly. Record index: 5814 fed. index: 0 sql command: . Operation return code: 0
Fri Jul 28 15:32:56 2017 [Z0][DBM][E]: Log record 5814 loaded incorrectly. Record index: 5815 fed. index: 0 sql command: . Operation return code: 0
Fri Jul 28 15:32:57 2017 [Z0][DBM][E]: Log record 5815 loaded incorrectly. Record index: 5816 fed. index: 0 sql command: . Operation return code: 0

Leonid, it seems there is a bug that may abort oned when sending a heartbeat. I haven't tried it, but setting ONE_AUTH instead of relying on the default path may work around the issue until a proper patch is released.
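
To make the workaround concrete, here is a rough sketch of the lookup order it relies on: an explicit ONE_AUTH value wins, otherwise the default path under the user's home directory ($HOME/.one/one_auth) is used. This is not taken from the OpenNebula source. resolve_auth_path and auth_path_from_environment are hypothetical names, and treating an empty variable as unset is an assumption.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: choose the auth file path.
// An explicit ONE_AUTH value takes precedence; otherwise fall back to the
// default location under the given home directory.
std::string resolve_auth_path(const char* one_auth_env, const std::string& home)
{
    // Assumption: an empty variable is treated the same as an unset one.
    if (one_auth_env != nullptr && one_auth_env[0] != '\0')
    {
        return std::string(one_auth_env);
    }

    return home + "/.one/one_auth";
}

// Convenience wrapper that reads ONE_AUTH from the process environment.
std::string auth_path_from_environment(const std::string& home)
{
    return resolve_auth_path(std::getenv("ONE_AUTH"), home);
}
```

When ONE_AUTH is set, the fallback branch is never reached, which is why setting it could sidestep a problem in the default-path code.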

What do you think? When will corrected packages be available?
We would like to go live with 5.4.

We want to run 5.4 for another month before releasing an update. In the meantime, you can apply the following patch, which contains the final fixes for the issues found in 5.4 (note that we have workarounds for most of them):

https://github.com/OpenNebula/one/commit/a6addb314e63361aeabc2a63803572456debd85c