Oned process segfault (error 4 in libc-2.17.so)

Hi, all

Recently I’ve started running the latest OpenNebula 5.0.1 as a 2-node HA cluster (corosync + pacemaker) with a Ceph “Jewel” backend. Everything works as expected, but I’ve hit a strange stability issue with the “oned” process: it randomly crashes on two completely different hardware platforms.

Ok, I have two hosts with the same OS - CentOS 7.2 x86_64:

uname -a
Linux 4.6.3-1.el7.elrepo.x86_64 #1 SMP x86_64 GNU/Linux

But with different hardware:

[root@nebula-1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 5000V Chipset Memory Controller Hub (rev b1)
05:0e.0 RAID bus controller: Adaptec AAC-RAID
07:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
0b:0c.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] ES1000 (rev 02)

[root@nebula-2 lock]# lspci
00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:14.0 USB controller: Intel Corporation 9 Series Chipset Family USB xHCI Controller
00:16.0 Communication controller: Intel Corporation 9 Series Chipset Family ME Interface #1
00:1f.3 SMBus: Intel Corporation 9 Series Chipset Family SMBus Controller
01:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)
06:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
07:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

The error is:

Jul 14 23:27:42 nebula-2 kernel: oned[8051]: segfault at 7f0b14218000 ip 00007f0b5b7e1138 sp 00007f0b2fffe598 error 4 in libc-2.17.so[7f0b5b697000+1b7000]
systemd: opennebula.service: main process exited, code=killed, status=11/SEGV

Jul 15 00:06:01 nebula-1 kernel: oned[12086]: segfault at ffffffffffffffe8 ip 00007fb6a0f8e583 sp 00007fb6817f9648 error 5 in libstdc++.so.6.0.19[7fb6a0ef9000+e9000]
systemd: opennebula.service: main process exited, code=killed, status=11/SEGV

Jul 19 09:06:18 nebula-2 kernel: oned[30185]: segfault at 7f5068217000 ip 00007f50af55d138 sp 00007f5083ffe598 error 4 in libc-2.17.so[7f50af413000+1b7000]
systemd: opennebula.service: main process exited, code=killed, status=11/SEGV

After such a crash the “oned” process cannot be restarted on the host by Linux HA because of the error: “Cannot start oned, opening lock file /var/lock/one/one”.

I can start it only after manually deleting the lock file /var/lock/one/one.
The lock file itself is not a big problem, but the segmentation fault really worries me, because I’m about to put OpenNebula into production and this issue is preventing me from doing so.
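
For reference, the manual recovery currently boils down to something like this, run as root on the node where oned crashed (just a sketch of the workaround described above):

rm -f /var/lock/one/one      # remove the stale lock left behind by the killed oned
systemctl start opennebula   # or let Pacemaker bring the resource back on its next attempt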

Does anyone else have this problem, or does it mean there is something wrong with my software environment?
Or maybe somebody can help me understand and then eliminate the source of the error?

Regards,
Sola

It is strange indeed. Can you give us more details? Is there any relevant info in the logs? Maybe you have the core in the system (e.g. coredumpctl…)?

Hi, Ruben!

Sorry, I didn’t find any dump file :frowning:

And here we go with logs :slight_smile:

oned.log:

Tue Jul 19 09:03:07 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Tue Jul 19 09:03:09 2016 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.

Tue Jul 19 09:03:19 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:03:24 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:03:33 2016 [Z0][ReM][D]: Req:4000 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:03:33 2016 [Z0][ReM][D]: Req:4000 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:03:34 2016 [Z0][ReM][D]: Req:7200 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:03:34 2016 [Z0][ReM][D]: Req:7200 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:03:39 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:03:44 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:03:48 2016 [Z0][InM][D]: Monitoring datastore default (1)
Tue Jul 19 09:03:48 2016 [Z0][InM][D]: Monitoring datastore files (2)
Tue Jul 19 09:03:49 2016 [Z0][InM][D]: Monitoring datastore cephds (102)
Tue Jul 19 09:03:49 2016 [Z0][InM][D]: Monitoring datastore system_ceph (104)
Tue Jul 19 09:03:49 2016 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Tue Jul 19 09:03:49 2016 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Tue Jul 19 09:03:49 2016 [Z0][ImM][D]: Datastore system_ceph (104) successfully monitored.
Tue Jul 19 09:03:49 2016 [Z0][ImM][D]: Datastore cephds (102) successfully monitored.
Tue Jul 19 09:03:53 2016 [Z0][VMM][I]: --Mark--
Tue Jul 19 09:04:03 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Tue Jul 19 09:04:04 2016 [Z0][ReM][D]: Req:4800 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:04:04 2016 [Z0][ReM][D]: Req:4800 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:04:04 2016 [Z0][ReM][D]: Req:8800 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:04:04 2016 [Z0][ReM][D]: Req:8800 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:04:04 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:04:04 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:04:24 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:04:24 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:04:33 2016 [Z0][ReM][D]: Req:5600 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:04:33 2016 [Z0][ReM][D]: Req:5600 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:04:34 2016 [Z0][ReM][D]: Req:2800 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:04:34 2016 [Z0][ReM][D]: Req:2800 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:04:44 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:04:44 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:04:46 2016 [Z0][InM][D]: Monitoring datastore default (1)
Tue Jul 19 09:04:46 2016 [Z0][InM][D]: Monitoring datastore files (2)
Tue Jul 19 09:04:46 2016 [Z0][InM][D]: Monitoring datastore cephds (102)
Tue Jul 19 09:04:46 2016 [Z0][InM][D]: Monitoring datastore system_ceph (104)
Tue Jul 19 09:04:47 2016 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Tue Jul 19 09:04:47 2016 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Tue Jul 19 09:04:47 2016 [Z0][ImM][D]: Datastore system_ceph (104) successfully monitored.
Tue Jul 19 09:04:47 2016 [Z0][ImM][D]: Datastore cephds (102) successfully monitored.
Tue Jul 19 09:05:00 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Tue Jul 19 09:05:03 2016 [Z0][ReM][D]: Req:3200 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:05:03 2016 [Z0][ReM][D]: Req:3200 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:05:04 2016 [Z0][ReM][D]: Req:1200 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:05:04 2016 [Z0][ReM][D]: Req:1200 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:05:04 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:05:04 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:05:22 2016 [Z0][MKP][E]: Error parsing marketplace information: syntax error, unexpected VARIABLE, expecting EQUAL or EQUAL_EMPTY at line 2, columns 7:10. Monitoring information:86
Tue Jul 19 09:05:24 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:05:24 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:05:28 2016 [Z0][ImM][I]: --Mark--
Tue Jul 19 09:05:33 2016 [Z0][ReM][D]: Req:0 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:05:34 2016 [Z0][ReM][D]: Req:0 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:05:34 2016 [Z0][ReM][D]: Req:8000 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:05:34 2016 [Z0][ReM][D]: Req:8000 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:05:42 2016 [Z0][InM][D]: Monitoring datastore default (1)
Tue Jul 19 09:05:42 2016 [Z0][InM][D]: Monitoring datastore files (2)
Tue Jul 19 09:05:42 2016 [Z0][InM][D]: Monitoring datastore cephds (102)
Tue Jul 19 09:05:42 2016 [Z0][InM][D]: Monitoring datastore system_ceph (104)
Tue Jul 19 09:05:43 2016 [Z0][ImM][D]: Datastore system_ceph (104) successfully monitored.
Tue Jul 19 09:05:43 2016 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Tue Jul 19 09:05:43 2016 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Tue Jul 19 09:05:43 2016 [Z0][ImM][D]: Datastore cephds (102) successfully monitored.
Tue Jul 19 09:05:44 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:05:44 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.
Tue Jul 19 09:05:56 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Tue Jul 19 09:06:04 2016 [Z0][ReM][D]: Req:5200 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:06:04 2016 [Z0][ReM][D]: Req:5200 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"
Tue Jul 19 09:06:04 2016 [Z0][ReM][D]: Req:5600 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Tue Jul 19 09:06:04 2016 [Z0][ReM][D]: Req:5600 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL></VM_POOL>"

Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: Command execution fail: /var/lib/one/remotes/market/one/monitor PE1BUktFVF9EUklWRVJfQUNUSU9OX0RBVEE+PE1BUktFVFBMQUNFPjxJRD4wPC9JRD48VUlEPjA8L1VJRD48R0lEPjA8L0dJRD48VU5BTUU+b25lYWRtaW48L1VOQU1FPjxHTkFNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+T3Blbk5lYnVsYSBQdWJsaWM8L05BTUU+PE1BUktFVF9NQUQRVNDUklQVElPTj48TUFSS0VUX01BRD48IVtDREFUQVtvbmVdXT48L01BUktFVF9NQUQ+PC9URU1QTEFURT48L01BUktFVFBMQUNFPjwvTUFSS0VUX0RSSVZFUl9BQ1RJT05fREFUQT4= 0
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: /usr/share/ruby/net/protocol.rb:158:in `rescue in rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/protocol.rb:152:in `rbuf_fill'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/protocol.rb:134:in `readuntil'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/protocol.rb:144:in `readline'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http/response.rb:39:in `read_status_line'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http/response.rb:28:in `read_new'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http.rb:1406:in `block in transport_request'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http.rb:1403:in `catch'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http.rb:1403:in `transport_request'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http.rb:1376:in `request'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /var/lib/one/remotes/market/one/monitor:42:in `block in get'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http.rb:852:in `start'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /usr/share/ruby/net/http.rb:582:in `start'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /var/lib/one/remotes/market/one/monitor:41:in `get'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /var/lib/one/remotes/market/one/monitor:53:in `get_appliances'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: from /var/lib/one/remotes/market/one/monitor:157:in `<main>'
Tue Jul 19 09:06:04 2016 [Z0][MKP][I]: ExitCode: 1
Tue Jul 19 09:06:04 2016 [Z0][MKP][E]: Error monitoring datastore 0: -
Tue Jul 19 09:06:04 2016 [Z0][InM][D]: Host nebula-2 (1) successfully monitored.
Tue Jul 19 09:06:04 2016 [Z0][InM][D]: Host nebula-1 (0) successfully monitored.

SCHED.log:

Tue Jul 19 09:06:42 2016 [Z0][SCHED][I]: Starting Scheduler Daemon
----------------------------------------
        Scheduler Configuration File
----------------------------------------
DEFAULT_DS_SCHED=POLICY=1
DEFAULT_SCHED=POLICY=1
LIVE_RESCHEDS=0
LOG=DEBUG_LEVEL=3,SYSTEM=file
MAX_DISPATCH=30
MAX_HOST=1
MAX_VM=5000
MESSAGE_SIZE=1073741824
ONE_XMLRPC=http://localhost:2633/RPC2
SCHED_INTERVAL=30
TIMEOUT=60
----------------------------------------
Tue Jul 19 09:06:42 2016 [Z0][SCHED][I]: XML-RPC client using 1073741824 bytes for response buffer.

Tue Jul 19 09:06:42 2016 [Z0][SCHED][E]: Cannot contact oned, will retry… Error: HTTP POST to URL ‘http://localhost:2633/RPC2’ failed. libcurl failed even to execute the HTTP transaction, explaining: Failed connect to localhost:2633; Connection refused

messages log:

Jul 19 09:06:18 nebula-2 kernel: oned[30185]: segfault at 7f5068217000 ip 00007f50af55d138 sp 00007f5083ffe598 error 4 in libc-2.17.so[7f50af413000+1b7000]
Jul 19 09:06:18 nebula-2 systemd: opennebula.service: main process exited, code=killed, status=11/SEGV
Jul 19 09:06:18 nebula-2 systemd: opennebula.service: control process exited, code=exited status=1
Jul 19 09:06:18 nebula-2 systemd: Stopped Cluster Controlled opennebula.
Jul 19 09:06:18 nebula-2 systemd: Unit opennebula.service entered failed state.
Jul 19 09:06:18 nebula-2 systemd: opennebula.service failed.
Jul 19 09:06:18 nebula-2 systemd: Stopping OpenNebula Cloud Scheduler Daemon…
Jul 19 09:06:18 nebula-2 systemd: Stopped OpenNebula Cloud Scheduler Daemon.
Jul 19 09:06:35 nebula-2 systemd: Reloading.
Jul 19 09:06:35 nebula-2 systemd: [/usr/lib/systemd/system/ceph-osd@.service:18] Unknown lvalue 'TasksMax' in section 'Service'
Jul 19 09:06:35 nebula-2 systemd: [/usr/lib/systemd/system/ceph-mds@.service:18] Unknown lvalue 'TasksMax' in section 'Service'
Jul 19 09:06:35 nebula-2 systemd: [/usr/lib/systemd/system/ceph-mon@.service:24] Unknown lvalue 'TasksMax' in section 'Service'
Jul 19 09:06:35 nebula-2 systemd: Configuration file /usr/lib/systemd/system/auditd.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jul 19 09:06:35 nebula-2 systemd: Configuration file /usr/lib/systemd/system/opennebula.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jul 19 09:06:35 nebula-2 systemd: Configuration file /run/systemd/system/opennebula.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jul 19 09:06:35 nebula-2 systemd: Configuration file /run/systemd/system/opennebula-sunstone.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jul 19 09:06:35 nebula-2 systemd: Configuration file /usr/lib/systemd/system/opennebula-novnc.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jul 19 09:06:35 nebula-2 systemd: Stopping Cluster Controlled opennebula-sunstone…
Jul 19 09:06:36 nebula-2 ruby: == Sinatra has ended his set (crowd applauds)
Jul 19 09:06:36 nebula-2 ruby: Thin web server (v1.7.0 codename Dunder Mifflin)
Jul 19 09:06:36 nebula-2 ruby: Maximum connections set to 1024
Jul 19 09:06:36 nebula-2 ruby: Listening on 0.0.0.0:9869, CTRL+C to stop
Jul 19 09:06:36 nebula-2 ruby: Stopping …
Jul 19 09:06:36 nebula-2 systemd: Stopped Cluster Controlled opennebula-sunstone.
Jul 19 09:06:36 nebula-2 systemd: Stopping OpenNebula noVNC Server…
Jul 19 09:06:36 nebula-2 systemd: Stopped OpenNebula noVNC Server.

And here we can see that Linux HA tried to restart OpenNebula but failed, and for some reason it could not migrate it to the other host (I guess because the “oned” process was killed by the segfault and the resource group just stopped all services except the ClusterIP address):

Jul 19 09:06:42 nebula-2 systemd: Starting Cluster Controlled opennebula…
Jul 19 09:06:42 nebula-2 systemd: Started Cluster Controlled opennebula.
Jul 19 09:06:42 nebula-2 systemd: Started OpenNebula Cloud Scheduler Daemon.
Jul 19 09:06:42 nebula-2 systemd: Starting OpenNebula Cloud Scheduler Daemon…
Jul 19 09:06:42 nebula-2 oned: Error: Cannot start oned, opening lock file /var/lock/one/one
Jul 19 09:06:42 nebula-2 systemd: opennebula.service: main process exited, code=exited, status=255/n/a
Jul 19 09:06:42 nebula-2 systemd: Stopping OpenNebula Cloud Scheduler Daemon…
Jul 19 09:06:42 nebula-2 systemd: opennebula.service: control process exited, code=exited status=1
Jul 19 09:06:42 nebula-2 systemd: Stopped Cluster Controlled opennebula.
Jul 19 09:06:42 nebula-2 systemd: Unit opennebula.service entered failed state.
Jul 19 09:06:42 nebula-2 systemd: opennebula.service failed.
Jul 19 09:06:42 nebula-2 systemd: Stopped OpenNebula Cloud Scheduler Daemon.
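
For completeness, this is roughly what can be checked on the Pacemaker side after such a failure (a sketch; “opennebula” below is just a placeholder for the resource name in my setup, yours may differ):

pcs status --full                        # failed actions and where the group is allowed to run
pcs resource failcount show opennebula   # a high failcount can block further start attempts
pcs constraint show                      # location/colocation rules that may pin the group to one node
pcs resource cleanup opennebula          # clear the failure so the cluster tries to start it again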

Regards,
Sola

I think you are being hit by this one:

http://dev.opennebula.org/projects/opennebula/repository/revisions/63e736ab3de29f3db69d4f02e96985f115e5e961

This is now fixed in the repository and will be included in a maintenance release this month.

Do you think that this parsing error can lead to the crash of the entire “oned” process?

Of course, I’ll try the new maintenance release as soon as it arrives, but can I somehow fix this error now?

And another question: can I configure OpenNebula to write a core dump when one of its processes crashes?
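
In the meantime I’m thinking of trying something like this to catch a core on the next crash (just a sketch, not yet verified on my CentOS 7 hosts; note that abrt or systemd-coredump may already own kernel.core_pattern):

# allow the opennebula service to dump core (the drop-in name is my own choice)
mkdir -p /etc/systemd/system/opennebula.service.d
cat > /etc/systemd/system/opennebula.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
# write cores to a known location
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p.%t
systemctl daemon-reload
systemctl restart opennebula   # or restart it via the cluster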

Same problem here. We have been trying to get a core dump too, with no luck so far, so if anyone has a suggestion that would be great. We configured the service to restart on abnormal exit; we’ll see if it works.

Hi, Edouard

How did you configure the service to restart on abnormal exit? Do you have Linux HA in place, or just a simple setup?

-----Original Message-----
From: “Edouard” opennebula@discoursemail.com
Sent: 19.07.2016 20:27

Here is my /lib/systemd/system/opennebula.service :

[Unit]
Description=OpenNebula Cloud Controller Daemon
After=syslog.target
After=network.target
After=remote-fs.target
After=mariadb.service
Before=opennebula-scheduler.service
BindTo=opennebula-scheduler.service
 
[Service]
Type=simple
Group=oneadmin
User=oneadmin

ExecStart=/usr/bin/oned -f
ExecStartPre=/bin/mkdir -p /var/log/one
ExecStartPre=/bin/chown oneadmin:oneadmin /var/log/one
ExecStartPre=/usr/sbin/logrotate -s /tmp/logrotate.state -f /etc/logrotate.d/opennebula
ExecStopPost=/usr/bin/rm /var/lock/one/one
KillMode=process
KillSignal=SIGTERM

Restart=on-abnormal

[Install]
WantedBy=multi-user.target

But no segfault yet to test this…
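
If you don’t want to wait for a real crash, something like this should exercise the Restart= logic (a quick sketch, to be run on a test node):

systemctl start opennebula
kill -SEGV $(pidof oned)      # simulate the segfault we are seeing
sleep 5
systemctl status opennebula   # with Restart=on-abnormal the unit should come back instead of staying failed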

Thanks for sharing, I’ll try this config to see what happens when oned crashes.

It’s a pity that I can’t find any pattern behind the segfault; it happens randomly every 1-4 days.
I can only see that both servers don’t have any load and that it happens outside working hours :slight_smile:

I’ve caught another segfault; this time there are no errors about marketplace monitoring:

messages log

Jul 21 07:01:53 nebula-2 kernel: oned[15578]: segfault at 7fedf42c2000 ip 00007fee40d2411b sp 00007fee21ffa598 error 4 in libc-2.17.so[7fee40bda000+1b7000]
Jul 21 07:01:53 nebula-2 systemd: opennebula.service: main process exited, code=killed, status=11/SEGV
Jul 21 07:01:53 nebula-2 systemd: opennebula.service: control process exited, code=exited status=1
Jul 21 07:01:53 nebula-2 systemd: Stopped Cluster Controlled opennebula.
Jul 21 07:01:53 nebula-2 systemd: Unit opennebula.service entered failed state.
Jul 21 07:01:53 nebula-2 systemd: opennebula.service failed.
Jul 21 07:01:53 nebula-2 systemd: Stopping OpenNebula Cloud Scheduler Daemon…
Jul 21 07:01:53 nebula-2 systemd: Stopped OpenNebula Cloud Scheduler Daemon.
Jul 21 07:02:27 nebula-2 systemd: Reloading.

Jul 21 07:02:33 nebula-2 systemd: Starting Cluster Controlled opennebula…
Jul 21 07:02:33 nebula-2 systemd: Started Cluster Controlled opennebula.
Jul 21 07:02:33 nebula-2 systemd: Started OpenNebula Cloud Scheduler Daemon.
Jul 21 07:02:33 nebula-2 systemd: Starting OpenNebula Cloud Scheduler Daemon…
Jul 21 07:02:33 nebula-2 oned: Error: Cannot start oned, opening lock file /var/lock/one/one
Jul 21 07:02:33 nebula-2 systemd: opennebula.service: main process exited, code=exited, status=255/n/a
Jul 21 07:02:33 nebula-2 systemd: Stopping OpenNebula Cloud Scheduler Daemon…
Jul 21 07:02:33 nebula-2 systemd: opennebula.service: control process exited, code=exited status=1
Jul 21 07:02:33 nebula-2 systemd: Stopped Cluster Controlled opennebula.
Jul 21 07:02:33 nebula-2 systemd: Unit opennebula.service entered failed state.
Jul 21 07:02:33 nebula-2 systemd: opennebula.service failed.
Jul 21 07:02:33 nebula-2 systemd: Stopped OpenNebula Cloud Scheduler Daemon.

oned.log

Thu Jul 21 07:01:09 2016 [Z0][InM][D]: Monitoring datastore default (1)
Thu Jul 21 07:01:09 2016 [Z0][InM][D]: Monitoring datastore files (2)
Thu Jul 21 07:01:09 2016 [Z0][InM][D]: Monitoring datastore cephds (102)
Thu Jul 21 07:01:10 2016 [Z0][InM][D]: Monitoring datastore system_ceph (104)
Thu Jul 21 07:01:10 2016 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Thu Jul 21 07:01:10 2016 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Thu Jul 21 07:01:10 2016 [Z0][VMM][D]: VM 6 successfully monitored: DISK_SIZE=[ID=1,SIZE=1]
Thu Jul 21 07:01:10 2016 [Z0][ImM][D]: Datastore system_ceph (104) successfully monitored.
Thu Jul 21 07:01:10 2016 [Z0][ImM][D]: Datastore cephds (102) successfully monitored.
Thu Jul 21 07:01:19 2016 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Thu Jul 21 07:01:26 2016 [Z0][ReM][D]: Req:8528 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Thu Jul 21 07:01:26 2016 [Z0][ReM][D]: Req:8528 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>6</…"
Thu Jul 21 07:01:27 2016 [Z0][ReM][D]: Req:1680 UID:0 VirtualMachinePoolInfo invoked , -2, -1, -1, -1
Thu Jul 21 07:01:27 2016 [Z0][ReM][D]: Req:1680 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL>6</…"
Thu Jul 21 07:01:32 2016 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Thu Jul 21 07:01:38 2016 [Z0][InM][D]: Monitoring host nebula-1 (2)
Thu Jul 21 07:01:40 2016 [Z0][InM][D]: Host nebula-1 (2) successfully monitored.
Thu Jul 21 07:01:40 2016 [Z0][VMM][D]: VM 6 successfully monitored: STATE=a CPU=1.0 MEMORY=1048576 NETRX=45330252 NETTX=5290

[root@nebula-2 log]# grep -r segfault .
messages:Jul 19 09:06:18 nebula-2 kernel: oned[30185]: segfault at 7f5068217000 ip 00007f50af55d138 sp 00007f5083ffe598 error 4 in libc-2.17.so[7f50af413000+1b7000]
messages:Jul 21 07:01:53 nebula-2 kernel: oned[15578]: segfault at 7fedf42c2000 ip 00007fee40d2411b sp 00007fee21ffa598 error 4 in libc-2.17.so[7fee40bda000+1b7000]
messages-20160717-1468715401:Jul 14 23:27:42 nebula-2 kernel: oned[8051]: segfault at 7f0b14218000 ip 00007f0b5b7e1138 sp 00007f0b2fffe598 error 4 in libc-2.17.so[7f0b5b697000+1b7000]
messages-20160717-1468715401:Jul 15 13:08:02 nebula-2 kernel: attrd[486]: segfault at 1b8 ip 00007f5a2e718000 sp 00007ffe80532448 error 4 in libqb.so.0.17.1[7f5a2e709000+22000]
messages-20160717-1468715401:Jul 15 13:08:02 nebula-2 kernel: attrd[492]: segfault at 1b8 ip 00007f505e5e4000 sp 00007ffc534e1838 error 4 in libqb.so.0.17.1[7f505e5d5000+22000]

Update
Today I’ve installed the new maintenance release 5.0.2; let’s see how it flies :slight_smile:

Update #2
No more segfaults since maintenance release 5.0.2 was installed on 21 July.

Hi,

Same problem here…

OpenNebula 5.2.1 with random oned segfault crashes:

[1207677.603276] oned[54334]: segfault at 7f4c68000000 ip 00007f4c6a9da982 sp 00007f4c6cf2c850 error 4 in libc-2.23.so[7f4c6a957000+1bf000]

OS is Ubuntu 16.04, kernel 4.4.0-78, libc 2.23-ubuntu7

Any idea? Thanks!

Hi again,

I still have some segfault problems related to oned and libc :frowning:

They seem to follow some kind of time and date pattern (Tue-Wed-Tue-Wed, judging by the timestamps below).

Do you know if oned runs any other scheduled tasks around those time windows?

Thanks!

[dt jun 27 17:04:52 2017] oned[32114]: segfault at 7f0ba8000000 ip 00007f0ba96b9532 sp 00007f0babc2c850 error 4 in libc-2.23.so[7f0ba9635000+1c0000]
[dc jul 5 01:28:45 2017] oned[25058]: segfault at 7f7178000000 ip 00007f717aefe532 sp 00007f717d450850 error 4 in libc-2.23.so[7f717ae7a000+1c0000]
[dt jul 11 17:04:25 2017] oned[29868]: segfault at 7fbed0000000 ip 00007fbed2af4532 sp 00007fbed5067850 error 4 in libc-2.23.so[7fbed2a70000+1c0000]
[dc jul 19 00:15:03 2017] oned[34089]: segfault at 7f10b0000000 ip 00007f10b09a7532 sp 00007f10b2f1a850 error 4 in libc-2.23.so[7f10b0923000+1c0000]
[dt jul 25 17:04:11 2017] oned[58054]: segfault at 7f0acc000000 ip 00007f0ace43b532 sp 00007f0ad098d850 error 4 in libc-2.23.so[7f0ace3b7000+1c0000]

There is no time pattern in the code, so this is happening every 24h or something.
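
For what it’s worth, the periodic work oned does is driven by the intervals in its configuration, so you can list them and compare against those time windows (a sketch, assuming the default config paths):

grep -E 'INTERVAL|EXPIRATION' /etc/one/oned.conf /etc/one/sched.conf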

I’d really appreciate it if you could save the core and send a backtrace.
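
Once a core is captured, the backtrace could be produced along these lines (a sketch; the core path depends on your kernel.core_pattern, and installing the OpenNebula debug symbols first will make it more useful):

# with a plain core file (example path, adjust to your core_pattern)
gdb /usr/bin/oned /var/tmp/core.oned.12345 -batch -ex 'thread apply all bt full' > oned-backtrace.txt

# or, on systems using systemd-coredump
coredumpctl gdb oned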

Is this 5.4?

I’m having this problem as well; however, I can reproduce it at will: