5.4.1 memory issue in oned [closed]

Hi,

I updated our staging environment to 5.4.1 last week, and since then all 3 nodes of the cluster run out of memory within one day.

The nodes have 4G of memory each and basically no load at all; there is one offline KVM host connected (only started when it is needed for a test) and that's it. No deployments or even user actions over the weekend.

All the memory is allocated by oned, but unfortunately I did not have the debug log level active, so the logs don't show anything except "mark" and "Purging obsolete LogDB records" until the first node runs out of memory. At that point I get segfaults and core dumps from the Ruby processes in the log on the current leader, but in my opinion this is just the final symptom of running out of memory and swap.

Running onedb fsck on one node, I ran into this:
pool_control for table image_pool has last_oid 470, but it is 477

Mysql2::Error: Unknown column 'tablename' in 'where clause'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:120:in `_query'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:120:in `block in query'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:119:in `handle_interrupt'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:119:in `query'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:137:in `block in _execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/logging.rb:38:in `log_connection_yield'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:132:in `_execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/utils/mysql_mysql2.rb:38:in `block in execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `block in synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/connection_pool/threaded.rb:87:in `hold'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/utils/mysql_mysql2.rb:38:in `execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:68:in `execute_dui'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/dataset/actions.rb:1092:in `execute_dui'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/dataset/actions.rb:863:in `update'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:57:in `block (2 levels) in fix_pool_control'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:55:in `each'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:55:in `block in fix_pool_control'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:204:in `_transaction'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:179:in `block in transaction'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `block in synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/connection_pool/threaded.rb:91:in `hold'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:145:in `transaction'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:54:in `fix_pool_control'
/usr/lib/one/ruby/onedb/fsck.rb:236:in `fsck'
/usr/lib/one/ruby/onedb/onedb.rb:280:in `fsck'
/usr/bin/onedb:397:in `block (2 levels) in <main>'
/usr/lib/one/ruby/cli/command_parser.rb:449:in `run'
/usr/lib/one/ruby/cli/command_parser.rb:76:in `initialize'
/usr/bin/onedb:286:in `new'
/usr/bin/onedb:286:in `<main>'

Error running fsck version 5.4.1
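
For reference, I invoked it roughly like this against the MySQL backend (credentials and DB name are placeholders for our real values):

# oned stopped on this node first, then a backup, then the consistency check
systemctl stop opennebula
onedb backup -u oneadmin -p <password> -d opennebula
onedb fsck   -u oneadmin -p <password> -d opennebula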

MySQL itself does not report any problems and shows all tables as OK, with the largest being the logdb:
+------------+-----------------+------------+
| Database   | Table           | Size in MB |
+------------+-----------------+------------+
| opennebula | logdb           |     939.03 |
| opennebula | host_monitoring |       1.52 |
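
This kind of size listing can be produced with the usual information_schema lookup, roughly:

MariaDB [opennebula]> SELECT table_schema AS 'Database', table_name AS 'Table', ROUND((data_length + index_length) / 1024 / 1024, 2) AS 'Size in MB' FROM information_schema.TABLES WHERE table_schema = 'opennebula' ORDER BY (data_length + index_length) DESC;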

Logdb indexes:
MariaDB [opennebula]> show index from logdb;
+-------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| logdb |          0 | PRIMARY       |            1 | log_index   | A         |      442740 |     NULL | NULL   |      | BTREE      |         |               |
| logdb |          1 | timestamp_idx |            1 | timestamp   | A         |      442740 |     NULL | NULL   | YES  | BTREE      |         |               |
| logdb |          1 | fed_index_idx |            1 | fed_index   | A         |           2 |     NULL | NULL   | YES  | BTREE      |         |               |
+-------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.00 sec)
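
For reference, a quick way to see how many raft log records are actually kept (log_index is the primary key shown above):

MariaDB [opennebula]> SELECT COUNT(*), MIN(log_index), MAX(log_index) FROM logdb;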

Production, which uses the same setup but is still running 5.4.0, does not show any issue like that at all.

I have enabled debug logging on all nodes, so maybe I can soon add some more information. Has anybody observed anything like this, or does anyone have an idea?
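
For reference, enabling it is just the LOG section of /etc/one/oned.conf (stock location assumed) plus an opennebula restart on every node:

# 0 = ERROR, 1 = WARNING, 2 = INFO, 3 = DEBUG
LOG = [
  SYSTEM      = "file",
  DEBUG_LEVEL = 3
]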

Hi

In our long-running tests the virt/res size for oned is stable at ~3500M/~250M. Could you check the res size of oned in your setup? For oned alone, 4G is more than enough…
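
Something like this is enough to check it (sizes are in KiB):

# virtual and resident size of the oned process
ps -o pid,vsz,rss,args -C oned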

Hi Ruben,

In production, where we really have load and days of uptime, we see 1862M/217M.

In staging, with an uptime of 3.5h, the current leader is at 1373M/635M.
A node I restarted an hour later is already at 1274M/250M.

There is no load on these nodes, and the failure of onedb fsck makes me suspect that something is wrong in the DB and the memory fills up with some kind of backlog or transaction log. I have not found anything useful in the log so far, though.
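
If it really is the raft log piling up, the relevant knobs should be the purge/retention settings in the RAFT section of oned.conf; the values below are what I understand to be the 5.4 defaults:

# RAFT section of /etc/one/oned.conf (defaults as far as I can tell, please correct me)
RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 500000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 2500,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 450
]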

2h later:
leader: 1393M/954M
I think it's DB related, but until I can get this fixed, or at least know the root cause, it blocks our production update to 5.4.1.

I have now recreated the staging cluster from the current production DB (minus VMs and hosts), and fsck runs through now. I have to wait a bit longer to see whether the memory behaves normally.

Edit: it does not look like anything has changed fundamentally with the fresh cluster. The staging environment on 5.4.1 already uses more memory than production on 5.4.0. Both environments are based on the same data, although staging is stripped of VMs and hosts.

After roughly 20h, oned now uses 4439M/2729M.

The behavior is the same after replacing all nodes as well: after roughly one day all memory is tied up in oned and all swap is used too. I don't see anything in the log that would indicate an error. I'm open to any ideas, as this is blocking our production upgrade to 5.4.1.
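
To have some numbers to correlate with the debug log, I am now sampling oned memory with a simple loop along these lines (log path is arbitrary):

# sample oned VSZ/RSS (KiB) every 10 minutes
while true; do
    echo "$(date +'%F %T') $(ps -o vsz=,rss= -C oned)"
    sleep 600
done >> /var/tmp/oned-mem.log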

Hi,

OK, I'm trying to reproduce this. Could you send:

1.- The output of ps aux --sort -rss for the hosts in the cluster (please specify the role of each host: leader/follower)

2.- How many hosts and VMs are running in the system

3.- What kind of operations are you performing in the cluster

4.- The output of onezone show <zone_id>

5.- oned.log for the nodes of the cluster (one way to collect all of this is sketched below)
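
Something along these lines should collect most of it (adjust the zone id and log path to your setup):

ps aux --sort -rss | head -n 15        # 1) top memory consumers, note leader/follower
onehost list && onevm list             # 2) hosts and VMs in the zone
onezone show 0                         # 4) zone status / RAFT state (0 = default zone id)
tail -n 1000 /var/log/one/oned.log     # 5) recent oned log, per node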

If any of these contain private data, feel free to PM me.

Cheers

Thank you very much - PM sent.

My issue with the oned memory is not caused by OpenNebula but by some OS libraries. Thanks Ruben for your help in tracking it down.

Thanks for reporting back! Just to point out that a couple of memory leaks have been solved thanks to the feedback provided by sulaweyo :wink: Anybody interested in those fixes needs to update to the one-5.4 branch upstream.

Cheers