Slow query detected in OpenNebula Sunstone

Hi,
I have configured OpenNebula Sunstone with a MySQL backend, running HA with 3 servers. Sometimes I get this error in the Sunstone server log: “Slow query (3.53s) detected: DELETE FROM logdb WHERE timestamp > 0 AND log_index >= 0 AND fed_index = -1 AND log_index < ( SELECT MIN(i.log_index) FROM ( SELECT log_index FROM logdb WHERE fed_index = -1 AND timestamp > 0 AND log_index >= 0 ORDER BY log_index DESC LIMIT 500000 ) AS i) LIMIT 100000”. The HA service is also migrating the master Virtual IP to another server frequently.

Regards
Mosharaf

As the log shows, you are dealing with a slow query. This means that your database is not performing as expected, and it might be due to a slow storage backend or lack of resources.

Is there any recommendation for running MySQL with OpenNebula? Our storage/DB cluster is currently working fine for other services. In my understanding, everything works normally at first, but the problems start as the database size increases.

If you have any recommendations, please share them so I can apply them and check.

Recommendations for OpenNebula’s database are not much different from those for any other database. You can tweak a lot of parameters, but you will see the biggest improvement by increasing the access speed to your data. By default MariaDB only uses 128MB for innodb_buffer_pool_size. OpenNebula’s database is not big, so you can easily fit it entirely in RAM by tuning this value. If you are sharing the server with other databases, you will also have to take them into account.
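
For example, a minimal tuning sketch (the file path and the 2G value are only illustrative; adjust them to your distribution and to the actual size of your database):

# e.g. /etc/mysql/mariadb.conf.d/50-server.cnf -- exact path depends on your distribution
[mysqld]
# large enough to keep the whole OpenNebula database (including logdb) in RAM
innodb_buffer_pool_size = 2G

Restart the database service after the change, and remember the buffer pool is shared with any other databases running on the same server.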

I have the same issue:

Slow query (1.65s) detected: DELETE FROM logdb WHERE timestamp > 0 AND log_index >= 0 AND fed_index = -1 AND log_index < (  SELECT MIN(i.log_index) FROM (    SELECT log_index FROM logdb WHERE fed_index = -1 AND      timestamp > 0 AND log_index >= 0       ORDER BY log_index DESC LIMIT 500000  ) AS i) LIMIT 100000

But I have innodb_buffer_pool_size set to 2G. When the query runs, it mostly uses CPU, not disk IO.
An EXPLAIN shows the following:

+----+-------------+------------+------------+-------------+-------------------------------------+-----------------------+---------+-------+--------+----------+-----------------------------------------------------+
| id | select_type | table      | partitions | type        | possible_keys                       | key                   | key_len | ref   | rows   | filtered | Extra                                               |
+----+-------------+------------+------------+-------------+-------------------------------------+-----------------------+---------+-------+--------+----------+-----------------------------------------------------+
|  1 | PRIMARY     | logdb      | NULL       | index_merge | PRIMARY,fed_index_idx,timestamp_idx | fed_index_idx,PRIMARY | 9,4     | NULL  |    134 |    50.00 | Using intersect(fed_index_idx,PRIMARY); Using where |
|  2 | SUBQUERY    | <derived3> | NULL       | ALL         | NULL                                | NULL                  | NULL    | NULL  |  68213 |   100.00 | NULL                                                |
|  3 | DERIVED     | logdb      | NULL       | ref         | PRIMARY,fed_index_idx,timestamp_idx | fed_index_idx         |  5      | const | 272853 |    25.00 | Using where                                         |
+----+-------------+------------+------------+-------------+-------------------------------------+-----------------------+---------+-------+--------+----------+-----------------------------------------------------+

Maybe there is a way to optimize the query?
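
For reference, a plan like the one above can be obtained by wrapping the statement from the log in EXPLAIN (EXPLAIN works on DELETE statements in MySQL 5.6+ and MariaDB 10.0+):

EXPLAIN DELETE FROM logdb
WHERE timestamp > 0 AND log_index >= 0 AND fed_index = -1
  AND log_index < (
    SELECT MIN(i.log_index) FROM (
      SELECT log_index FROM logdb
      WHERE fed_index = -1 AND timestamp > 0 AND log_index >= 0
      ORDER BY log_index DESC LIMIT 500000
    ) AS i)
LIMIT 100000;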

We’re facing a similar issue with the same symptoms.
However, in our case it seems obvious that it’s a swapping issue.

oned is using almost 20GiB of RAM, which is a bit surprising.
RAM usage spiked in the last week of 2018 and has been very high ever since (all 8GiB of RAM + 12GiB of swap).
There were no changes or updates (other than a security kernel update) on the frontend.
(By the way, for the sake of completeness, we did test with the previous kernel and the behaviour is the same.)

Our deployment didn’t increase as much in size as RAM usage would suggest.

As soon as the frontend is booted, we see a couple of hundred megabytes of RAM being consumed per second.
When RAM is fully used, we start seeing slow queries (which makes sense).
However, it is unclear why the frontend would use that much RAM.

We have 570 VMs and 33 hosts.

We already tried increasing innodb_buffer_pool_size, but, as expected, that doesn’t change anything.

Any troubleshooting hints are more than welcome.

Thanks guys, we are already working on improving the performance of this query, as it seems that the nested SELECT in the DELETE statement is not using the existing indexes.
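
If anyone wants to check where the time goes on their own deployment, a simple experiment is to run just the nested part of the statement on its own and time it (the query text below is taken verbatim from the log above):

SELECT MIN(i.log_index) FROM (
  SELECT log_index FROM logdb
  WHERE fed_index = -1 AND timestamp > 0 AND log_index >= 0
  ORDER BY log_index DESC LIMIT 500000
) AS i;

If this SELECT alone accounts for most of the reported time, the derived-table scan is the bottleneck rather than the outer DELETE.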

As for the memory usage, I guess your 20G could be the virtual size; you can get the resident size with, for example:

 ps -a -o pid,rss,vsz,cmd | grep oned

@ruben
The output of the command ps -ax -o pid,rss,vsz,cmd | grep oned is:

25242 6830256 19803136 /usr/bin/oned -f

Not really sure what to read from this, but this output might be more relevant:

root@frontend:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           7807        7592         120           0          94          30
Swap:         32767       11071       21696
root@frontend:~# systemctl stop opennebula
root@frontend:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           7807         772        6909           0         124        6834
Swap:         32767         726       32041
root@frontend:~#

Yes, this is 6GB of actual (resident) memory use by oned, not 20GB.

About the leader migration, you can increase the election timeout.
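
If I remember correctly the setting is ELECTION_TIMEOUT_MS in the RAFT section of /etc/one/oned.conf; please double-check the attribute name and default against the oned.conf shipped with your version. A sketch, with an example value only:

# /etc/one/oned.conf -- RAFT section (verify against your installed version)
RAFT = [
    # ... other RAFT attributes left at their defaults ...
    ELECTION_TIMEOUT_MS = 5000   # example value; raise it so a slow query does not trigger a new election
]

The change has to be applied on all HA servers, and OpenNebula restarted, for it to take effect.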

@ruben

Please take into account the amount of RAM and swap used before and after stopping the OpenNebula service.
The difference is roughly 15GiB for that specific output.

I agree that oned is using 6GiB of RAM, but imho, that’s just because the server has no more. If it had 32GiB, I’m pretty sure it would shoot up higher.

Any troubleshooting help would be highly appreciated as we’re starting to face issues with automated tasks since ONE can become rather slow (due to paging).

Issue prior to memory upgrade:

root@frontend:~# cat /proc/sys/vm/swappiness
1
root@frontend:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           7807        7606         104           0          96          15
Swap:         32767       13547       19220
root@frontend:~# ps -ax -o pid,rss,vsz,cmd | grep oned
4633 6728560 20001408 /usr/bin/oned -f
root@frontend:~# systemctl stop opennebula
root@frontend:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           7807         965        6539           0         301        6586
Swap:         32767        1778       30989
root@frontend:~#

Issue after memory upgrade:

root@frontend:~# uptime
11:14:58 up 4 days, 16:39, 2 users, load average: 1.88, 1.55, 1.33
root@frontend:~# cat /proc/sys/vm/swappiness
1
root@frontend:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          31999       27978         751           0        3269        3561
Swap:         32767         813       31954
root@frontend:~# ps -ax -o pid,rss,vsz,cmd | grep oned
2590 26059884 28377016 /usr/bin/oned -f
17786 1092 13136 grep --color=auto oned
root@frontend:~# systemctl stop opennebula
root@frontend:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          31999        2289       26090           0        3618       29250
Swap:         32767         150       32617
root@frontend:~#

However, upgrading to OpenNebula 5.8 resolved the problem.

Cheers!

We are having the same issue at the moment with our OpenNebula cluster (not the memory issue, but the one described in the original post).

We get a Slow Query error every 2 minutes:
Wed Mar 27 12:21:13 2019 [Z0][SQL][W]: Slow query (15.05s) detected: DELETE FROM logdb WHERE timestamp > 0 AND log_index >= 0 AND fed_index = -1 AND log_index < ( SELECT MIN(i.log_index) FROM ( SELECT log_index FROM logdb WHERE fed_index
= -1 AND timestamp > 0 AND log_index >= 0 ORDER BY log_index DESC LIMIT 100000 ) AS i) LIMIT 100000
and multiple times a day our Raft cluster fails over to one of the followers. We tried raising the timeout after receiving that recommendation from support, but to no avail.

Mosharaf, Jean, are you still having this issue? Have you found the root cause or at least a fix?

This query has been improved in OpenNebula 5.8, and its execution time has been considerably reduced. I encourage you to try out the new version.

We tweaked the database settings so the database now fits entirely in RAM, and the query is already much faster. We no longer get constant leadership elections, so the worst is averted. We’ll upgrade to 5.8 after some migrations we are busy with.

I read in a blog post that the database issues have been solved in version 5.8. I am now planning to move to the new version and hope it will solve the problem.

Will update here after the upgrade.

We had this problem too; the HA leader was migrating every two minutes. We use OpenNebula 5.8.2.

We solved this by tweaking innodb_buffer_pool_size on each frontend:

innodb_buffer_pool_size = 2G
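
To confirm the new value is actually in effect after restarting MariaDB, you can check it from the SQL prompt (the value is reported in bytes):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';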

It’s been a week now and the leader has not migrated once. We still have Slow query warnings in the log but the query time has gone from 4s to 0.5s.