Improvement suggestion

So, hi, I’ll keep this short.

I’m not mentioning a version since I’ve used OpenNebula since 3.x and it’s always been the same.

Just now I was dealing with a hardware failure. It turned out to take longer to sit through the monitor and action timeouts, and to work around transitions blocked between effectively identical states, than it took to fix the second node and bring it back. There is massive potential in improving the valid state transitions and the available states for nodes in ERROR. I know I offered to work on some of that about 10 years ago, but to be honest, I won't; I can't do that anymore due to health constraints.
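
To make the ERROR part concrete: what I'd like is an explicit transition table where ERROR is not a dead end. A minimal sketch in Python, with completely made-up state names and structure (this is not the actual ONE state machine, just the shape of the idea):

```python
# Hypothetical transition allowlist, NOT OpenNebula's real state machine.
# The point: a node in ERROR can move directly to any recovery-relevant
# state instead of having to pass through fresh monitor/timeout cycles.

ALLOWED_TRANSITIONS = {
    "INIT":      {"MONITORED", "ERROR"},
    "MONITORED": {"ERROR", "DISABLED", "OFFLINE"},
    "ERROR":     {"MONITORED", "DISABLED", "OFFLINE"},  # ERROR is not a trap
    "DISABLED":  {"MONITORED", "OFFLINE"},
    "OFFLINE":   {"MONITORED"},
}

def can_transition(current: str, target: str) -> bool:
    """Allow a direct move, and treat equal states as a no-op, not a blocker."""
    if current == target:
        return True
    return target in ALLOWED_TRANSITIONS.get(current, set())
```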

Nonetheless - with the advent of the modern scheduling features of 7.0, it's high time to start cleaning up here. In particular, a manually instructed failover of ALL VMs on an unreachable or hung node, no matter what state they are in, shouldn't need more than 10 seconds to be scheduled and start running.

Realistically, that's not reliably possible in ONE right now. But it would help a lot.
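
What I mean by the manual case, as a sketch (the names and the schedule_recovery callback are placeholders I invented, not the ONE API):

```python
# Illustrative only: once an operator declares a host dead, fail over
# every VM on it immediately, with no per-VM state gating and no waiting
# for monitor timeouts. All names here are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class VM:
    vm_id: int
    state: str  # whatever state the VM happens to be in

def failover_host(host_vms: list[VM],
                  schedule_recovery: Callable[[int], None]) -> None:
    for vm in host_vms:
        # The operator has already said the host is gone; every VM gets
        # queued for recovery right away, regardless of its state.
        schedule_recovery(vm.vm_id)
```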

Examples, to clarify:

  • filter outgoing monitor runs and commands/actions: do not send requests to nodes that are known to be down, EXCEPT those whose only purpose is to determine exactly that
  • do not queue more than one instance of the same request
  • do not let one request block unrelated, independent requests
  • do not block something like a VM POWEROFF_HARD just because an information driver is busy collecting other data

etc. This needs a generalizing approach, because there are hundreds of potential deadlocks in the current state of affairs. Some are very specific to drivers and implementation, but (imho) all of them would be solved by improvements in the bits I pointed at above. A rough sketch of what I mean follows below.
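
A minimal sketch of that dispatch behaviour in Python; everything here (class names, fields, the monitor/action split) is my own made-up illustration, not how the ONE daemon is actually structured:

```python
# Sketch: filter requests to down hosts, deduplicate queued requests,
# and keep independent work on separate queues per (host, kind) so a
# hung monitor can never block an action like POWEROFF_HARD.

import queue
import threading
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    host: str
    kind: str                     # "monitor" or "action"
    op: str                       # e.g. "POWEROFF_HARD", "collect_info"
    liveness_probe: bool = False  # True if its only job is "is the host up?"

class Dispatcher:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._down_hosts: set[str] = set()
        self._pending: set[Request] = set()  # dedup key: the request itself
        self._queues: dict[tuple[str, str], queue.Queue] = {}

    def mark_down(self, host: str) -> None:
        with self._lock:
            self._down_hosts.add(host)

    def submit(self, req: Request) -> bool:
        with self._lock:
            # Filter: nothing goes to a down host except liveness probes.
            if req.host in self._down_hosts and not req.liveness_probe:
                return False
            # Dedup: never queue more than one of the same request.
            if req in self._pending:
                return False
            self._pending.add(req)
            # One queue per (host, kind): requests to other hosts, and
            # actions vs. monitors on the same host, stay independent.
            q = self._queues.setdefault((req.host, req.kind), queue.Queue())
        q.put(req)
        return True

    def done(self, req: Request) -> None:
        with self._lock:
            self._pending.discard(req)
```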

HTH

Hello Florian (@darkfader),

Thanks for your comments. However, before sharing this feedback with the engineering team, I would like to clarify the following:

  • How long did the system take to detect and act on the node failure, versus how long it took you to bring the node back online manually?
  • How often does this kind of delay occur in your environment?
  • Would you say that optimizing transitions for nodes in the ERROR state, as you suggest, could consistently shave off, say, 10 seconds or more from VM failover time?

I’m aware that you collaborated heavily in the past and that your situation is different now, so no worries on that, but I want to understand it a bit better in order to share this feedback.

Regards,

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.