Improvement suggestion

So, hi, I’ll keep this short.

I’m not mentioning a version since I’ve used OpenNebula since 3.x and it’s always been the same.

Just now I was dealing with a hardware failure. It turned out to take longer to sit through the monitor and action timeouts, and to work around transitions blocked between effectively identical states, than it took to fix the second node and bring it back. There is massive potential in improving the valid state transitions and the available states for nodes in ERROR. I know I offered to work on some of that about 10 years ago, but to be honest, I won't; I can't do that anymore due to health constraints.
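
To make the ERROR part concrete: what I'd like is an explicit transition table where ERROR is not a dead end. A minimal sketch in Python, with completely made-up state names and structure (this is not the actual ONE state machine, just the shape of the idea):

```python
# Hypothetical transition allowlist, NOT OpenNebula's real state machine.
# The point: a node in ERROR can move directly to any recovery-relevant
# state instead of having to pass through fresh monitor/timeout cycles.

ALLOWED_TRANSITIONS = {
    "INIT":      {"MONITORED", "ERROR"},
    "MONITORED": {"ERROR", "DISABLED", "OFFLINE"},
    "ERROR":     {"MONITORED", "DISABLED", "OFFLINE"},  # ERROR is not a trap
    "DISABLED":  {"MONITORED", "OFFLINE"},
    "OFFLINE":   {"MONITORED"},
}

def can_transition(current: str, target: str) -> bool:
    """Allow a direct move, and treat equal states as a no-op, not a blocker."""
    if current == target:
        return True
    return target in ALLOWED_TRANSITIONS.get(current, set())
```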

Nonetheless - with the advent of the modern scheduling features of 7.0, it's high time to start cleaning up here. In particular, a manually instructed failover of ALL VMs on an unreachable or hung node, no matter what state they are in, shouldn't need more than 10 seconds to be scheduled and start running.

Realistically, that's not reliably possible in ONE right now. But it would help a lot.
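
What I mean by the manual case, as a sketch (the names and the schedule_recovery callback are placeholders I invented, not the ONE API):

```python
# Illustrative only: once an operator declares a host dead, fail over
# every VM on it immediately, with no per-VM state gating and no waiting
# for monitor timeouts. All names here are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class VM:
    vm_id: int
    state: str  # whatever state the VM happens to be in

def failover_host(host_vms: list[VM],
                  schedule_recovery: Callable[[int], None]) -> None:
    for vm in host_vms:
        # The operator has already said the host is gone; every VM gets
        # queued for recovery right away, regardless of its state.
        schedule_recovery(vm.vm_id)
```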

Examples, to clarify:

  • filter outgoing monitor runs and commands/actions: do not send requests to nodes that are known to be down, EXCEPT those whose only purpose is to determine exactly that
  • do not queue more than one instance of the same request
  • do not let one request block unrelated, independent requests
  • do not block something like a VM POWEROFF_HARD just because an information driver is busy collecting other data

etc. This needs a generalizing approach, because there are hundreds of potential deadlocks in the current state of affairs. Some are very specific to drivers and implementation, but (imho) all of them would be solved by improvements in the bits I pointed at above. A rough sketch of what I mean follows below.
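
A minimal sketch of that dispatch behaviour in Python; everything here (class names, fields, the monitor/action split) is my own made-up illustration, not how the ONE daemon is actually structured:

```python
# Sketch: filter requests to down hosts, deduplicate queued requests,
# and keep independent work on separate queues per (host, kind) so a
# hung monitor can never block an action like POWEROFF_HARD.

import queue
import threading
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    host: str
    kind: str                     # "monitor" or "action"
    op: str                       # e.g. "POWEROFF_HARD", "collect_info"
    liveness_probe: bool = False  # True if its only job is "is the host up?"

class Dispatcher:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._down_hosts: set[str] = set()
        self._pending: set[Request] = set()  # dedup key: the request itself
        self._queues: dict[tuple[str, str], queue.Queue] = {}

    def mark_down(self, host: str) -> None:
        with self._lock:
            self._down_hosts.add(host)

    def submit(self, req: Request) -> bool:
        with self._lock:
            # Filter: nothing goes to a down host except liveness probes.
            if req.host in self._down_hosts and not req.liveness_probe:
                return False
            # Dedup: never queue more than one of the same request.
            if req in self._pending:
                return False
            self._pending.add(req)
            # One queue per (host, kind): requests to other hosts, and
            # actions vs. monitors on the same host, stay independent.
            q = self._queues.setdefault((req.host, req.kind), queue.Queue())
        q.put(req)
        return True

    def done(self, req: Request) -> None:
        with self._lock:
            self._pending.discard(req)
```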

HTH

Hello Florian (@darkfader),

Thanks for your comments. However, before sharing this feedback with the engineering team, I would like to clarify the following:

  • How long did the system take to detect and act on the node failure, versus how long it took you to bring the node back online manually?
  • How often does this kind of delay occur in your environment?
  • Would you say that optimizing transitions for nodes in the ERROR state, as you suggest, could consistently shave off, say, 10 seconds or more from VM failover time?

I’m aware that you collaborated heavily in the past and that your situation is different now, so no worries on that, but I want to understand it a bit better in order to share this feedback.

Regards,

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.