So, hi, I’ll keep this short.
I’m not mentioning a version since I’ve used OpenNebula since 3.x and it’s always been the same.
I was just fixing a hardware failure. Turns out it took longer to get through the monitor and action timeouts, and the transitions blocked between states that are actually equal, than it took to fix the second node and have it come back. There would be massive potential if you improved the valid state transitions and available states for nodes in ERROR. I know I offered to work on some of that like 10 years ago, but to be honest, I won’t. I can’t even do that anymore due to health constraints.
Nonetheless - with the advent of the modern scheduling features of 7.0, it’s high time you start cleaning up here. Especially when manually instructed, a failover of ALL VMs on an unreachable or hung node, no matter what state they are in, doesn’t need to take more than 10 seconds to be scheduled and start running.
Realistically in ONE that’s not reliably possible now. But it would help LOTS.
Examples, to clarify:
- filter outgoing monitors and commands/actions: do not run requests to nodes that are down, EXCEPT those that only determine THAT, UNLESS
- do not queue more than one of the same request
- do not block independent requests
- do not block things like a VM POWEROFF_HARD because an information driver is looking for other info
etc. This needs a general approach, because there are 100s of deadlocks in the current state of affairs. Some are very specific to drivers and implementation, but (imho) all would be solved by improvements in those bits that I initially pointed at.
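To make the list above a bit more concrete, here is a rough sketch in Python. It is NOT OpenNebula code and does not use any OpenNebula API; `Request`, `Dispatcher`, the `PROBE` kind and the per-(host, kind) "lanes" are all names I made up for illustration. It only shows the three rules: filter requests to down nodes except the liveness check, never queue the same request twice, and keep independent request kinds from blocking each other.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical request kind: the only one allowed to reach a node marked down.
PROBE = "PROBE"


@dataclass(frozen=True)  # frozen => hashable, so identical requests compare equal
class Request:
    host: str
    kind: str         # e.g. "PROBE", "MONITOR", "POWEROFF_HARD" (illustrative)
    target: str = ""  # e.g. a VM id; empty for host-level requests


class Dispatcher:
    def __init__(self):
        self.down = set()     # hosts currently considered down
        self.lanes = {}       # (host, kind) -> deque of queued requests
        self.pending = set()  # every request that is queued right now

    def mark_down(self, host):
        """Mark a host down and drop everything queued for it except probes."""
        self.down.add(host)
        for (h, kind), lane in self.lanes.items():
            if h == host and kind != PROBE:
                self.pending.difference_update(lane)
                lane.clear()

    def mark_up(self, host):
        self.down.discard(host)

    def submit(self, req):
        """Queue a request; return False if it was filtered out."""
        if req.host in self.down and req.kind != PROBE:
            return False  # rule 1: nothing but probes goes to a down node
        if req in self.pending:
            return False  # rule 2: never queue more than one of the same
        self.pending.add(req)
        self.lanes.setdefault((req.host, req.kind), deque()).append(req)
        return True

    def next_batch(self):
        """Take the head of every lane (rule 3: lanes never wait on each other)."""
        batch = []
        for lane in self.lanes.values():
            if lane:
                req = lane.popleft()
                self.pending.discard(req)
                batch.append(req)
        return batch


d = Dispatcher()
d.submit(Request("node1", "MONITOR"))
d.submit(Request("node1", "MONITOR"))                # duplicate, ignored
d.submit(Request("node1", "POWEROFF_HARD", "vm42"))  # own lane, not stuck behind MONITOR
d.mark_down("node2")
d.submit(Request("node2", "MONITOR"))                # filtered, node2 is down
d.submit(Request("node2", PROBE))                    # allowed, it only checks liveness
```

With one lane per host and request kind, a slow information driver only ever holds up its own lane, so something like a hard poweroff for a VM never sits behind it. A real implementation would obviously have to deal with timeouts, locking and driver specifics that this sketch ignores.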
HTH