Home cluster: best practice to protect orchestrator availability?

Hi,

I’ve been tinkering with an OpenNebula cluster on top of Gluster shared storage in my home lab (read: on the cheap ;-)). The Gluster nodes (SBCs with an HDD and an SSD for cache) are in replica 3, and after some minor issues I think that setup is pretty robust. I have a VM with the orchestration setup (Sunstone etc.) on a plain vanilla KVM PC that controls one other PC as an OpenNebula node. Now I want to bring the first PC into the cluster, but obviously that means the orchestrator would then be running inside the cluster it manages. This opens me up to all sorts of catch-22 scenarios in case either that VM or its hosting hypervisor fails.

So I’m looking for best practices: What is the best/simplest way to ensure the orchestrator gets revived in case of failure? Here are some options I’ve considered:

  • Full HA with a backup admin-VM on the other PC, heartbeat and all that (feels like overkill, it’s ok if the node is gone for a few minutes)
  • Keeping the admin-VM outside of the cluster (inefficient)
  • Emergency script to start the image from the KVM-node directly (without using OpenNebula controls) (would work, but how?)
  • Bringing all the moving parts into a Docker Swarm maybe? (I have a three-node Docker Swarm spread out over both PCs, containers get revived pretty quickly)

Thanks for your input,
Florian

Replying to self here :-)

It occurred to me that the management node (Sunstone etc.) could maybe be set up on an SBC, a Raspberry Pi or similar. Has anyone here tried that before?

Florian

Hi @florianoverkamp,

You can take a look at the Front-End High Availability setup, described on the "OpenNebula Front-end HA" page of the OpenNebula 6.0.2 documentation.

If you decide to add the Raspberry Pi, you could deploy an HA environment that handles both VM failures and hypervisor failures by running 3 front-end nodes (1 on the Raspberry Pi and 1 on each hypervisor node).
This way, if one of the nodes (either a VM or an entire hypervisor node) fails, you still have two of them up, which is enough to keep the cluster running.
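
To make the "two out of three" reasoning concrete, here is a minimal sketch of the arithmetic, assuming the front-end HA relies on a majority-based (Raft-style) consensus as the HA documentation describes. The function names are just illustrative, not part of OpenNebula:

```python
def quorum(servers: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can be lost while a majority is still up."""
    return servers - quorum(servers)


if __name__ == "__main__":
    # Compare a few cluster sizes: servers, majority needed, failures tolerated.
    for n in (1, 2, 3, 5):
        print(f"{n} server(s): quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With 3 front-end nodes the majority is 2, so one failure is tolerated. With only 2 nodes the majority is also 2, so losing either one stops the cluster, which is why the third node on the Raspberry Pi matters.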
