CONF 2017 / Storage Management: biggest pain to report in daily use of ONE


Ahead of the next ONE Conf, I wanted to share feedback after a few months of usage (or attempted usage) of ONE on a KVM/NFS setup, v5.0.2 as displayed by Sunstone.

I selected Support as this post's category, hoping that workarounds to my problem exist; otherwise it may fall back to a development request, which could hopefully be discussed in CONF 2017 sessions.

In short, as a growing but still modest-size cloud computing company, we need to manage scarce resources and grow them as needed.

RAM and CPU are the easiest part to manage, because the Linux/KVM/ONE stack handles them beautifully. Network is pretty neat too :slight_smile: Thanks for the great work here, really!

However, storage is a pain, for the following reasons:

  1. Customers first: storage is their most critical resource, absolutely no risk is acceptable there
  2. Performance: it’s the slowest resource to clone and secure even on multi-Gbps networks
  3. Reliability and dependability: this stack has many flaws, to name a few:
  • hardware fails oftentimes;
  • qcow2 files do not auto-shrink and waste space big time unless you stop the VM to reclaim it (or do special configuration at the beginning);
  • VM snapshots disappear without warning when you power off the VM from Sunstone;
  • “live” VM or disk snapshots actually hang the VM during snapshot creation, which can take 15 minutes even on copy-on-write storage.
  4. ONE clients: basic storage-space management screens are lacking (hint: a list of which VMs/images use how much space in a datastore is a badly needed feature!). After months of usage, the datastore pages become a big list of images in which it is hard to tell useful ones from obsolete ones (despite the USED status), with no clue how much space they really use on disk (a 10G qcow2 image can use 27G on disk, even without snapshots).
  5. ONE core: basic and advanced storage management features are lacking: the ability to move images across datastores (today you have to copy, delete the original, rename, re-attach to the VM, and so on), including live migration; and the ability to balance datastore usage by live-moving files when a usage ceiling is reached (a VMware DRS-like feature).
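To put numbers on the size gap in point 4: `qemu-img info --output=json` reports the bytes actually allocated on disk (`actual-size`) alongside the virtual size. A minimal sketch, with an illustrative JSON sample standing in for a real call:

```python
# Minimal sketch: compare the allocated bytes on disk ("actual-size") with the
# "virtual-size" that `qemu-img info --output=json` reports for an image.
# SAMPLE below is illustrative; in practice you would capture the JSON with
# subprocess.run(["qemu-img", "info", "--output=json", path], ...).
import json

def allocated_vs_virtual(info_json: str) -> tuple[int, int]:
    """Return (allocated bytes on disk, virtual size in bytes)."""
    info = json.loads(info_json)
    return info["actual-size"], info["virtual-size"]

SAMPLE = '{"virtual-size": 10737418240, "actual-size": 28991029248, "format": "qcow2"}'
actual, virtual = allocated_vs_virtual(SAMPLE)
print(actual // 2**30, virtual // 2**30)  # → 27 10
```

Summing `actual-size` over the files in a datastore directory would give the real usage figure that Sunstone does not show today.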

If anyone has a pointer to existing ONE features or workarounds for points 4 and 5, please reply!

Otherwise, I strongly recommend these features for the next development discussions. I believe they could help establish a lasting leadership for ONE in this competitive field. And point 4 seems so easy to do, as all the information is already in Sunstone…

Thanks and regards,

Kipitapp cloud support team.

Some of the pain points you see are not related to OpenNebula, but to NFS and its inability to perform some of the things you need from it. We have had quite good results from coupling OpenNebula with a distributed filesystem (we use LizardFS, but StorPool is another quite good one with excellent integration; Ceph is also quite good).
On your points:

  1. I agree totally. User data is critical. That’s why we don’t use NFS :slight_smile:
  2. NFS is certainly unable to perform instantaneous snapshots, but DFSs are…
  3. You have conflated several points in one:
  • hardware fails sometimes: true. That’s why you should use replication instead of a single point of failure;
  • qcow2 images do not auto-shrink: true as well. It’s actually qcow2’s fault, not OpenNebula’s; if you need to use qcow2, you have to use a shrink script that checks when a VM is not in use and repacks the image.
  • VM snapshots disappear: this is how qemu/kvm handles snapshots (mediated by libvirt). Again, it is not related to OpenNebula. Other virtualization platforms do the same thing if they use KVM.
  • disk snapshots hang the VM: yes, because NFS does not know how to do an atomic snapshot. LizardFS, Ceph, and StorPool all do.
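The shrink script mentioned for the qcow2 point can be sketched as follows. A minimal sketch, not a production tool: it assumes the VM owning the image is already stopped (check with `onevm` first), and relies on `qemu-img convert` rewriting only allocated, non-zero clusters into a compact copy:

```python
# Hedged sketch of a qcow2 "shrink" (repack): rewrite the image with
# `qemu-img convert`, which drops unused and zero clusters, then atomically
# swap the compact copy into place. Run only while the owning VM is stopped.
import os
import subprocess

def repack_cmd(src: str, dst: str) -> list[str]:
    """Build the qemu-img command that rewrites `src` into a compact qcow2 `dst`."""
    return ["qemu-img", "convert", "-O", "qcow2", src, dst]

def shrink_qcow2(path: str) -> None:
    tmp = path + ".compact"
    subprocess.run(repack_cmd(path, tmp), check=True)
    os.replace(tmp, path)  # atomic swap on the same filesystem
```

The path handling (same-filesystem temp file, preserving ownership for oneadmin) is left out here and matters in practice.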
  4. Version 4 in effect had little control over images. ONE 5 adds labels, which allow showing only the images you need. If you have lots of “orphan” images, you can use a culling script that deletes images satisfying some condition. It is very easy to do: we simply run an oneimage list, parse the XML output, and delete what is needed.
    As for space, check that you are using the right remotes. The datastore “stat” remote should return the true size of your image; look at its output to see why it does not. (Some more details on the storage driver functions are here: )
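The culling approach described above (list, parse XML, delete) can be sketched like this. A hedged sketch: the `tmp-` name prefix is a hypothetical site convention, and the sample XML is a cut-down stand-in for real `oneimage list -x` output:

```python
# Hedged sketch of an image-culling selector: parse the XML pool from
# `oneimage list -x` and pick images matching a condition (here: no running
# VMs use them AND the name carries a hypothetical "tmp-" prefix).
import xml.etree.ElementTree as ET

def orphan_image_ids(xml_text: str, prefix: str = "tmp-") -> list[int]:
    root = ET.fromstring(xml_text)
    ids = []
    for img in root.findall("IMAGE"):
        name = img.findtext("NAME", "")
        running = int(img.findtext("RUNNING_VMS", "0"))
        if running == 0 and name.startswith(prefix):
            ids.append(int(img.findtext("ID")))
    return ids

# Cut-down sample of the image pool XML (real output has many more fields).
SAMPLE = """<IMAGE_POOL>
  <IMAGE><ID>7</ID><NAME>tmp-build</NAME><RUNNING_VMS>0</RUNNING_VMS></IMAGE>
  <IMAGE><ID>8</ID><NAME>prod-db</NAME><RUNNING_VMS>2</RUNNING_VMS></IMAGE>
</IMAGE_POOL>"""
print(orphan_image_ids(SAMPLE))  # → [7]
```

Each selected ID would then be passed to `oneimage delete` (ideally after a dry-run listing).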
  5. Some of the features you ask for would be nearly impossible to implement across different datastores. For example, you cannot live-migrate an image from an NFS datastore to an SSH one (just one of many possible examples). If your datastores are all under a single kind of transfer manager it is possible; we do it all the time with LizardFS: we can do live migrations and image copies with no stops (we don’t do a simple “move” because the datastores have different goal/replication properties, and a delete must be explicitly requested by the user).
    The same applies to “balancing datastore usage”: with a DFS you don’t need to balance; you just see a single data pool, and you partition into datastores only for quota/access-control needs.
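For the manual copy-then-delete “move” between compatible datastores, the stock CLI can already be scripted. A hedged sketch (it assumes `oneimage clone` accepts a `--datastore` target on your ONE version; verify before relying on it, and only delete once the clone reaches READY):

```python
# Hedged sketch: build the two CLI commands for a copy-then-delete image
# "move". Assumes `oneimage clone ... --datastore <id>` is available on your
# OpenNebula version; commands are only constructed here, not executed.
def move_image_cmds(image_id: int, new_name: str, target_ds: int) -> list[list[str]]:
    return [
        ["oneimage", "clone", str(image_id), new_name, "--datastore", str(target_ds)],
        # Run only after the clone reports READY:
        ["oneimage", "delete", str(image_id)],
    ]

print(move_image_cmds(42, "web-disk-moved", 103))
```

Each command list can be run with `subprocess.run(cmd, check=True)`, with a state check on the clone in between.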
    Hope to have helped a little bit,
    Carlo Daffara