CEPH parameters clarification

Hi,

I’m trying to configure Ceph storage as system_ds and image_ds for the first time, and some parts of the documentation are a little confusing to me:

http://docs.opennebula.org/5.4/deployment/open_cloud_storage_setup/ceph_ds.html#ceph-datastore

In Datastore Layout I can see the following warning:

In this case context disk and auxiliary files (deployment description and checkpoints) are stored locally in the nodes.

Where on the nodes is this data stored? If it is stored locally, can live migration still be performed?

Regarding the system datastore:
ceph (only with local FS on the DS directory) --> As Ceph is a distributed object store, I cannot understand what a local filesystem on the datastore directory means…
shared for shared transfer mode (only with shared FileSystem) --> Is this a simple shared file server published by a Ceph metadata server?

On the other hand:
BRIDGE_LIST: List of storage bridges to access the Ceph cluster. I have been googling a little bit but I have not been able to find references to it. What is this parameter for?

Thanks a lot, and sorry for these newbie questions.

Ó


Versions of the related components and OS (frontend, hypervisors, VMs): OpenNebula 5.4, KVM

Steps to reproduce: N/A

Current results: N/A

Expected results: N/A

I’ve come here with almost the same question:

In previous ONE 5.x versions, if SYSTEM_DS and IMAGE_DS had the same TM_MAD type (ceph), live migration worked and only the VM metadata files were moved via the ssh driver. But this sometimes led to errors when a VM host failed and the VM HA triggers were activated (at least partly because of a bug I found in the ceph driver, which is already fixed).

BRIDGE_LIST is the list of [proxy] hosts used to move data between the ceph pool and external sources/destinations. For example, if you upload an image via Sunstone, it is received at the Sunstone host first, then ssh’ed to a BRIDGE host, and from there uploaded to the ceph pool. If you have a different (non-ceph) system datastore type and keep images in a ceph datastore, then when you create a VM, ONE will download/convert the image from the ceph pool to a file on one of the BRIDGE hosts and ssh it to the hypervisor host.
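
To make the two paths above easier to picture, here is a minimal Python sketch of the hops involved. It is only an illustration of the description in this post, not OpenNebula code: the function names and host names are made up, BRIDGE_LIST is assumed to be a space-separated string, and picking the first bridge is an assumption (the real driver may choose differently).

```python
from typing import List


def parse_bridge_list(bridge_list: str) -> List[str]:
    """Split a BRIDGE_LIST value (assumed space-separated) into host names."""
    hosts = bridge_list.split()
    if not hosts:
        raise ValueError("BRIDGE_LIST must name at least one host")
    return hosts


def upload_hops(frontend: str, bridge_list: str) -> List[str]:
    """Hops for an image uploaded through the front-end into the ceph pool."""
    # Assumption: the first bridge host is used; real selection may differ.
    bridge = parse_bridge_list(bridge_list)[0]
    return [frontend, bridge, "ceph pool"]


def deploy_hops(bridge_list: str, hypervisor: str) -> List[str]:
    """Hops for a ceph image deployed onto a non-ceph system datastore."""
    bridge = parse_bridge_list(bridge_list)[0]
    return ["ceph pool", bridge, hypervisor]


if __name__ == "__main__":
    print(" -> ".join(upload_hops("sunstone-host", "bridge1 bridge2")))
    print(" -> ".join(deploy_hops("bridge1 bridge2", "kvm-node1")))
```

In both directions the bridge host is the only machine that has to talk to the Ceph cluster directly.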

And my question to everybody:

Now I’m a little confused by the ceph TM_MAD options. What is the difference between the ceph TM_MAD types?
What is the best option to have images and VM disks on ceph and VM metadata on a shared filesystem (CephFS, for example) to prevent any data movement in case of migration/HA triggers?
Should the system datastore be created with the ceph or the shared TM_MAD? And what about the ceph image datastore in this case?

Thanks!

Hi,

When you set TM_MAD to shared for the SYSTEM DS, it is not Ceph related anymore. It is just a shared filesystem: the contextualization ISO, the volatile disk images and the checkpoint file (created during VM suspend or cold migration) will be QCOW2 files on the shared filesystem.

You could use an altered ceph TM_MAD that does not copy or delete data during VM migration :wink:

Edit:

The TM_MAD used for the SYSTEM_DS does not affect the behavior of the IMAGE DS.

BR,
Anton Todorov

Anton, thanks a lot for your reply!

Yes, I get it.

I see that even when I set SYSTEM_DS with TM_MAD=shared and IMAGE_DS with DS_MAD=ceph and TM_MAD=ceph, the disk images still live within the ceph pool and ONE does not copy them to the system DS. But this confuses me even more :slight_smile: I don’t understand why the manual suggests creating a separate system_ds with TM_MAD=ceph for a ceph image_ds if everything works with TM_MAD=shared. What difference in behavior should be expected between these options?

I’ve tried to dig into the driver code, but probably not for long enough :slight_smile:

When shared is used you have a mix in the VM disks: some of them are on the ceph datastore, others are files on a filesystem. With ceph (or other distributed storage :wink: ) the backing store is consistent across all of the VM’s disks. That is why I hinted that it is not so hard to tweak the ceph driver to work on a shared filesystem.
Our addon-storpool proves that it is possible to handle both ssh and shared backed modes: when storpool is used as TM_MAD, switching between ssh and shared mode is a matter of a configuration variable.

Personally I prefer the ssh backed mode because in our recommended setup the only difference is that you have one less service to take care of.

All “metadata” files (domain XML and contextualization ISO) are (re)generated on every (re)deployment of a VM, so this is not an issue for VM recovery after a HOST failure.

On the other hand, with the ceph TM_MAD there is a corner case when the host fails just after the VM state is stored (VM suspend or cold migrate): it is possible to lose the checkpoint file that holds the VM state.

So it all depends on your use case and/or requirements and preferences - each one has its pros and cons.
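
As a rough summary of the trade-off, the sketch below encodes where each kind of VM file ends up for the two SYSTEM DS options, based only on what has been said in this thread and the warning quoted from the documentation. The table and function names are illustrative assumptions, not part of OpenNebula, and the placement may differ between versions.

```python
CEPH_POOL = "ceph pool"
SHARED_FS = "shared filesystem"
LOCAL_FS = "local filesystem on the host"

# Placement per SYSTEM DS TM_MAD, as described in this thread (assumption:
# the IMAGE DS uses DS_MAD=ceph and TM_MAD=ceph in both cases).
PLACEMENT = {
    "ceph": {
        "image_disk": CEPH_POOL,
        "volatile_disk": CEPH_POOL,
        "context_iso": LOCAL_FS,
        "checkpoint": LOCAL_FS,
    },
    "shared": {
        "image_disk": CEPH_POOL,
        "volatile_disk": SHARED_FS,
        "context_iso": SHARED_FS,
        "checkpoint": SHARED_FS,
    },
}

# Files that are (re)generated on every (re)deployment, so losing them is harmless.
REGENERATED = {"context_iso"}


def at_risk_on_host_failure(system_tm_mad: str) -> list:
    """Files that live only on the failed host and cannot be regenerated."""
    files = PLACEMENT[system_tm_mad]
    return sorted(
        name
        for name, where in files.items()
        if where == LOCAL_FS and name not in REGENERATED
    )


if __name__ == "__main__":
    for tm_mad in ("ceph", "shared"):
        print(tm_mad, "->", at_risk_on_host_failure(tm_mad))
```

With these assumptions only the checkpoint file is at risk in the ceph case, which matches the corner case mentioned above.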

BR,
Anton Todorov

Hi Anton,

Thanks a lot for your clarifications!

If images_ds and system_ds are ceph… Can you please explain where the contextualization files are created on the KVM hosts? Can they be placed on a shared FS? Does that make sense?

The idea is to allow OpenNebula to run the VM on the “best” host at every startup instead of keeping it on the same host.

Thanks a lot

Hi oscar.

Looking at the ceph TM_MAD (tm_mad/common/context), the contextualization ISO image is created as a file on the front-end and then transferred to the KVM host as a file. The procedure is the same for the ssh, shared and ceph TM_MADs.

(Hmm, so scratch the previous posts where I said that the context ISO is on a ceph volume. There are a lot of improvements provided by our addon, so I totally missed counting this as another “extra”. I was thinking that it is the standard behavior for all distributed storages :frowning: )

As already said, the issue with the ceph TM_MAD is that when there is a shared filesystem underneath, it is not aware of it and copies and deletes files when VMs are migrated or undeployed (1, 2).

Can anyone test ceph as the SYSTEM DS TM_MAD on a shared filesystem, but with mv, premigrate, postmigrate and failmigrate borrowed from the shared TM_MAD? Reading the code, it looks like these are the only files that break such a setup…
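
For anyone who wants to try this, a minimal Python sketch of assembling such a hybrid driver directory could look like the following. It is untested against a real installation: the function name and the target directory are hypothetical, and the only thing taken from this thread is the list of four scripts to borrow.

```python
import shutil
from pathlib import Path

# The scripts named above that should come from the shared TM_MAD.
MIGRATION_SCRIPTS = ("mv", "premigrate", "postmigrate", "failmigrate")


def build_hybrid_tm(ceph_dir: Path, shared_dir: Path, target_dir: Path) -> Path:
    """Copy the ceph TM driver, then overwrite its migration scripts
    with the ones from the shared TM driver."""
    # Check up front so no half-built directory is left behind.
    missing = [n for n in MIGRATION_SCRIPTS if not (shared_dir / n).exists()]
    if missing:
        raise FileNotFoundError(f"missing in {shared_dir}: {', '.join(missing)}")

    # copytree refuses to overwrite an existing target directory.
    shutil.copytree(ceph_dir, target_dir)

    for name in MIGRATION_SCRIPTS:
        dst = target_dir / name
        if dst.exists() or dst.is_symlink():
            dst.unlink()
        shutil.copy2(shared_dir / name, dst)  # keeps the executable bit
    return target_dir
```

On a typical front-end the driver directories would be something like /var/lib/one/remotes/tm/ceph and /var/lib/one/remotes/tm/shared, but treat those paths as an assumption and check your own installation. The new driver name would also have to be registered with OpenNebula and synced to the hosts, which this sketch does not cover.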

When you set a VM to ‘pending’, the scheduler decides where to run the VM.

BR,
Anton Todorov