I wanted to share a few issues we experienced with fs_lvm, and how we mitigated them.
VM migration was broken. The datastore is expected to be shared, even though the documentation says it doesn’t need to be. And if we do share it, the volatile disks end up stored on the NFS host instead of on the local machine.
Upon host restart the VMs couldn’t reboot. The VM’s directory structure was still there and in good shape, but the LV was no longer activated, so the VM failed to boot.
fs_lvm forces conversion of qcow2 images to raw during clone. This would break thin provisioning (if our SAN supported it) and also causes a longer “prolog”, since the full disk (filled with zeros) is copied instead of the much smaller image (8 GB vs 600 MB, in our case).
Since the image store’s “TM_MAD” is used to trigger the LVM “copy”, we need two image stores: one standard store for everything else (local storage, etc.) and a second one just for fs_lvm. This is required even though the images are not really stored on LVM; both stores sit on local storage anyway.
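For reference, the split looks roughly like the two datastore templates below (registered with `onedatastore create`). The names and the exact DS_MAD/TM_MAD values are just a guess at a minimal setup, not our production config:

```
# standard image datastore for everything else
NAME   = default_images
DS_MAD = fs
TM_MAD = ssh
TYPE   = IMAGE_DS

# duplicate image datastore, only so TM_MAD triggers the LVM copy
NAME   = lvm_images
DS_MAD = fs
TM_MAD = fs_lvm
TYPE   = IMAGE_DS
```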
Modified the mv script to actually copy (via rsync + ssh) the symlinks and volatile disks before activating the LV.
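The extra step amounts to something like the sketch below (a hypothetical simplification, not the upstream script: host/path/LV arguments are illustrative, and `RUN=echo` gives a dry run):

```shell
#!/bin/sh
# Replicate the VM directory (symlinks + volatile disks) to the
# destination host, then activate the LV there, before the VM resumes.
RUN="${RUN:-}"

migrate_vm_dir() {
    src_host="$1"; dst_host="$2"; vm_dir="$3"; lv_dev="$4"

    # rsync -a preserves the symlinks pointing at /dev/<vg>/<lv>;
    # volatile disks are ordinary files and get copied in full
    $RUN ssh "$src_host" "rsync -a -e ssh '$vm_dir/' '$dst_host:$vm_dir/'"

    # the LV must be active on the destination before the VM boots
    $RUN ssh "$dst_host" "sudo lvchange -ay '$lv_dev'"
}
```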
Added a hook on VM start to double-check that the LV is activated whenever an LV symlink is present but broken.
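The check itself is simple: a symlink that exists but whose target device node is missing means the LV was left deactivated (e.g. after a host reboot). A hypothetical sketch of the hook logic, with illustrative file names and `LVCHANGE` overridable for a dry run:

```shell
#!/bin/sh
# Reactivate any LV whose disk symlink in the VM directory is broken.
LVCHANGE="${LVCHANGE:-sudo lvchange}"

reactivate_broken_lvs() {
    vm_dir="$1"              # e.g. /var/lib/one/datastores/0/42
    for link in "$vm_dir"/disk.*; do
        [ -L "$link" ] || continue     # plain files are not LV-backed
        if [ ! -e "$link" ]; then      # symlink present but target missing
            $LVCHANGE -ay "$(readlink "$link")"
        fi
    done
}
```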
Modified the clone script to use dd instead of qemu-img convert. We also run qemu-img resize to fix the qcow2 image size.
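In essence the change looks like this (a hypothetical sketch with illustrative names, not our exact script; `RUN=echo` gives a dry run): dd copies only the qcow2 payload instead of the zero-filled raw conversion, and since the copied qcow2 still advertises its original virtual size, a resize to the LV size follows.

```shell
#!/bin/sh
# Clone a qcow2 image onto an LV without converting it to raw.
RUN="${RUN:-}"

clone_qcow2_to_lv() {
    src_img="$1"   # e.g. a 600 MB qcow2 in the image datastore
    dst_lv="$2"    # e.g. an 8 GB /dev/<vg>/<lv> device
    lv_size="$3"   # LV size in bytes, e.g. from: blockdev --getsize64

    # copies ~600 MB of qcow2 data instead of 8 GB of mostly-zero raw
    $RUN dd if="$src_img" of="$dst_lv" bs=4M conv=fsync

    # grow the qcow2 virtual size to match the LV
    $RUN qemu-img resize "$dst_lv" "$lv_size"
}
```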
We didn’t fix this one yet, but I included it anyway. Looking at the code, I understand why it works this way. It would seem natural, however, that if we select an SSH or Shared image and deploy the VM on a “fs_lvm” datastore, it should just work. Do you have any plans to improve this? We thought of making a new “wrapper” TM, say “ssh+fs_lvm”, that would wrap SSH and FS_LVM together and trigger the real TM according to the destination DS’s TM_MAD.
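The wrapper idea is only a thought experiment, but the dispatch itself would be tiny: look up the destination datastore’s real TM_MAD and hand the unchanged argument list to that driver’s script. Everything below (paths, the lookup command in the comment, `RUN` for dry runs) is an assumption, not a working driver:

```shell
#!/bin/sh
# Delegate a TM action (clone, mv, ...) to the real driver for the
# destination datastore.
TM_DIR="${TM_DIR:-/var/lib/one/remotes/tm}"
RUN="${RUN:-exec}"

delegate_tm_action() {
    action="$1"; real_tm="$2"; shift 2
    # pass the original TM arguments through untouched
    $RUN "$TM_DIR/$real_tm/$action" "$@"
}

# real_tm would come from something along the lines of:
#   onedatastore show "$DS_ID" -x | xmllint --xpath '//TM_MAD/text()' -
```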