[SOLVED] Ceph - Can't read/write after a live disk snapshot

I successfully made a live disk snapshot on a running Linux VM with qemu-guest-agent installed:

Sat Dec 19 15:51:37 2015 [Z0][VM][I]: New LCM state is RUNNING
Sat Dec 19 15:51:56 2015 [Z0][VM][I]: New state is ACTIVE
Sat Dec 19 15:51:56 2015 [Z0][VM][I]: New LCM state is DISK_SNAPSHOT
Sat Dec 19 15:51:57 2015 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_snap_create_live.
Sat Dec 19 15:51:57 2015 [Z0][VMM][I]: VM disk snapshot successfully created.
Sat Dec 19 15:51:57 2015 [Z0][VM][I]: New LCM state is RUNNING
Sat Dec 19 15:51:57 2015 [Z0][LCM][I]: VM disk snapshot operation completed.

But after the snapshot, if I try to reboot the system using the reboot command in the guest system, my SSH session closes but the system won’t reboot and never comes back online. If I don’t try to reboot the system and just try to access the disk (read or write, ls, cat, touch, …), my session freezes.

I don’t have this problem if I STOP or SUSPEND the VM before doing the disk snapshot.

Is a live disk snapshot with a Ceph backend possible?

Thank you.
ONE 4.14.2 with Ceph 0.94.5

Yes, it is supported. Note that if you do not revert, the operation is just an
rbd snapshot operation on the disk, which should not alter the status of the
disk image … A couple of things to check:

1.- If you updated from a previous version, perform a onehost sync
2.- Maybe you can try to install qemu-guest-agent; the driver is actually
trying to do a domfsfreeze.
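
To illustrate what a freeze-then-snapshot sequence could look like, here is a minimal Python sketch. It is not the actual tm_snap_create_live driver code, only an assumption about the general shape: freeze the guest filesystems with virsh domfsfreeze, take the rbd snapshot, then always thaw. The domain name, image name and snapshot name are hypothetical placeholders.

```python
import subprocess


def snapshot_commands(domain, rbd_image, snap_name):
    """Build the three commands of a freeze / snapshot / thaw cycle.

    All arguments are hypothetical examples supplied by the caller.
    """
    return [
        ["virsh", "domfsfreeze", domain],                       # needs qemu-guest-agent in the guest
        ["rbd", "snap", "create", f"{rbd_image}@{snap_name}"],  # plain rbd snapshot
        ["virsh", "domfsthaw", domain],
    ]


def live_snapshot(domain, rbd_image, snap_name, run=subprocess.check_call):
    """Run the cycle, thawing the guest even if the snapshot step fails."""
    freeze, snap, thaw = snapshot_commands(domain, rbd_image, snap_name)
    run(freeze)
    try:
        run(snap)
    finally:
        # Never leave the guest filesystems frozen.
        run(thaw)
```

The run parameter is only there so the sequence can be dry-run, for example by passing print instead of executing anything on the host.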

Cheers

Thanks for your quick help.

The qemu-guest-agent is already installed and running. The hosts are also synced.

Note that I’m trying to snapshot the system disk (vda, ID-0). Maybe the Ceph live disk snapshot is only possible on non-system target disk?

I solved this problem by setting Cache=writethrough for disk 0 (the system disk). I can now live snapshot the disk without freezing the system. Even with the qemu-guest-agent running, the system froze after a disk snapshot if the disk Cache option was left blank.
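
For anyone who wants to check which cache mode a running VM actually ended up with, here is a small Python sketch that reads the cache attribute of each disk from a libvirt domain XML. The sample XML is a made-up, trimmed-down example (not taken from my VM), and the function names are my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed domain XML for illustration only.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writethrough'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_cache_modes(domain_xml):
    """Map each disk target (vda, vdb, ...) to its cache mode, or None if unset."""
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is None:
            continue
        modes[target.get("dev")] = driver.get("cache") if driver is not None else None
    return modes


def disks_without_writethrough(domain_xml):
    """List disk targets whose cache mode is anything other than writethrough."""
    modes = disk_cache_modes(domain_xml)
    return sorted(dev for dev, mode in modes.items() if mode != "writethrough")
```

On the host, the XML can be obtained with virsh dumpxml for the VM's domain and passed to disks_without_writethrough to see which disks still have the cache option unset or set to something else.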

…as explained in the doc:

Warning
Depending on the CACHE the live snapshot may or may not work correctly. For more security use CACHE=writethrough although this delivers the slowest performance.

Thanks for letting us know!