Issues with incremental backups using rsync datastore

Hello All,

I am doing some tests with OpenNebula on a hyperconverged 3-node cluster, and I am stuck on an issue with incremental backups that "hang", either on the first increment or after several successful ones. Full backups work, and incremental backups work sometimes, so I do not suspect a permissions problem or the like (the underlying tools for full and incremental backups appear to be the same). I wonder if there is a race condition somewhere that triggers more readily with large disk images and with CBT enabled. Since this is a hyperconverged cluster with all the convoluted stacks (and problems) that entails, I do not expect somebody to solve my problem straight away, but maybe somebody can guide me through debugging this issue properly. Any hint that points me in the right direction is welcome, as running full backups all the time is not an option.


Setup
The system is a hyperconverged cluster with three nodes, running Ceph Squid (19.2.3) deployed with cephadm. The host OS is Ubuntu Server 24.04.3 LTS with the root partition on a ZFS filesystem. OpenNebula is 7.0.1 CE and runs in a VM in the cluster (OS also Ubuntu Server 24.04.3). I had configured OpenNebula in an HA setup with three nodes, but after the backup issues appeared I switched it back to standalone mode.

Steps to reproduce:

  • Instantiate a VM and deploy it on the CEPH_SYSTEM datastore. The VMs I am working with are "individual" ones, based on persistent images on the Ceph image datastore.
  • Configure Backup:
    • Volatile: NO / FS Freeze: AGENT / Keep: 7 / Mode: INCREMENT / INCREMENT_MODE: SNAPSHOT
      I seem to get this issue no matter which settings I choose. However, the "success rate" for incremental backups seems higher with SNAPSHOT INCREMENT_MODE and smaller disk images.
  • Run a full backup with RESET
  • Run subsequent backups one or more times, until one "hangs"
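
For reference, I believe the backup settings above map to a BACKUP_CONFIG section in the VM template roughly like this (my best reading of the template syntax; the attribute names should be checked against the docs):

```text
BACKUP_CONFIG = [
  BACKUP_VOLATILE = "NO",
  FS_FREEZE       = "AGENT",
  KEEP_LAST       = "7",
  MODE            = "INCREMENT",
  INCREMENT_MODE  = "SNAPSHOT"
]
```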

Current results:

  • The VM switches to the BACKUP state (vm.log: New LCM state is BACKUP)
  • I see on CEPH that the snapshot one/one-19@one_backup_1 is generated and protected
  • On the KVM host the /tmp directory contains many rbd-diff files, in this case 700, plus one larger file with the extension *.rdiff:
-rw-------  1 oneadmin oneadmin      4096 Jan 23 08:21 rbd-diff20260123-2062606-zywke4
-rw-r--r--  1 oneadmin oneadmin 530874019 Jan 23 08:21 rbd-diff-2062606-1769152863.rdiff
  • I see a qcow2 file in the temporary backup location that seems to be correct:
root@louie:/var/lib/one/datastores/100/47/backup# qemu-img info -U disk.0.1
image: disk.0.1
file format: qcow2
virtual size: 30 GiB (32212254720 bytes)
disk size: 34 MiB
cluster_size: 65536
backing file: rbd:one/one-19@one_backup_0:id=libvirt
backing file format: rbd
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: disk.0.1
    protocol type: file
    file length: 143 MiB (150274048 bytes)
    disk size: 34 MiB
root@louie:/var/lib/one/datastores/100/47/backup# qemu-img check -U disk.0.1
No errors were found on the image.
2270/491520 = 0.46% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 150274048
  • On the KVM host the ruby script backup_rbd.rb is running with several child processes, but seems to be just waiting for something:
oneadmin 2062606  0.0  0.1 889476 546344 ?       Sl   08:21   0:00 ruby /var/tmp/one/tm/lib/backup_rbd.rb one/one-19 one_backup_0 one/one-19@one_backup_1 /var/lib/one//datastores/100/47/backup/disk.0.1
oneadmin 2062830  0.0  0.0 1385688 28416 ?       Sl   08:21   0:02 qemu-nbd -k /var/lib/one/datastores/100/47/backup/disk.0.1.socket -f qcow2 -t /var/lib/one//datastores/100/47/backup/disk.0.1
oneadmin 2063125  0.0  0.0 233416  8460 ?        Sl   08:21   0:00 qemu-io -f raw nbd+unix:///?socket=/var/lib/one/datastores/100/47/backup/disk.0.1.socket
  • The backup “hangs” forever
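
To see where the processes above are stuck, one option is to inspect /proc on the KVM host. A minimal, self-contained sketch of the technique, using `sleep 30` as a stand-in for the hung backup_rbd.rb PID:

```shell
# Hypothetical debugging sketch: inspect what a blocked process is waiting on.
# `sleep 30` stands in for the stuck backup_rbd.rb process.
sleep 30 &
pid=$!
sleep 0.2                                   # let the child reach its blocking call
state=$(grep '^State:' "/proc/$pid/status")
echo "$state"                               # e.g. "State: S (sleeping)"
cat "/proc/$pid/wchan" 2>/dev/null || true  # kernel symbol the task sleeps in, if exposed
echo
ls "/proc/$pid/fd"                          # open files/sockets the process is holding
kill "$pid"
```

Running the same commands against the real PID (plus `strace -p PID`) should show whether the script is blocked on the NBD socket, on a pipe to a child, or on something else.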

Expected results:
As with the full backup, I expect backup_rbd.rb to terminate and the temporary image in /var/lib/one//datastores/100/47/backup/disk.0.1 to be transferred to the RSYNC_HOST / datastore by rsync.

Hello @teletoby,

Welcome to this forum, and apologies for the delayed reply.

Unfortunately this behaviour is somewhat expected, especially when using snapshot-based incremental modes against complex backends like Ceph + rsync.

I suggest you check our documentation about backups:
Backup Operations

Also, please check the logs for any timeout message.

If you can share logs from around the point where the incremental stops progressing, or confirm whether smaller disk images complete consistently, that would help narrow down whether this is a backend IO timing issue, an rsync transport issue, or something internal to OpenNebula's incremental driver.

Cheers,

Hello Francisco,

Thanks for the reply, and sorry for the delay with my follow-up. I was trying to figure out alternative approaches in the meantime, but I did not realize I was walking on such thin ice... I tried using libvirt's native CBT feature, with a script that wraps a proxy qcow2 file in front of my RBD images (using the data_file property) at the deploy stage of the VM in OpenNebula. But I did not realize that these features are very recent (I use libvirt 10.0.0) and apparently not supported through the whole stack. So while the higher-level mechanism seems to work, the approach fails on my system: qemu/libvirt apparently does not realize that it needs to use the RBD driver when it finds a qcow2 file with a data_file setting that starts with rbd:.

As for alternative methods, I still have Benji backup as an option, but I was initially not keen on using it, because generating a Benji configuration on each KVM host seemed like a headache. Injecting the Benji configuration directives via the RAW data option of VM templates might, however, be a clean way to do this.

And back to the OpenNebula backup, here are some further details:

For example, I do a full backup with a reset on VM 38. An incremental backup (snapshot) started a couple of minutes later (on a small VM, around 30 GB) usually terminates OK.

Then I start an incremental backup (snapshot) after a couple of hours, typically the day after. The VM switches to the BACKUP state as expected (vm.log):

Sat Jan 31 08:24:15 2026 [Z0][VM][I]: New LCM state is BACKUP
Sat Jan 31 08:24:42 2026 [Z0][VMM][I]: Ignoring VM state update

Then the backup job hangs ad infinitum with the characteristics mentioned in my initial message. After a long wait with the "prepare backup folder" job in sleeping state, I kill the parent ruby process on the KVM host, and I get plenty of messages in vm.log:

Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: Command execution failed (exit code: 143): /var/lib/one/remotes/tm/ceph/prebackup_live dewey.tweax.net:/var/lib/one//datastores/100/38 0 529563bf-72ec-4071-b3bb-413
b2e162c5e - 38 100
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: Command failed:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: export LANG=C
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: export LC_ALL=C
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: set -ex -o pipefail
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # ----------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # Prepare backup folder
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # ----------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: [ -d /var/lib/one//datastores/100/38/backup ] && rm -rf /var/lib/one//datastores/100/38/backup
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: mkdir -p /var/lib/one//datastores/100/38/backup
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: echo "PFZNPgogIDxJRD4zODwvSUQ+CiAgPFVJRD4wPC9VSUQ+CiAgPEdJRD4wPC9H
...
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: QUNLVVBTPgo8L1ZNPg==
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: " > /var/lib/one//datastores/100/38/backup/vm.xml
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # --------------------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # Create snapshots for disks
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # --------------------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: rbd --id libvirt snap create one/one-17@one_backup_2
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: rbd --id libvirt snap protect one/one-17@one_backup_2
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: ssh 'dewey.tweax.net' '    script="$(mktemp)";     echo "c2V0IC1leCAtbyBwa
...
oKCgo=" | base64 -d > "$script";     trap "rm $script" EXIT;     bash "$script"; '
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: set -ex -o pipefail
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # --------------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # Cleanup snapshots
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: # --------------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]: rbd --id libvirt snap ls one/one-17 --format json |     jq -r '.[] | select(.name == "one_backup_1") .name' |     xargs -rn1 sh -c '         rbd --id libvirt snap unprotect one/one-17@$1 > /dev/null 2>&1 || true;         rbd --id libvirt snap rm one/one-17@$1;     ' sh
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][I]:
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: Command failed:
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: export LANG=C
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: export LC_ALL=C
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: set -ex -o pipefail
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live:
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: # ----------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: # Prepare backup folder
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: # ----------------------
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: [ -d /var/lib/one//datastores/100/38/backup ] && rm -rf /var/lib/one//datastores/100/38/backup
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live:
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: mkdir -p /var/lib/one//datastores/100/38/backup
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live:
Sat Jan 31 08:33:24 2026 [Z0][VMM][E]: prebackup_live: echo "PFZNPgogIDxJRD4zODwvSUQ+CiAgPFVJRD4wPC9VSUQ+CiAgPEdJRD4wPC9H
...
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: QUNLVVBTPgo8L1ZNPg==
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: '
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: + rbd --id libvirt snap create one/one-17@one_backup_2
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: Creating snap: 10% complete...^MCreating snap: 100% complete...done.
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: + rbd --id libvirt snap protect one/one-17@one_backup_2
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: + ssh dewey.tweax.net '    script="$(mktemp)";     echo "c2V0IC1leCAtbyBwaXBlZmFpbAoKIyAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQojIFNhdmUgVFBNIHN0YXRlCiM
gLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KCgojIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCiMgRXhwb3J0LCBjb252ZXJ0ICYgY2xlYW51cAojIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tClsgLWQgL3Zhci9saWIvb25lLy9kYXRhc3
RvcmVzLzEwMC8zOC9iYWNrdXAgXSB8fCBta2RpciAtcCAvdmFyL2xpYi9vbmUvL2RhdGFzdG9yZXMvMTAwLzM4L2JhY2t1cAoKZW52IENFUEhfVVNFUj0ibGlidmlydCIgcnVieSAvdmFyL3RtcC9vbmUvdG0vbGliL2JhY2t1cF9yYmQucmIgb25lL29uZS0xNyBvbmVfY
mFja3VwXzEgb25lL29uZS0xN0BvbmVfYmFja3VwXzIgL3Zhci9saWIvb25lLy9kYXRhc3RvcmVzLzEwMC8zOC9iYWNrdXAvZGlzay4wLjIKCgpjZCAvdmFyL2xpYi9vbmUvL2RhdGFzdG9yZXMvMTAwLzM4L2JhY2t1cAoKCgo=" | base64 -d > "$script";     t
rap "rm $script" EXIT;     bash "$script"; '
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: + '[' -d /var/lib/one//datastores/100/38/backup ']'
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: + env CEPH_USER=libvirt ruby /var/tmp/one/tm/lib/backup_rbd.rb one/one-17 one_backup_1 one/one-17@one_backup_2 /var/lib/one//datastores/100/38/backup/disk.0.2
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: backup_rbd.rb: Starting Ceph full backup.
Sat Jan 31 08:33:25 2026 [Z0][VMM][I]: Terminated
Sat Jan 31 08:33:26 2026 [Z0][VMM][I]: Successfully execute transfer manager driver operation: postbackup_live.
Sat Jan 31 08:33:26 2026 [Z0][VMM][I]: Failed to execute transfer manager driver operation: prebackup_live.
Sat Jan 31 08:33:26 2026 [Z0][VMM][E]: BACKUP: Command failed:export LANG=Cexport LC_ALL=Cset -ex -o pipefail# ----------------------# Prepare backup folder# ----------------------[ -d /var/lib/one//data
stores/100/38/backup ] && rm -rf /var/lib/one//datastores/100/38/backupmkdir -p /var/lib/one//datastores/100/38/backupecho "PFZNPgogIDxJRD4zODwv
...
CA8SUQ+MTIxPC9JRD4KICAgICAgPElEPjEyNDwvSUQ+CiAgICA8L0JBQ0tVUF9JRFM+CiAgPC9CQUNLVVBTPgo8L1ZNPg=='+ rbd --id libvirt snap create one/one-17@one_backup_2Creating snap: 10% complete...^MCreating snap: 100% complete...done.+ rbd --id libvirt snap protect one/one-17@one_backup_2+ ssh dewey.tweax.net '    script="$(mktemp)";     echo "c2V0IC1leCAtbyBwaXBlZmFpbAoKIyAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQojIFNhdmUgVFBNIHN0YXRlCiMgLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KCgojIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCiMgRXhwb3J0LCBjb252ZXJ0ICYgY2xlYW51cAojIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tClsgLWQgL3Zhci9saWIvb25lLy9kYXRhc3RvcmVzLzEwMC8zOC9iYWNrdXAgXSB8fCBta2RpciAtcCAvdmFyL2xpYi9vbmUvL2RhdGFzdG9yZXMvMTAwLzM4L2JhY2t1cAoKZW52IENFUEhfVVNFUj0ibGlidmlydCIgcnVieSAvdmFyL3RtcC9vbmUvdG0vbGliL2JhY2t1cF9yYmQucmIgb25lL29uZS0xNyBvbmVfYmFja3VwXzEgb25lL29uZS0xN0BvbmVfYmFja3VwXzIgL3Zhci9saWIvb25lLy9kYXRhc3RvcmVzLzEwMC8zOC9iYWNrdXAvZGlzay4wLjIKCgpjZCAvdmFyL2xpYi9vbmUvL2RhdGFzdG9yZXMvMTAwLzM4L2JhY2t1cAoKCgo=" | base64 -d > "$script";     trap "rm $script" EXIT;     bash "$script"; '+ '[' -d /var/lib/one//datastores/100/38/backup ']'+ env CEPH_USER=libvirt ruby /var/tmp/one/tm/lib/backup_rbd.rb one/one-17 one_backup_1 one/one-17@one_backup_2 /var/lib/one//datastores/100/38/backup/disk.0.2Terminated
Sat Jan 31 08:33:26 2026 [Z0][VM][I]: New LCM state is RUNNING
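
As an aside, the big base64 blob in the log above is just the driver shipping a script over ssh: it decodes the payload into a temp file, runs it, and removes it on exit. A local toy version of the same pattern, with a hypothetical one-line payload instead of the real backup script:

```shell
# Toy reproduction of the script-shipping pattern visible in the log:
# base64-encode a script, decode it into a temp file, run it, clean up.
payload=$(printf 'echo hello from payload\n' | base64)
script="$(mktemp)"
echo "$payload" | base64 -d > "$script"
out=$(bash "$script")
rm -f "$script"   # the real driver does this via: trap "rm $script" EXIT
echo "$out"
```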

If I try to understand what I find in the logs, it looks like creating the snapshot works fine. Then it is this script:

CEPH_USER=libvirt ruby /var/tmp/one/tm/lib/backup_rbd.rb one/one-17 one_backup_1 one/one-17@one_backup_2 /var/lib/one//datastores/100/38/backup/disk.0.2

that keeps waiting. There is no timeout; the script stays in sleep state until I kill it manually. The rsync transport is never reached because this script does not terminate successfully.
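
Since there is no timeout anywhere in this path, one thing I may try as a stopgap is wrapping the hanging step in coreutils timeout(1) so it at least fails loudly instead of sleeping forever. A toy sketch, with `sleep 30` standing in for the backup_rbd.rb call and a 2-second deadline standing in for a realistic one:

```shell
# Hypothetical mitigation sketch: bound a potentially hanging command with
# coreutils timeout(1). `sleep 30` stands in for the backup_rbd.rb invocation.
status=0
timeout --signal=TERM 2 sleep 30 || status=$?
echo "exit status: $status"   # timeout(1) exits with 124 when it kills the command
```

An exit status of 124 would then surface in vm.log as a failed (rather than stuck) prebackup_live, which is at least actionable.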

Best regards,
Tobias

Hi @teletoby

Apologies for the late reply. From what I can see in the logs, the hang clearly occurs before rsync ever runs, in the snapshot/export step executed by the backup script. This matches known behaviour with incremental backups under the rsync driver.

I'm asking the engineering team for more help; stay tuned.

Hello,

No worries. I must admit I have not investigated this issue further, nor the solution using Benji backup. For the moment I can live with running full backups for the important VMs, and I will come back to this once I have more time.

Best regards,
Tobias