Live Migration of busy VMs

Hello Community,

we run several very busy VMs that dirty a lot of memory pages, and when we have to live-migrate them to another host
(for example for maintenance of a VMM host) we run into problems: the migration job runs forever.

Of course we modified the migrate driver (/var/lib/one/remotes/vmm/kvm/migrate) to use more bandwidth
and played with libvirt parameters such as adding "--timeout <seconds>". Currently we allow up to 500 MB/s of
live-migration traffic on our 2x 10G VMM network with the following addition in the driver:

exec_and_log "virsh --connect $LIBVIRT_URI migrate-setspeed $deploy_id 500" \
    "Setspeed to 500M is set!"

But we also need to run virsh migrate-setmaxdowntime one-$id <downtime> while the migration is in progress, and this cannot be done from the driver as it stands, because the BASH script executes its commands serially.
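
In theory we could background the call from the driver right before its virsh migrate line, something like this (just an untested sketch; the 30 second delay and the 2000 ms downtime are placeholder values, $deploy_id and $LIBVIRT_URI are the variables the driver already uses):

# background subshell: wait until the migration has started, then raise the
# allowed downtime for this domain (the value is in milliseconds)
(
    sleep 30
    virsh --connect $LIBVIRT_URI migrate-setmaxdowntime $deploy_id 2000
) &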

Is there any other option for setting the migration speed and max downtime on the fly?

Currently we handle this by manual intervention whenever busy VMs have to be live-migrated.

Best regards,

Sebastian

Hi, try using compression; check my post Live Migration enhancement proposal.
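
For example, the virsh migrate call in the driver can be extended with the --compressed flag (a sketch only, assuming your migrate call looks similar to the stock one; --comp-methods needs a reasonably recent libvirt, and $dest_host / $QEMU_PROTOCOL are just the usual driver variables):

# compress migrated memory pages (xbzrle delta compression in this example)
virsh --connect $LIBVIRT_URI migrate --live --compressed --comp-methods xbzrle \
    $deploy_id $QEMU_PROTOCOL://$dest_host/system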

Hello @opennebula2,

did you solve this problem after all? I am running into the same problem - one of my big-ish VMs is practically unmigratable because of that, and this makes installing host updates (such as the current flood of Meltdown/Spectre related kernel updates) a nightmare.

FWIW, here is the remaining dirty memory size (sampled every 5 seconds) several hours after onevm resched:

# while sleep 5; do virsh domjobinfo one-770 | grep 'Memory remaining'; done
Memory remaining: 567.691 MiB
Memory remaining: 1.239 GiB
Memory remaining: 707.289 MiB
Memory remaining: 1.307 GiB
Memory remaining: 778.078 MiB
Memory remaining: 1.395 GiB
Memory remaining: 870.316 MiB
Memory remaining: 313.070 MiB
Memory remaining: 965.949 MiB
Memory remaining: 405.328 MiB
Memory remaining: 1.032 GiB
Memory remaining: 495.066 MiB
Memory remaining: 1.128 GiB
Memory remaining: 596.141 MiB
Memory remaining: 1.208 GiB
Memory remaining: 676.055 MiB
Memory remaining: 1.245 GiB
Memory remaining: 712.984 MiB

I even tried to log in to the host where the VM was running, killed the virsh migrate process and ran virsh migrate-setmaxdowntime one-770 2000 (the default is 300 ms), but it did not help. Only after killing the migration process once again, setting the max downtime to 20000 (20 seconds), and rescheduling the VM again did the migration finish in several minutes. The last "Memory remaining" values, sampled every 5 seconds, were these:

Memory remaining: 3.376 GiB
Memory remaining: 2.829 GiB
Memory remaining: 2.283 GiB
Memory remaining: 2.796 GiB
Memory remaining: 2.247 GiB
Memory remaining: 0.000 B

So I guess a max downtime of 5-10 seconds would be sufficient.
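
For the record, the manual sequence that finally worked looked roughly like this (virsh domjobabort should be equivalent to killing the virsh migrate process by hand; the downtime value is in milliseconds):

# on the host running the VM: abort the stuck migration job
virsh domjobabort one-770
# allow up to 20 s of downtime for the next attempt
virsh migrate-setmaxdowntime one-770 20000
# then, on the front-end, trigger the migration again
onevm resched 770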

According to this presentation from 2015 it should be possible to set a host-wide time limit for migrating all VMs away from a host, but I don't know where to configure it, either in libvirt or in OpenNebula. I think the onevm flush command should use it, if possible.

Does anybody experience the same problem?

Thanks!

-Yenya

Hello, personally I use compression. In newer libvirt and qemu there is also the possibility to use post-copy instead of pre-copy migration, but you need support in the kernel. There is also a post-copy-after-pre-copy option.

https://rk4n.github.io/2016/08/10/qemu-post-copy-and-auto-converge-features/
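
Roughly, the relevant virsh flags look like this (just a sketch; one-770 and the destination host are placeholders, and availability of the flags depends on your libvirt/qemu versions):

# pre-copy migration with compression of the transferred memory pages
virsh migrate --live --compressed one-770 qemu+ssh://desthost/system

# post-copy: start with post-copy enabled, then switch over once it is clear
# that pre-copy is not converging
virsh migrate --live --postcopy one-770 qemu+ssh://desthost/system &
virsh migrate-postcopy one-770

# or let libvirt switch to post-copy automatically after the first pre-copy pass
virsh migrate --live --postcopy --postcopy-after-precopy one-770 qemu+ssh://desthost/system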