I'm working on a live migration improvement project using the OpenNebula solution.
For now, my idea is to analyze the migration scripts, see which files are sent over the network during a migration, and try to compress them before sending (or change the compression method if the solution can already compress the files before sending them).
If you have documents that could help me or leads to follow, thank you for sharing.
OpenNebula uses the migrate script of the VMM actions to initiate live migrations. Take a look at the VMM drivers, located at /var/lib/one/remotes/vmm/
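To see what is there (assuming a default frontend layout; adjust the paths if your installation differs):
# per-hypervisor VMM driver scripts, including migrate
ls /var/lib/one/remotes/vmm/kvm/
less /var/lib/one/remotes/vmm/kvm/migrate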
IMO there are several places that can be improved.
First of all, the disks of the VM should be on a “shared” datastore, because the disk images must be accessible from both the source and destination hosts. Before VMM_MAD/migrate is called, the TM_MAD/premigrate script is called to handle the disks. At this stage the speed depends on the exact TM_MAD driver used for the disk images and on the number of attached disks.
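The TM drivers live next to the VMM drivers, and each TM_MAD ships its own premigrate/postmigrate hooks (again assuming the default layout):
# list the available TM_MAD drivers
ls /var/lib/one/remotes/tm/
# each driver directory has its own premigrate/postmigrate scripts
ls /var/lib/one/remotes/tm/ssh/premigrate /var/lib/one/remotes/tm/ssh/postmigrate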
Then VMM_MAD/migrate is called on the source host. The migration is done with libvirt via the qemu+ssh protocol, so there is not much room for improvement there. If you have a faster network available (10G, 40G), you can tweak VMM_MAD/migrate to migrate over the faster network (we have a variant in our pipeline that we will push to our addon soon). Another option is to try some different migration scenarios.
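Under the hood, the migrate script boils down to a libvirt call roughly like the following (a minimal sketch, not the exact driver code; DEPLOY_ID and DEST_HOST are placeholders for the values the driver receives):
# live-migrate the domain to the destination host over qemu+ssh
virsh --connect qemu:///system migrate --live $DEPLOY_ID qemu+ssh://$DEST_HOST/system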
After a successful VM migration, the TM_MAD/postmigrate script is called to clean up the source host.
By default TM_MAD/ssh has live migration disabled, because it is not shared and the volatile disks located on the SYSTEM datastore are a killer to transfer over the network.
In the VM home, located on the SYSTEM datastore, there are 3 or 4 files that are bigger and need special handling:
persistent/non-persistent disk images
volatile disk images
the contextualization ISO disk image
the checkpoint file (not related to the live migration)
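You can spot these files on a host by listing the VM home (a sketch assuming the default SYSTEM datastore layout; the datastore ID 0 and VM ID 42 are hypothetical):
# the VM home under the SYSTEM datastore holds the large files listed above
ls -lh /var/lib/one/datastores/0/42/
# typical entries: disk.0 (OS image), disk.1 (context ISO), checkpoint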
If the TM_MAD supports both SYSTEM and IMAGE datastores and can provide fast access to the persistent/non-persistent, volatile and context disk images from both the source and the destination hosts, there is a chance to have a really fast live migration.
As our addon-storpool supports both SYSTEM and IMAGE datastores, the VM migration is very fast. Even the “cold” migration is faster, because only the deployment XML of the VM is transferred over the network. This is because we extend TM_MAD/ssh and TM_MAD/shared with our pre/postmigrate scripts that handle all of the bigger files in the VM home (the checkpoint import/export to a StorPool block device is WIP in the “next” branch).
Take a look at our addon and use the README.md file (or the install.sh file, though it is a bit of a mess because it also handles upgrades) as a guide to where and how we integrate with the other modules of OpenNebula. I’ll be happy to answer any technical questions if something is not clear.
Finally I have some time to play with the VM migrations.
With a little tweak I’ve managed to live migrate a VM using a 10G interface. The migration time is (as expected) almost 10x shorter. My tests show that the change is not destructive and that by default the scripts behave as usual.
The patch introduces an extra argument for the migration interface: the default dest_host with the optional variable DEST_HOST_APPEND appended as a suffix.
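The gist of the change is along these lines (a rough sketch of the idea only, not the actual diff from 00-vmm_kvm_migrate.patch):
# in vmm/kvm/migrate: when DEST_HOST_APPEND is set in kvmrc, resolve
# the destination as "<dest_host><suffix>" so the migration traffic
# goes over the interface mapped to that name in /etc/hosts
if [ -n "$DEST_HOST_APPEND" ]; then
    DEST_HOST="${DEST_HOST}${DEST_HOST_APPEND}"
fi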
Here is a simple follow-up on how to enable the change:
# on each host:
echo "IP_OF_THE_10G_IFACE1 hostname1-10g1" >>/etc/hosts
echo "IP_OF_THE_10G_IFACE2 hostname2-10g1" >>/etc/hosts
echo "IP_OF_THE_10G_IFACE3 hostname3-10g1" >>/etc/hosts
# add the append to the kvmrc
echo "DEST_HOST_APPEND=\"-10g1\"" >>/var/lib/one/remotes/vmm/kvm/kvmrc
# assuming the patch is in the user's home
cd /var/lib/one/remotes/vmm/kvm/
patch -p0 < ~/00-vmm_kvm_migrate.patch
# sync the changes
su - oneadmin
onehost sync --force
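After the sync, live migrations triggered the usual way should go over the 10G interface (the VM ID and host name below are hypothetical):
# trigger a live migration as usual; the patched driver appends
# "-10g1" to the destination host name before connecting
onevm migrate --live 42 hostname2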
Carlos, I’ve spotted a little bug in the migrate_local script; here is the pull request against the current master branch.