Live migration fails randomly

Hi,

We use OpenNebula 4.12.1 on Debian Jessie with KVM, 6 nodes all connected via NFSv3 to the same storage backend. When live migrating machines between nodes, some VMs suddenly fail.

The VM is migrated to the new host, but does not respond to anything anymore. We cannot ping it, cannot connect to the console, and the kvm process is eating up all the CPU the VM is allowed to use. Nothing is logged.

We can, however, tell it to suspend, migrate again to another host, that sort of thing.

I have already ruled out a problem between the physical hosts: I can migrate 9 VMs from node A to B and then the 10th suddenly fails.

Anyone got an idea what might be going on here?

Regards,

Sander

Any relevant information in the logs? OpenNebula will output the error of the virsh command; you may also look at the qemu logs…
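
For example, on the destination host (paths assuming a default Debian/libvirt setup, and using the fact that OpenNebula names the libvirt domain one-<vmid>, so one-34 here):

tail -n 50 /var/log/libvirt/qemu/one-34.log
virsh -c qemu:///system domstate one-34 --reason
virsh -c qemu:///system list --all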

I couldn't find any errors in the logfiles.

The only thing I noticed is that during the migration, another VM started live migrating from the same source host to a different destination host. I don't know whether multiple simultaneous live migrations can be a problem.

Here are some logs related to the migration:

34.log (vmid):

Tue Aug  4 09:27:14 2015 [Z0][LCM][I]: New VM state is MIGRATE
Tue Aug  4 09:27:14 2015 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_premigrate.
Tue Aug  4 09:27:14 2015 [Z0][VMM][I]: ExitCode: 0
Tue Aug  4 09:27:14 2015 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Aug  4 09:27:43 2015 [Z0][VMM][I]: ExitCode: 0
Tue Aug  4 09:27:43 2015 [Z0][VMM][I]: Successfully execute virtualization driver operation: migrate.
Tue Aug  4 09:27:44 2015 [Z0][VMM][I]: ExitCode: 0
Tue Aug  4 09:27:44 2015 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Tue Aug  4 09:27:44 2015 [Z0][VMM][I]: ExitCode: 0
Tue Aug  4 09:27:44 2015 [Z0][VMM][I]: Successfully execute network driver operation: post.
Tue Aug  4 09:27:44 2015 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_postmigrate.
Tue Aug  4 09:27:44 2015 [Z0][LCM][I]: New VM state is RUNNING

oned.log:

Tue Aug  4 09:27:14 2015 [Z0][VMM][D]: Message received: LOG I 34 Successfully execute transfer manager driver operation: tm_premigrate.
Tue Aug  4 09:27:14 2015 [Z0][VMM][D]: Message received: LOG I 34 ExitCode: 0
Tue Aug  4 09:27:14 2015 [Z0][VMM][D]: Message received: LOG I 34 Successfully execute network driver operation: pre.
Tue Aug  4 09:27:43 2015 [Z0][VMM][D]: Message received: LOG I 34 ExitCode: 0
Tue Aug  4 09:27:43 2015 [Z0][VMM][D]: Message received: LOG I 34 Successfully execute virtualization driver operation: migrate.
Tue Aug  4 09:27:44 2015 [Z0][VMM][D]: Message received: LOG I 34 ExitCode: 0
Tue Aug  4 09:27:44 2015 [Z0][VMM][D]: Message received: LOG I 34 Successfully execute network driver operation: clean.
Tue Aug  4 09:27:44 2015 [Z0][VMM][D]: Message received: LOG I 34 ExitCode: 0
Tue Aug  4 09:27:44 2015 [Z0][VMM][D]: Message received: LOG I 34 Successfully execute network driver operation: post.
Tue Aug  4 09:27:44 2015 [Z0][VMM][D]: Message received: LOG I 34 Successfully execute transfer manager driver operation: tm_postmigrate.

This may be a concurrency problem with virsh; have a look here:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-KVM_live_migration-Live_KVM_migration_with_virsh.html

under 18.4.1. Additional tips for migration with virsh

It seems you may need to allow more connections…
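
If I read that section correctly, the relevant settings would be the connection/worker limits in /etc/libvirt/libvirtd.conf on the hosts, for example:

# /etc/libvirt/libvirtd.conf
max_clients = 20   # raise if many parallel migration connections are needed
max_workers = 20

followed by a restart of the libvirt daemon on the affected nodes.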

Hmmmm, no, I don’t think that is the issue. I doubt it is a virsh issue or even an OpenNebula issue for that matter.

But I can't put my finger on where the problem actually occurs. It looks like everything is fine during/after the transfer, only the kvm process is eating up all the CPU and the VM is unreachable. My hunch is still that live migrating multiple machines at the same time is somehow the problem…

I'm not sure where to go from here or how to troubleshoot it.

Sander

How would I make sure only one migration is active to a host at a time? Looking at sched.conf I can specify that I only want one migration per host per scheduling interval, but a migration might take longer than the interval…

By default the drivers only perform one action per host, so it probably won't migrate more than one VM from the same host, but it could migrate several to the same host, since the action is not instantiated there. If that's the case, there is currently no logic to prevent it…
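
For reference (as I understand the driver configuration), the overall concurrency, i.e. the total number of parallel driver operations rather than a per-destination-host limit, is the -t argument of the VM driver in /etc/one/oned.conf:

VM_MAD = [
    name       = "kvm",
    executable = "one_vmm_exec",
    arguments  = "-t 15 -r 0 kvm",   # -t = driver threads, -r = retries
    default    = "vmm_exec/vmm_exec_kvm.conf",
    type       = "kvm" ]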

I was testing a bit. It also fails randomly with offline migration, and I'm sure only 1 VM was being migrated.

I'm sure it's not an OpenNebula bug, but I'm at a bit of a loss where to go from here. There are no errors logged at all. The migration seems to succeed… only the VM is eating up all its CPU and I cannot connect to it in any way.

I've seen this problem with old distros without ACPI; they never woke up after suspension (cold migration is in reality a suspend/resume). Check that the system has acpid running and that the VM has the ACPI feature activated.
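
Roughly, what I mean (assuming a Debian-style guest; adjust for other distros):

# inside the guest
service acpid status

# in the OpenNebula VM template
FEATURES = [ ACPI = "yes" ]

# check an existing VM, e.g. id 34
onevm show 34 | grep -i acpi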

Hmmm, nope. All VMs have ACPI enabled and acpid running. I still have the problem.

I also disabled LRO and GRO on the NICs since I read this could cause problems with bridging. After about 100 migrate actions, 2 VMs suddenly died during migration. Again the KVM processes were still there, eating up all the allocated CPU.
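
(For reference, disabling them boils down to something like this; eth0 is just a placeholder for the bridged NIC:)

ethtool -K eth0 gro off lro off
ethtool -k eth0 | grep -E 'generic-receive-offload|large-receive-offload'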

I’m wondering if the error could be TSC related somehow.


I think you're right.
I got this from the KVM docs (Migration - KVM):

Problems / Todo
TSC offset on the new host must be set in such a way that the guest sees a monotonically increasing TSC, otherwise the guest may hang indefinitely after migration.
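
A quick way to check whether the hosts at least expose an invariant TSC (just a suggestion; flag names as reported by the Linux kernel):

# on each host
grep -o 'constant_tsc\|nonstop_tsc' /proc/cpuinfo | sort -u
cat /sys/devices/system/clocksource/clocksource0/current_clocksource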

Hi @roedie, I think you can add FEATURES=[LOCALTIME="yes"] to your template.

Then create the VMs and do the live migration test again. I think it can fix these failures.

Hey @linuxwind,

Looking at the docs I’m not sure if LOCALTIME fixes my problem. It looks more like a fix to make Windows use the correct time on boot.

I'm currently checking whether changing the guest clocksource from kvm-clock to tsc fixes anything. I've read some reports from people for whom this worked, but those were old threads. Testing will take some time since the VMs need to be running for a couple of days.
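
In case anyone wants to try the same, this is roughly what I'm doing inside the guests (Debian guests; GRUB paths may differ elsewhere):

# current and available clocksources in the guest
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

# switch permanently: add clocksource=tsc to GRUB_CMDLINE_LINUX in /etc/default/grub, then
update-grub && reboot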

Hey There,

I think I have found the problem. It seems qemu/kvm in Debian Jessie has a bug that can make VMs hang on migration under some circumstances. In Jessie the qemu version is 2.1.2, and 2.1.3 contains some fixes. The packages in Debian will most likely be updated in the next point release (8.3).

I have been testing with qemu 2.1.3 (from the Debian maintainer git repo) and it seems to fix my problems.
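
For anyone following along, you can check which qemu each node ends up running with something like:

dpkg -l 'qemu-system-x86*' | grep ^ii
qemu-system-x86_64 --version
virsh version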

Sander

Hi,

I'm running Debian 8.9, the qemu package is still 2.1.2, and I'm facing the same issue with live migration.
Could you provide the URL of the package that you used?

Thanks,
Yannick

OK, just use the jessie-backports repository to get a 2.8 version.
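
For anyone else hitting this, the steps boil down to something like this (mirror URL may vary, and the exact package set depends on your install):

echo 'deb http://ftp.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/backports.list
apt-get update
apt-get -t jessie-backports install qemu-system-x86 qemu-utils
# then restart libvirtd and power-cycle or migrate the VMs so they run on the new binary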