We use OpenNebula 4.12.1 on Debian Jessie with KVM: 6 nodes, all connected over NFSv3 to the same storage backend. When live migrating machines between nodes, some VMs suddenly fail.
The VM is migrated to the new host, but does not respond to anything anymore. We cannot ping it, cannot connect to the console, and the kvm process is eating up all the CPU the VM is allowed to use. Nothing is logged.
We can, however, tell it to suspend, migrate it again to another host, that sort of stuff.
I already ruled out a problem between the physical hosts. I can migrate 9 VMs from node A to B and the 10th VM suddenly fails.
Hmmmm, no, I don’t think that is the issue. I doubt it is a virsh issue or even an OpenNebula issue for that matter.
But I can’t put my finger on where the problem actually occurs. Everything looks fine during and after the transfer, except that the kvm process is eating up all the CPU and the VM is unreachable. My hunch is still that live migrating multiple machines at the same time is the problem somehow…
Not sure where to go from here and how to troubleshoot it.
How would I make sure only one migration is active to a host? Looking at sched.conf, I can specify that I only want one migration per host per scheduling interval, but a migration might take longer than the interval…
By default the drivers only perform one action per host, so it probably won’t migrate more than one VM from the same host, but it could migrate several to the same host, as the action is not instantiated there. If that’s the case, there is currently no logic to prevent that…
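Since there is no built-in logic for this, one workaround is to drive the migrations from a small script that waits for each VM to return to RUNNING before starting the next one. This is only a rough sketch, not something from the scheduler itself: it assumes the `onevm migrate ... --live` and `onevm show ... --xml` commands are available on the front-end, and the function names are made up for illustration.

```python
import subprocess
import time
import xml.etree.ElementTree as ET

# LCM_STATE 3 is RUNNING in OpenNebula's VM life-cycle states.
LCM_RUNNING = 3


def lcm_state(vm_id):
    """Return the LCM_STATE reported by `onevm show <id> --xml`."""
    xml_text = subprocess.check_output(["onevm", "show", str(vm_id), "--xml"])
    return int(ET.fromstring(xml_text).findtext("LCM_STATE"))


def live_migrate(vm_id, host_id):
    """Ask OpenNebula to live migrate one VM to the given host."""
    subprocess.check_call(["onevm", "migrate", str(vm_id), str(host_id), "--live"])


def migrate_one_at_a_time(vm_ids, host_id, migrate=live_migrate,
                          state=lcm_state, poll_seconds=5, sleep=time.sleep):
    """Migrate VMs to host_id strictly one after another.

    `migrate`, `state` and `sleep` are injectable so the loop can be
    exercised without a real OpenNebula installation.
    """
    for vm_id in vm_ids:
        migrate(vm_id, host_id)
        # Assumption: a short pause is enough for the VM to leave RUNNING.
        sleep(poll_seconds)
        while state(vm_id) != LCM_RUNNING:
            sleep(poll_seconds)
```

Note that this only serializes the migrations. A VM that hangs after the migration still reports RUNNING, and a failed migration also returns the VM to RUNNING on the source host, so the loop cannot tell either case apart from a success.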
I was testing a bit. It also fails randomly with offline migration, and I’m sure only one VM was being migrated at the time.
I’m sure it’s not an OpenNebula bug, but I’m at a bit of a loss where to go from here. There are no errors logged at all. The migration seems to succeed… only the VM is eating up all its CPU and I cannot connect to it in any way.
I’ve seen this problem with old distros without ACPI: they never woke up after suspension (cold migration is in reality suspend/resume). Check that the guest has acpid running and that the VM has the ACPI feature activated.
Hmmm, nope. All VMs have ACPI enabled and acpid running. Still have the problem.
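The host-side half of that check can be scripted. A minimal sketch, assuming the domain XML comes from `virsh dumpxml one-<vmid>` on the node (OpenNebula names its libvirt domains `one-<vmid>`); the helper name `has_acpi` is made up here. Whether acpid is running has to be checked inside the guest itself.

```python
import xml.etree.ElementTree as ET


def has_acpi(domain_xml):
    """Return True if the libvirt domain XML enables the ACPI feature.

    libvirt represents this as an <acpi/> element inside <features>.
    """
    root = ET.fromstring(domain_xml)
    return root.find("./features/acpi") is not None


if __name__ == "__main__":
    import subprocess
    import sys

    # Usage: python check_acpi.py one-42   (domain name is an example)
    xml_text = subprocess.check_output(["virsh", "dumpxml", sys.argv[1]])
    print("ACPI enabled" if has_acpi(xml_text) else "ACPI missing")
```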
I also disabled LRO and GRO on the NIC, since I read this could cause problems with bridging. After about 100 migrate actions, 2 VMs suddenly died during migration. Again the KVM processes were there, eating up all the allocated CPU.
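To double-check that the offloads really are off on every node, the output of `ethtool -k <interface>` can be parsed. This is just a sketch: the interface name `eth0` is an assumption, and the feature names are the ones ethtool normally prints for GRO and LRO.

```python
import subprocess

OFFLOADS = ("generic-receive-offload", "large-receive-offload")


def offload_states(ethtool_output):
    """Map each feature name in `ethtool -k` output to True (on) or False."""
    states = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep and value.split():
            # Values look like "on", "off" or "off [fixed]".
            states[name] = value.split()[0] == "on"
    return states


def still_enabled(ethtool_output, features=OFFLOADS):
    """Return the listed features that are still switched on."""
    states = offload_states(ethtool_output)
    return [name for name in features if states.get(name, False)]


if __name__ == "__main__":
    # "eth0" is an assumed interface name; adjust for the bridge's NIC.
    output = subprocess.check_output(["ethtool", "-k", "eth0"]).decode()
    print(still_enabled(output) or "GRO and LRO are off")
```

Anything the script reports can be switched off with `ethtool -K eth0 gro off lro off`, unless the driver marks the feature as fixed.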
Looking at the docs I’m not sure if LOCALTIME fixes my problem. It looks more like a fix to make Windows use the correct time on boot.
I’m currently checking whether changing the clocksource from kvm-clock to tsc fixes anything. I’ve read some reports from people for whom this worked, although those were old threads. Testing will take some time, since the VMs need to be running for a couple of days.
I think I have found the problem. It seems qemu/kvm in Debian Jessie has a bug which can make VMs hang on migration under some circumstances. Jessie ships qemu 2.1.2, and 2.1.3 contains some fixes. The packages in Debian will most likely be updated in the next point release (8.3).
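For anyone repeating this test, the active clocksource can be read inside the guest from sysfs. A small sketch, assuming a Linux guest with the usual `/sys/devices/system/clocksource/clocksource0` directory; the function name is made up for illustration.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_info(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by the kernel."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    current, available = clocksource_info()
    print("current:", current)
    print("available:", ", ".join(available))
```

Switching can be done at runtime by writing `tsc` to `current_clocksource` as root, or persistently with the `clocksource=tsc` kernel parameter.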
I have been testing with qemu 2.1.3 (from the Debian maintainer git repo) and it seems to fix my problems.
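To see which nodes still run the older qemu, the upstream version can be pulled out of the `--version` output of the qemu binary. This is a heuristic sketch only: the sample strings in the comments are illustrative, and a distribution package may carry backported fixes without changing the upstream version number.

```python
import re

# Upstream version that, per the post above, contains the migration fixes.
FIXED_IN = (2, 1, 3)


def parse_qemu_version(version_output):
    """Extract (major, minor, patch) from text like
    'QEMU emulator version 2.1.2 (Debian ...), Copyright ...'."""
    match = re.search(r"version (\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("no QEMU version found in: %r" % version_output)
    return tuple(int(part or 0) for part in match.groups())


def predates_fix(version_output, fixed_in=FIXED_IN):
    """Return True if the reported upstream version is older than fixed_in."""
    return parse_qemu_version(version_output) < fixed_in
```

On a node, the input would be the output of something like `qemu-system-x86_64 --version`.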