TSC assertion fails during live migration

Hello all,

Since the upgrade to 6.6.0 CE, some of my long-running VMs fail during reschedule/live migration with the following error message in the VM log file:

Wed May 31 11:50:41 2023 [Z0][VMM][I]: error:
internal error: qemu unexpectedly closed the monitor:
2023-05-31T09:50:38.345819Z qemu-kvm-one:
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2:
warning: 'cirrus-vga' is deprecated, please use
a different VGA card instead
qemu-kvm-one: ../hw/i386/kvm/clock.c:88:
kvmclock_current_nsec: Assertion
`time.tsc_timestamp <= migration_tsc' failed.

I think the important part is the TSC assertion, not the cirrus-vga deprecation warning.

Some VMs migrate without problems, but one particular VM running Windows Server fails every time I try to live-migrate it. The migration even appears to corrupt its disk, after which the VM no longer boots. Does this also happen on other ONe deployments? Does live migration work for you on libvirt/KVM hosts?


My ONe nodes run chrony against the NTP server in the local network, so I estimate their clocks are in sync to within 1 ms.
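
For anyone who wants to verify the sync estimate on their own nodes, here is a minimal sketch that extracts the current offset from `chronyc tracking` output. The sample text, the host name in it, and the helper name `system_time_offset` are made up for illustration; the regular expression assumes the usual "System time : ... seconds fast/slow of NTP time" line. Note that this only measures the wall-clock offset of the host, not the TSC itself.

```python
import re

# Hypothetical excerpt of `chronyc tracking` output, for illustration only.
SAMPLE = """\
Reference ID    : C0A80001 (ntp.example.lan)
Stratum         : 3
System time     : 0.000012345 seconds fast of NTP time
Last offset     : +0.000003210 seconds
"""

def system_time_offset(tracking_output):
    """Return the signed offset in seconds (positive = fast), or None if absent."""
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    value = float(match.group(1))
    return value if match.group(2) == "fast" else -value

offset = system_time_offset(SAMPLE)
print(f"offset: {offset * 1000:.6f} ms, within 1 ms: {abs(offset) < 0.001}")
```

On a real node, the string would come from running `chronyc tracking` on each host and comparing the results.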

The ONe nodes run CentOS Stream 8, currently with qemu-kvm-6.2.0-28.module_el8.8.0+1257+0c3374ae.x86_64.

That said, I have seen TSC-related problems in the past as well. So maybe this is only a new assertion in qemu-kvm that catches the problem immediately, instead of leaving the guest to cope with the TSC going backwards or jumping far forward.

In previous ONe releases, I have occasionally seen the TSC in VMs jump forward by about 23 days after live migration, which deeply confused even the Linux kernel inside the guest.

With this assertion, it looks like the root cause occurs during the migration itself, not some time afterwards.
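
To make the failing check concrete, here is a simplified Python model of what I understand `kvmclock_current_nsec` to compute. It is a sketch, not the QEMU source: the parameter names follow the pvclock time-info fields mentioned in the assertion (`tsc_timestamp`, plus `system_time`, `tsc_to_system_mul`, and `tsc_shift` as I recall them), and the numbers in the usage lines are invented.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic

def kvmclock_current_nsec(tsc_timestamp, system_time,
                          tsc_to_system_mul, tsc_shift, migration_tsc):
    """Simplified model: convert the TSC delta since the last guest clock
    update into nanoseconds and add it to the recorded system time."""
    # This is the condition that fails in the log above: the TSC value
    # captured for migration is older than the one the guest clock recorded.
    if tsc_timestamp > migration_tsc:
        raise AssertionError("time.tsc_timestamp <= migration_tsc")
    delta = migration_tsc - tsc_timestamp
    if tsc_shift < 0:
        delta >>= -tsc_shift
    else:
        delta = (delta << tsc_shift) & MASK64
    # 32.32 fixed-point multiply: keep the 64 bits above the low 32.
    nsec = ((delta * tsc_to_system_mul) >> 32) & MASK64
    return (nsec + system_time) & MASK64

# With mul = 2**32 and shift = 0, one TSC tick counts as one nanosecond.
print(kvmclock_current_nsec(1000, 5000, 1 << 32, 0, 3000))

# A migration TSC behind the recorded timestamp trips the check.
try:
    kvmclock_current_nsec(3000, 5000, 1 << 32, 0, 1000)
except AssertionError as exc:
    print("assertion failed:", exc)
```

Under this model, the assertion fires whenever the TSC value read at migration time is lower than the timestamp stored in the guest's clock structure, which would be consistent with the TSC appearing to go backwards.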

I see that my VMs run with -rtc base=local on the qemu-kvm command line. Should some other parameter, such as clock=rt, be added to the command line?
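
For comparing the `-rtc` settings across VMs, here is a small sketch that pulls the option out of a QEMU command line. The command line below is a shortened, made-up example and `rtc_options` is a hypothetical helper name. As far as I know, QEMU documents the option as `-rtc [base=utc|localtime|datetime][,clock=host|rt|vm][,driftfix=none|slew]`, so a missing `clock` key would mean the default is in use.

```python
import shlex

def rtc_options(cmdline):
    """Return the key=value pairs of the first -rtc option as a dict."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-rtc":
            # "base=local,driftfix=slew" -> {"base": "local", "driftfix": "slew"}
            return dict(kv.partition("=")[::2] for kv in args[i + 1].split(","))
    return {}

# Hypothetical, abbreviated command line for illustration only.
example = "qemu-kvm-one -name guest=one-42 -rtc base=local,driftfix=slew -m 2048"
opts = rtc_options(example)
print(opts)
print("clock explicitly set:", "clock" in opts)
```

On a running host, the real command line can be read from `/proc/<pid>/cmdline` of the qemu-kvm process (arguments there are NUL-separated rather than space-separated).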

Thanks for any hints,

-Yenya