Does anybody else see "CPU stuck for <large number>s" guest errors?

Hello,

My question is probably not directly related to OpenNebula. But for the last few months, I occasionally see my guest VMs locking up, with errors like the following printed to their console:

Oct  4 09:55:40 guest123 kernel: [2148740.198048] watchdog: BUG: soft lockup - CPU#5 stuck for 2000426s! [swapper/5:0]
Sep  9 13:16:53 guest123 kernel: [2148750.428411] watchdog: BUG: soft lockup - CPU#6 stuck for 2000430s! [lua5.2:784]

As far as I know, a CPU being stuck for several seconds means that the QEMU vCPU thread in question simply did not get CPU time on an over-provisioned host. But this is something different: note that the reported lock-up time is HUGE (2000426 s is roughly 23 days), and even the timestamp of the first message is about 23 days in the future (the real date was around Sep 9, as the second line shows).

So it seems that timekeeping inside the QEMU guest went horribly wrong, maybe because of QEMU threads being scheduled on different host CPUs.

It usually happens for me after the VM is live-migrated to a different host, but I think it sometimes happens even without a migration. On the other hand, I wrote a script that live-migrates one of my VMs sequentially through all my physical hosts (roughly the sketch below), and it completed 5 loops without the VM crashing or reporting a stuck CPU.
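For reference, the test is essentially the following, a minimal sketch that just shells out to the OpenNebula CLI; the VM ID, the host IDs and the sleep interval are placeholders for my environment:

#!/usr/bin/env python3
# Minimal sketch of my migration-loop test: live-migrate one VM
# sequentially through all physical hosts, several times over.
# The VM ID and host IDs below are placeholders for my environment.
import subprocess
import time

VM_ID = 123                  # the guest used for the test
HOST_IDS = [0, 1, 2, 3]      # all physical hosts, as listed by "onehost list"
LOOPS = 5

for loop in range(LOOPS):
    for host in HOST_IDS:
        print(f"loop {loop + 1}: migrating VM {VM_ID} to host {host}")
        # OpenNebula CLI: live-migrate the VM to the given host
        subprocess.run(["onevm", "migrate", "--live", str(VM_ID), str(host)],
                       check=True)
        # give the migration (and the guest watchdog) time to settle
        time.sleep(120)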

My physical hosts have their time synchronized with the local NTP server, and I have verified that the time on them is indeed in sync.
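(The check was basically the equivalent of the following sketch: run chronyc tracking on every host over ssh and compare the reported system-time offsets. The host names are placeholders, and I am assuming chrony here, which is the EL8 default.)

#!/usr/bin/env python3
# Rough equivalent of the time-sync check: run "chronyc tracking" on
# each host over ssh and look at the "System time" offset line.
# Host names are placeholders; assumes chrony (the EL8 default).
import subprocess

HOSTS = ["node1", "node2", "node3"]

for host in HOSTS:
    out = subprocess.run(["ssh", host, "chronyc", "tracking"],
                         capture_output=True, text=True, check=True).stdout
    offsets = [line for line in out.splitlines()
               if line.startswith("System time")]
    print(host, ":", offsets[0].strip() if offsets else out.strip())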

Does anybody see this? Thanks,

-Yenya

OK, apparently downgrading qemu-kvm-core helped:

2022-09-09T16:33:16+0200 SUBDEBUG Downgrade: qemu-kvm-core-15:6.2.0-5.module_el8.6.0+1087+b42c8331.x86_64
2022-09-09T16:33:16+0200 SUBDEBUG Downgraded: qemu-kvm-core-15:6.2.0-12.module_el8.7.0+1140+ff0772f9.x86_64