I/O wait and CPU load high => VM stuck

Hi there,

first the hard facts:
OpenNebula 4.14.2
VM Hosts: Debian 8
ONE Management: Ubuntu 14.04 LTS
VM Guests: Ubuntu 14.04 LTS
NO SHARED FILESYSTEM (only ssh to deploy the VMs)

All VMs are running with these parameters:
/usr/bin/qemu-system-x86_64 -name one-125 -S -machine pc-i440fx-2.1,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-xx.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/one/0/xx/disk.0,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/data/one/0/xxx/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=46,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=02:00:xx:xx:xx:xx,bus=pci.0,addr=0x3 -vnc 0.0.0.0:125 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

So I ran into this weird issue a few days ago:
The VM was running with 2 vCPUs (0.25 CPU) and 4 GB of memory, doing some basic VM-ing with an average load of 0.5-0.9. Suddenly the load spiked above any healthy level and got stuck at a load average of 10, the I/O wait went through the roof, and all services stopped responding. No ssh, no console via Sunstone, no “force” reboot/shutdown, nothing. All operations timed out.
Sunstone also showed 100% “real” CPU.
Solution: find the PID of the KVM process, kill it, and wait for oned to recognize the killed VM so it can be resumed.
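
For reference, a minimal sketch of that manual recovery, assuming the stuck VM has ID 125 (domain one-125); adjust names and paths to the affected VM:

    # On the KVM host: find the qemu process of the stuck domain.
    pgrep -af 'qemu-system-x86_64.*one-125'

    # Try a clean destroy through libvirt first; fall back to SIGKILL if that hangs.
    timeout 30 virsh destroy one-125 \
        || kill -9 $(pgrep -f 'qemu-system-x86_64.*-name one-125')

    # On the frontend: wait for the next monitoring cycle (the VM should drop to
    # UNKNOWN/POWEROFF), then bring it back.
    onevm show 125
    onevm resume 125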

How do I:

  • find the root cause on the KVM/libvirt side? (No other VMs on the same host were affected, only this one.) OR
  • avoid the high I/O wait with some tweaks? OR
  • handle VMs stuck in this state “automatically” with OpenNebula? (I don’t want to get up at 3 in the morning to kill a process and start it again.)

If you need more information, I’m happy to share it with you.

EDIT: I currently have another VM in this “high-load” state.

Thanks!

Stephan opennebula@discoursemail.com writes:

Hi there,

Hello,

[…]

We sometimes have issues with disk I/O on our SAN: when some VMs are
swapping, disk I/O increases on the hypervisors without much CPU use,
but the load gets stratospheric (the max was 320 on our setup).

Accessing the VMs becomes impossible, but ssh to the hypervisors is still
fine since their OS does not use the SAN.

I don’t have an automatic solution; we have to find the problematic VMs,
power them off and resize their memory to avoid swapping, or maybe disable
swap in the first place. I’m not sure what’s best.
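
A rough sketch for spotting the noisy domain on a hypervisor (assumes libvirt >= 1.2.8 for virsh domstats; iotop and pidstat come from the iotop and sysstat packages):

    # Per-domain block I/O counters; run twice a few seconds apart and compare deltas.
    virsh domstats --block

    # Live per-process disk I/O on the hypervisor:
    iotop -o -P      # only processes currently doing I/O, grouped per process
    pidstat -d 5     # per-process read/write kB/s every 5 seconds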

Regards.
Daniel Dehennin

Same here: the host the VM is running on has a “normal” load, only this particular VM is blocking. All other VMs running on the same disks on the same host are behaving normally, with no high I/O wait or load.

Disabling swap changes nothing, as the VM is not using all of its memory (currently, right before the freeze, around 500 MB out of 4 GB). Also, swappiness is set to 10, so swapping should not happen at this “early” stage.
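
For completeness, a quick way to confirm from inside the guest that swap really is idle (a generic sketch, nothing VM-specific assumed):

    cat /proc/sys/vm/swappiness   # should print 10 here
    free -m                       # the "Swap:" row shows used swap in MB
    vmstat 1 5                    # si/so columns are swap-in/out per second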

Is it possible to build a hook that kills the VM when it becomes unresponsive? It would be a very, very dirty hack, but it is currently the only way I can think of to solve the “get up at 3 in the morning” problem.
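
One possible shape for that dirty hack, sketched as a host-side watchdog run from cron rather than an OpenNebula hook: it pings each domain’s QEMU monitor and kills the process if the monitor stops answering. The domain prefix, timeouts and logging are all assumptions, and oned will only notice the killed VM on its next monitoring cycle.

    #!/bin/bash
    # Dirty watchdog sketch: kill qemu processes whose monitor no longer answers.
    # Run from cron on the KVM host, e.g. every 5 minutes. Timeouts are guesses.
    for dom in $(virsh list --name | grep '^one-'); do
        if ! timeout 20 virsh qemu-monitor-command "$dom" \
                '{"execute":"query-status"}' >/dev/null 2>&1; then
            logger -t vm-watchdog "monitor of $dom unresponsive, killing it"
            timeout 30 virsh destroy "$dom" \
                || pkill -9 -f "qemu-system-x86_64.*-name $dom"
        fi
    done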

Regards,
Stephan

It seems like both of your problems relate to Linux/KVM rather than to OpenNebula itself. Maybe there is a smart way to make OpenNebula handle your use case, but the troubleshooting needs to happen at the hypervisor level (so KVM/libvirt).
Since one of you is using networked storage and the other isn’t, your problems might not even be related…

Anyway, some hopefully helpful tips to dig down to the cause:

  • Install munin-node on all your VMs and have them report to a munin server (or use any other monitoring/graphing tool you prefer). It is easy to use with the default plugins and should provide more information about the cause: interrupts, I/O stats, memory stats and CPU usage in daily/weekly/monthly graphs. If needed, you can zoom in to 5-minute values to pinpoint the start of a failure.
  • As a workaround for the get-up-at-3 issue: maybe add an every-minute crontab entry to the VMs that reboots them whenever the 15-minute load average is e.g. > 50 (assuming your workload allows this, of course); a crontab sketch follows after this list.
  • For debugging KVM/libvirt to find the root cause, https://fedoraproject.org/wiki/How_to_debug_Virtualization_problems has some nice pointers; a few example commands follow below as well.
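
A sketch of the crontab workaround from the second bullet, placed inside the guest; the threshold of 50 and the reboot path are only examples, tune them to your workload:

    # /etc/cron.d/load-reboot
    # Reboot the guest when the 15-minute load average exceeds 50.
    * * * * * root awk '{ if ($3 > 50) system("/sbin/reboot") }' /proc/loadavg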
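
And a few generic commands that are usually a good starting point on the hypervisor when a single domain misbehaves (the domain name one-125 and disk target vda are examples; iostat comes from the sysstat package):

    iostat -x 1 5                  # per-device utilisation and await times
    vmstat 1 5                     # overall I/O wait and blocked processes
    virsh domblkstat one-125 vda   # block stats of the suspect domain
    virsh dommemstat one-125       # balloon / memory stats of the domain
    # Host processes stuck in uninterruptible sleep (state D) point to blocked I/O:
    ps -eo pid,state,wchan:32,cmd | awk '$2 == "D"'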

Yeah, I totally agree, but as I’m not so familiar with KVM/libvirt itself, I started asking here (and I hoped there were other ONE users with the same problem).

Done; I have Zabbix running to report those values, and I can see there is nothing special right before the I/O wait grows on the VM. The host itself is relaxed.

I’m not a fan of doing load-triggered reboots, since they won’t fire if the load is really heavy.

Thanks, I have been there already, but those pointers are mostly about hypervisor issues. I already checked that the guests are using the correct emulator, etc.

Thanks and regards

Got the same problem here… did you ever find a solution?

BR

Hi all, you should also set QoS parameters on the VM disks to limit I/O.
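
For example, at the libvirt level a running domain can be throttled with blkdeviotune; the domain name, disk target and numbers below are only examples (OpenNebula can also carry equivalent limits in the DISK section of the template, depending on the version):

    # Cap disk.0 of domain one-125 (target vda) to ~50 MB/s and 500 IOPS, live:
    virsh blkdeviotune one-125 vda --total-bytes-sec 52428800 --total-iops-sec 500 --live

    # Show the limits currently applied:
    virsh blkdeviotune one-125 vda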