VMs crash one at a time

Hello

I'm running OpenNebula 5.2.0 on Debian 8.6 with KVM. Unfortunately, my VMs crash one at a time: no VNC, no network, nothing. Yet OpenNebula reports that everything is OK: the VM is monitored normally and its state is RUNNING.
I have to hard power off the VM and resume it.
The oned log looks fine. Has anyone had the same experience? Any hints on where to look for the problem?

Real memory: 113GB, Allocated: 218GB, Total: 284GB
Real CPU: 500, Allocated: 6400, Total: 6000
Number of VMs: 50
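
As a quick sanity check on the figures above, here is a minimal Python sketch that turns them into allocation ratios. The function name is made up for illustration, and whether the allocation level has anything to do with the crashes is not established anywhere in this thread.

```python
def allocation_ratio(allocated, total):
    """Fraction of the total capacity that has been allocated to VMs."""
    return allocated / total

# Figures as reported in the post above (memory in GB, CPU in OpenNebula CPU units).
mem_ratio = allocation_ratio(218, 284)
cpu_ratio = allocation_ratio(6400, 6000)

print(f"memory allocated: {mem_ratio:.1%}")
print(f"CPU allocated:    {cpu_ratio:.1%}")
```

With these numbers, allocated memory stays below the total, while allocated CPU (6400) is slightly above the total (6000).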

Hello! Check the logs inside the VM.

Hello @UAnton
I checked them earlier. /var/log/libvirt/qemu/one-NUM.log shows only variables, startup, and shutdown.
/var/log/one/NUM.log shows the usual log entries.
Which log do you mean?
Thanks

I mean the logs inside the crashed VM.

@UAnton I checked syslog and messages.
It seems the server blacked out:
Jan 15 20:49:44 mail postfix/qmgr[21830]: C1AD4A025A: removed
Jan 16 11:59:36 mail rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="393" x-info="http://www.rsyslog.com"] start

There are no log entries between 20:49:44 and 11:59:36. I think the VM crashed at 20:49 and came back after I started it again at 11:59.
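
A gap like this can also be found automatically instead of by eye. Below is a minimal Python sketch, assuming classic syslog timestamps ("Mon DD HH:MM:SS", English month names) at the start of each line. The function name, the one-hour threshold, and the year are assumptions for illustration only; syslog lines carry no year, so one has to be supplied.

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, min_gap=timedelta(hours=1), year=2017):
    """Return (previous, next, gap) tuples for consecutive entries at least min_gap apart."""
    gaps = []
    prev = None
    for line in lines:
        # The first 15 characters hold the timestamp, e.g. "Jan 15 20:49:44".
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and stamp - prev >= min_gap:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps

# The two syslog lines quoted in the post above.
sample = [
    "Jan 15 20:49:44 mail postfix/qmgr[21830]: C1AD4A025A: removed",
    'Jan 16 11:59:36 mail rsyslogd: [origin software="rsyslogd" swVersion="8.4.2"] start',
]
for before, after, gap in find_log_gaps(sample):
    print(f"no entries between {before} and {after} (gap of {gap})")
```

Running the same function over the syslog of each affected VM would show whether the silent periods line up with the times the VMs became unreachable.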

Hi Arash!
Let’s see if we can find more information about what happened.

Could you run onehost show X | grep ERROR, where X is the ID of the KVM node on which the VMs you had to reboot were running? Let’s see whether OpenNebula noticed any error on the KVM node.

If all VMs showed the RUNNING state and not UNKNOWN, that would mean the KVM process was telling OpenNebula that the VMs were indeed running, even though you could not access them. The fact that you could hard power off the VMs means the KVM process was still able to respond to requests.

Can you filter /var/log/syslog on your KVM node for libvirtd messages that may explain what happened to the VMs running on that node? Are there any out-of-memory errors, stack traces, or I/O errors in your node’s logs?
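
One possible way to do that filtering is sketched below in Python. The keyword list is an assumption based on strings the kernel and libvirtd commonly log (OOM killer, I/O errors, call traces, hung tasks). It is not exhaustive and should be adjusted to your setup.

```python
import re

# Assumed keywords; extend or trim as needed.
SUSPICIOUS = re.compile(
    r"libvirtd|out of memory|oom-killer|i/o error|call trace|blocked for more than",
    re.IGNORECASE,
)

def suspicious_lines(lines):
    """Return the log lines that match any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]

# Usage on the KVM node (reading /var/log/syslog usually needs root):
# with open("/var/log/syslog", errors="replace") as log:
#     for hit in suspicious_lines(log):
#         print(hit)
```

If this returns nothing around the time a VM froze, that at least narrows things down to problems the host did not log.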

Is this the first time you’ve seen this issue, or has it been happening periodically?

Cheers!

Hello @mcabrerizo
Thank you for your answer
The command onehost show X | grep ERROR doesn't show anything; the output is empty. I also ran the command without grep and there are no errors.
There are no error entries for that particular machine in /var/log/syslog or /var/log/messages between 20:49 and 11:59.
Only one message is repeated many times, something like: the device for one-NUM entered promiscuous mode.

All VMs are in the RUNNING state now, so I have to wait and check when the problem comes back. There is nothing in /var/log/syslog about libvirt, errors, or anything else.

It has happened many times, one machine at a time. For example, last week we had this issue with another VM on this host.

What do you suggest?

Hi Arash!
I’m not a Debian guy, so I hope I’m not suggesting odd things for KVM troubleshooting. I’m installing a Debian VM so I can see which other files you could check.

As you haven’t found any errors in the files mentioned, I would also check whether there is anything odd in the /var/log/dmesg file on your KVM node. The point is that if you can’t find any error or hint related to KVM or the kernel, it will be quite difficult to tell whether those VMs are failing because of a storage problem, a QEMU option, blocked I/O, memory… As the KVM process reports those VMs to OpenNebula as running, I’d focus on KVM troubleshooting. If you’re using shared storage, I’d also check for performance issues. Sorry for being so vague, but if no log message is found, I can’t imagine what the issue could be.

Maybe there’s a bug in Debian (kernel or KVM related), so if you haven’t already done so, I’d check whether there are updates for your node’s packages, or look for a bug in the Debian bug list related to the qemu-kvm package version you have. Proceed with caution, of course.

I’ll keep thinking about what else you can try.

Cheers!

I have nearly the same issue, except that it’s a group of 8 VMs out of 80 that all crash at exactly the same time. There is nothing on the hypervisors indicating a problem. The VNC console for each VM is completely unresponsive and shows no sign of a kernel panic or anything of the sort. I’ve tried bringing up the VMs on different hypervisors and still have the same issue: they crash hard every 20-30 hours.

@Jake_Burns,
I couldn't find any solution to the problem. I changed the hypervisor's OS from Debian 8 to Ubuntu 16.04 and everything has been fine so far.
I think the problem was somehow kernel-related: by changing the OS I upgraded from the 3.18 kernel to 4.4.