I have an odd issue that I am not sure how to troubleshoot.
If I launch a new VM from a template, it hangs in the pending state for up to 35 minutes or until I restart one.
After that, the next interval triggers, the VMs are dispatched, and everything is fine. If I instantiate more VMs shortly afterwards, they too launch fine, more or less within the 30 seconds configured in sched.conf. But at some point, after about 5-10 minutes, new VMs are no longer dispatched within the expected time.
I upped the log level and can see the time taken for the last two scheduling actions: roughly 25 and 35 minutes.
Fri Jan 8 15:08:27 2016 [Z0][SCHED][D]: Dispatching VMs to hosts. Total time: 0.02s
Fri Jan 8 15:34:03 2016 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 1506.22s
Fri Jan 8 15:34:03 2016 [Z0][SCHED][D]: Getting VM and Host information. Total time: 0.01s
Fri Jan 8 16:10:02 2016 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 2129.15s
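For anyone who wants to pull these timings out of a longer scheduler log, here is a minimal Python sketch. It assumes every timing line has exactly the format of the four entries above. The function name `slow_phases` and the 60-second threshold are my own choices for illustration, not anything OpenNebula provides.

```python
import re
from datetime import datetime

# Matches lines such as:
# Fri Jan 8 15:34:03 2016 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 1506.22s
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\[Z\d+\]\[SCHED\]\[\w\]: "
    r"(?P<phase>.+?)\. Total time: (?P<secs>\d+(?:\.\d+)?)s\s*$"
)


def slow_phases(lines, threshold_s=60.0):
    """Return (timestamp, phase, seconds) for phases slower than threshold_s."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a "Total time" line in the assumed format
        secs = float(m.group("secs"))
        if secs > threshold_s:
            # Collapse repeated whitespace so padded single-digit days parse.
            stamp = " ".join(m.group("ts").split())
            ts = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            hits.append((ts, m.group("phase"), secs))
    return hits


# The four log lines quoted in this post.
sample = """\
Fri Jan 8 15:08:27 2016 [Z0][SCHED][D]: Dispatching VMs to hosts. Total time: 0.02s
Fri Jan 8 15:34:03 2016 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 1506.22s
Fri Jan 8 15:34:03 2016 [Z0][SCHED][D]: Getting VM and Host information. Total time: 0.01s
Fri Jan 8 16:10:02 2016 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 2129.15s
"""

for ts, phase, secs in slow_phases(sample.splitlines()):
    print(f"{ts:%H:%M:%S}  {phase}: {secs / 60:.1f} min")
```

On the sample, only the two "Getting scheduled actions information" entries exceed the threshold, which matches what I see by eye: dispatching and the VM/Host query are fast, and the wait is all in that one phase.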
As far as I can see, all other activity and actions carry on as normal. Existing VMs can migrate around, etc. There is only one cluster of 3 CentOS KVM hosts. There does not appear to be any info in the various logs to indicate why the VMs are waiting to be dispatched.
The resources allocated are not beyond those available. Scheduling is set to spread VMs between hosts. Storage is a Ceph cluster. The same scheduling issue happens with 40 MB ttylinux test VMs and 50 GB Ubuntu VMs.
The three hosts are in the same cluster and the templates require only that they are deployed to that cluster.
I can’t find evidence that the system is unable to decide which of the 3 hosts to send each VM to, other than that long information-gathering period. The dispatch itself takes moments.
I was worried about those ~1000s for the scheduled actions. Those are
actually obtained through the same API call as an onevm list command. The
size of the pools (VM, Host,…) is the main factor that determines the
scheduling execution time. Could it be a networking problem on the machine
running the scheduler, for example an http_proxy variable or name resolution?
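To make those checks concrete, here is a rough Python sketch of what could be run on the scheduler machine: list any proxy variables that are set, time a name lookup, and time an onevm list call. The helper names (`proxy_settings`, `timed`, `run_checks`) are made up for this example, and the host and port defaults are placeholders to replace with your own endpoint. Note that it only sees the environment of whoever runs it, which may differ from the environment the scheduler process was started with.

```python
import os
import socket
import subprocess
import time

# Proxy-related variables commonly honoured by HTTP client libraries.
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy",
              "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def proxy_settings(env):
    """Return the proxy-related variables that are set (non-empty) in env."""
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


def run_checks(host="localhost", port=2633):
    """Print proxy variables and time a name lookup and an onevm list call.

    host and port are placeholders (assumptions); use your own endpoint.
    """
    print("proxy variables:", proxy_settings(os.environ) or "none set")

    _, dns_s = timed(socket.getaddrinfo, host, port)
    print(f"name resolution for {host}: {dns_s:.3f}s")

    # onevm list is the command named above; its output is discarded here.
    _, list_s = timed(subprocess.run, ["onevm", "list"],
                      stdout=subprocess.DEVNULL, check=False)
    print(f"onevm list: {list_s:.3f}s")
```

Calling `run_checks()` as the same user that runs the scheduler gives a quick comparison: if onevm list returns in well under a second while the scheduler reports ~1000s for the same kind of call, the difference is more likely in the scheduler's own environment than in pool size.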
I have had issues with the HTTP proxy before, but they are now resolved. The same or a similar log entry appears when I restart one, but that is because the process is briefly down and it assumes it’s unreachable. There is no delay with name resolution either. After restarting one, it always catches the pending deployments and they are out within 30 seconds.
I was expecting to see something in the log suggesting it can’t come to a decision about which host to dispatch to, since they are all ‘equal’. I suspect it’s something funky with my environment, but I don’t see it.