We stumbled upon a blocking scheduler bug in 6.8 and 6.10 and found that it is actually pretty old: it has probably been around since 5.10.
How to reproduce the bug:
- Create 2 clusters A and B.
- Give each cluster one host with, for example, 16 cores.
- Create a VM, let’s call it VM(a) with 20 cores, and assign it to cluster A.
This VM will of course never fit there; it just mimics a cluster that other VMs have already filled, so that the next one (VM(a)) doesn’t fit.
- Create another VM, let’s call it VM(b), with 4 cores and assign it to cluster B.
The scheduler should pick it up and place it in cluster B, where it would fit just fine.
The bug: It doesn’t.
Debug shows: Host 2 discarded for VM 9054. Cannot allocate NUMA topology
Reason: The loop in Scheduler.cc doesn’t start from a clean “HostShareCapacity” struct each time it calls vm->get_capacity(sr). As a result, NUMA information from VM(a) is still in the struct when the loop goes through the data for VM(b).
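
To illustrate the pattern, here is a minimal, self-contained sketch of how a capacity struct reused across loop iterations can leak NUMA data from one VM into the next. This is not the actual OpenNebula code: CapacityRequest, Vm, host_fits and schedule are hypothetical stand-ins, and the assumption that the getter only writes NUMA data for VMs that have a topology is mine, not something confirmed from Scheduler.cc.

```cpp
#include <vector>

// Hypothetical stand-in for the capacity struct (NOT the real HostShareCapacity).
struct CapacityRequest {
    int cpu = 0;
    std::vector<int> numa_nodes;  // cores requested per NUMA node
};

// Hypothetical VM: only some VMs carry a NUMA topology.
struct Vm {
    int cpu;
    bool has_numa;
    std::vector<int> numa_nodes;

    // Assumed behaviour: NUMA data is written only when the VM has a topology,
    // so a reused struct keeps whatever the previous VM left in it.
    void get_capacity(CapacityRequest& sr) const {
        sr.cpu = cpu;
        if (has_numa) {
            sr.numa_nodes = numa_nodes;
        }
    }
};

// Simplified fit check: plain CPU count plus the summed NUMA request.
bool host_fits(const CapacityRequest& sr, int host_cores) {
    if (sr.cpu > host_cores) {
        return false;
    }
    int numa_total = 0;
    for (int cores : sr.numa_nodes) {
        numa_total += cores;
    }
    return numa_total <= host_cores;  // stale NUMA data fails here
}

// Returns one placement result per VM, in order.
std::vector<bool> schedule(const std::vector<Vm>& vms, int host_cores,
                           bool reset_each_iteration) {
    std::vector<bool> placed;
    CapacityRequest sr;  // declared outside the loop, as in the buggy pattern
    for (const Vm& vm : vms) {
        if (reset_each_iteration) {
            sr = CapacityRequest{};  // start from a clean struct for every VM
        }
        vm.get_capacity(sr);
        placed.push_back(host_fits(sr, host_cores));
    }
    return placed;
}
```

With a 20-core VM that has NUMA data followed by a 4-core VM without any, both checked against a 16-core host, the reused struct rejects the second VM as well; resetting it on every iteration lets the second VM fit. In this sketch the reset could just as well be done by declaring sr inside the loop.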