Blocker bug in scheduler since 5.10

We ran into a blocking scheduler bug in 6.8 and 6.10, and it turns out to be quite old: it has probably been around since 5.10.

How to reproduce the bug:

  • Create two clusters, A and B.
  • Assign each cluster one host with, for example, 16 cores.
  • Create a VM with 20 cores, let’s call it VM(a), and assign it to cluster A.

This VM will naturally never fit there. It only mimics a situation where a cluster is already so full of other VMs that the next one, VM(a), won’t fit.

  • Create another VM, let’s call it VM(b), with 4 cores and assign it to cluster B.

The scheduler should pick it up and assign it to cluster B, where it would fit just fine.

The bug: It doesn’t.
Debug shows: Host 2 discarded for VM 9054. Cannot allocate NUMA topology

Reason: The loop in Scheduler.cc doesn’t start from a clean “HostShareCapacity” struct each time it runs vm->get_capacity(sr). As a result, the struct still carries NUMA information from VM(a) when the loop processes the data for VM(b).
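
To illustrate the pattern, here is a minimal standalone sketch. It is not the actual OpenNebula code: CapacityRequest, Vm, fits, schedule_shared and schedule_fresh are hypothetical names, and I am assuming that the capacity getter overwrites scalar fields but appends NUMA entries. For simplicity every VM is checked against its own 16-core host.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for the per-VM capacity request
// (NOT the real HostShareCapacity definition).
struct CapacityRequest {
    int cpu = 0;
    std::vector<int> numa_cores;  // cores requested per NUMA node
};

// Hypothetical VM description.
struct Vm {
    int cpu;
    std::vector<int> numa_cores;

    // Assumed behaviour: scalars are overwritten, NUMA entries are appended,
    // so the result is only correct if the caller passes in a clean request.
    void get_capacity(CapacityRequest& req) const {
        req.cpu = cpu;
        req.numa_cores.insert(req.numa_cores.end(),
                              numa_cores.begin(), numa_cores.end());
    }
};

// Simplified fit check: both the CPU count and the NUMA total must fit.
bool fits(const CapacityRequest& req, int host_cores) {
    int numa_total = std::accumulate(req.numa_cores.begin(),
                                     req.numa_cores.end(), 0);
    return req.cpu <= host_cores && numa_total <= host_cores;
}

// Buggy pattern: one request object is reused for every VM in the loop.
std::vector<bool> schedule_shared(const std::vector<Vm>& vms, int host_cores) {
    std::vector<bool> placed;
    CapacityRequest req;  // declared once, never reset
    for (const Vm& vm : vms) {
        vm.get_capacity(req);  // NUMA data of earlier VMs is still in req
        placed.push_back(fits(req, host_cores));
    }
    return placed;
}

// Fixed pattern: a clean request object for every VM.
std::vector<bool> schedule_fresh(const std::vector<Vm>& vms, int host_cores) {
    std::vector<bool> placed;
    for (const Vm& vm : vms) {
        CapacityRequest req;  // fresh on each iteration
        vm.get_capacity(req);
        placed.push_back(fits(req, host_cores));
    }
    return placed;
}
```

With VM(a) as {20, {10, 10}} and VM(b) as {4, {}} on 16-core hosts, schedule_shared rejects both VMs because VM(b) inherits VM(a)'s NUMA request, while schedule_fresh rejects only VM(a). Whether VM(a) in the real setup needs a NUMA topology is part of the assumption here.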

Hello @tosaraja,

I think the best thing to do here is to report it in the repository (the OpenNebula/one issues).

Once the issue is opened, let me know the number so I can check with the team.

Cheers,

Created: “Blocker bug in scheduler since 5.10”, Issue #7071 in OpenNebula/one on GitHub.
And a PR as well.
