Has something changed in the network stack on 6.0?

I had a running 5.12 cluster in my home lab and decided to upgrade/migrate it to 6.0.

Having been an IT admin for some 20 years, I was able to migrate it easily enough; however, I have a slightly odd networking config.

I static route (next-hop) a private /24 to each host. This may not be the intended functionality, but for my network topology it worked the way I wanted, which was to have natively routable connectivity to subnets other than my main LAN.

So the setup was pretty simple.

router (172.16.0.1) → static route 192.168.3.0/24 → 172.16.0.50 (ONE KVM host)

KVM Host
eth0 172.16.0.50
virbr0 192.168.3.1

The Virtual Network was configured accordingly, with the correct subnet, netmask, and gateway…
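In iproute2 terms, the routing side looks roughly like this (the router itself isn't Linux, so its syntax differs, but this is the idea):

    # on the router: send the VM subnet to the KVM host
    ip route add 192.168.3.0/24 via 172.16.0.50

    # on the KVM host
    ip addr show eth0      # 172.16.0.50/24, default via 172.16.0.1
    ip addr show virbr0    # 192.168.3.1/24, gateway for the VMs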

Now, this worked “OK”. I was not in love with the setup, because it meant I had to manually route a /24 to each host, and when creating VMs I had to manually specify which host and network to use. I would rather it “figure it out”, so to speak: I want to say that this virtual network is bound to this host and this host only, and that this network should only be used on this host. But as far as I can tell, OpenNebula does not have a way to automatically lock hosts to specific networks.

Right, so on to the main issue (and I can’t really find anything in the logs telling me why this is happening)…

But now, after I have upgraded to 6.0, every time a network gets attached to a VM, the host’s eth0 loses its IP and gateway. The address literally gets removed from eth0, the host goes offline, and I have to either reboot it or restart the network stack.

Interestingly, if I restart the network stack (down && up on eth0), everything starts working again: the host is pingable and the VM is pingable. But if I reboot the VM or start a new one (rather, I should say attach a network to a VM, because VMs with no network attached do not cause the problem), everything goes down again and eth0 on the host loses its IP/route.
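For reference, bringing the host back looks roughly like this (addresses from my setup; either bouncing the interface so the distro re-applies its config, or re-adding things by hand):

    ip link set eth0 down && ip link set eth0 up
    ip addr add 172.16.0.50/24 dev eth0      # if the address doesn't come back on its own
    ip route add default via 172.16.0.1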

So has something major changed in 6.0 that would cause this?

Hello @decryptedchaos,

could you share the output of onevnet show -x <id> for your virtual network?

But as far as I can tell, OpenNebula does not have a way to automatically lock hosts to specific networks.

Regarding the above, maybe putting the host and the virtual network in a cluster does the trick. Remember to put the corresponding datastores (DS) in the cluster too.
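Something along these lines, as a rough sketch (the cluster name is just an example; use the real host/network/datastore IDs):

    onecluster create kvm-host1-net
    onecluster addhost kvm-host1-net <host_id>
    onecluster addvnet kvm-host1-net <vnet_id>
    onecluster adddatastore kvm-host1-net <ds_id>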

<VNET>
      <ID>3</ID>
      <UID>0</UID>
      <GID>0</GID>
      <UNAME>oneadmin</UNAME>
      <GNAME>oneadmin</GNAME>
      <NAME>Test-3</NAME>
      <PERMISSIONS>
        <OWNER_U>1</OWNER_U>
        <OWNER_M>1</OWNER_M>
        <OWNER_A>0</OWNER_A>
        <GROUP_U>0</GROUP_U>
        <GROUP_M>0</GROUP_M>
        <GROUP_A>0</GROUP_A>
        <OTHER_U>0</OTHER_U>
        <OTHER_M>0</OTHER_M>
        <OTHER_A>0</OTHER_A>
      </PERMISSIONS>
      <CLUSTERS>
        <ID>0</ID>
        <ID>100</ID>
      </CLUSTERS>
      <BRIDGE><![CDATA[virbr0]]></BRIDGE>
      <BRIDGE_TYPE><![CDATA[linux]]></BRIDGE_TYPE>
      <PARENT_NETWORK_ID/>
      <VN_MAD><![CDATA[fw]]></VN_MAD>
      <PHYDEV><![CDATA[enp6s0]]></PHYDEV>
      <VLAN_ID/>
      <OUTER_VLAN_ID/>
      <VLAN_ID_AUTOMATIC>0</VLAN_ID_AUTOMATIC>
      <OUTER_VLAN_ID_AUTOMATIC>0</OUTER_VLAN_ID_AUTOMATIC>
      <USED_LEASES>3</USED_LEASES>
      <VROUTERS/>
      <TEMPLATE>
        <BRIDGE><![CDATA[virbr0]]></BRIDGE>
        <BRIDGE_TYPE><![CDATA[linux]]></BRIDGE_TYPE>
        <CLUSTER_IDS><![CDATA[0]]></CLUSTER_IDS>
        <DNS><![CDATA[4.2.2.2]]></DNS>
        <GATEWAY><![CDATA[192.168.3.1]]></GATEWAY>
        <GUEST_MTU><![CDATA[1500]]></GUEST_MTU>
        <NETWORK_ADDRESS><![CDATA[192.168.3.0]]></NETWORK_ADDRESS>
        <NETWORK_MASK><![CDATA[255.255.255.0]]></NETWORK_MASK>
        <PHYDEV><![CDATA[enp6s0]]></PHYDEV>
        <SECURITY_GROUPS><![CDATA[0]]></SECURITY_GROUPS>
        <TEMPLATE_ID><![CDATA[0]]></TEMPLATE_ID>
        <VN_MAD><![CDATA[fw]]></VN_MAD>
      </TEMPLATE>
      <AR_POOL>
        <AR>
          <AR_ID><![CDATA[0]]></AR_ID>
          <IP><![CDATA[192.168.3.2]]></IP>
          <MAC><![CDATA[02:00:c0:a8:03:02]]></MAC>
          <SIZE><![CDATA[200]]></SIZE>
          <TYPE><![CDATA[IP4]]></TYPE>
          <MAC_END><![CDATA[02:00:c0:a8:03:c9]]></MAC_END>
          <IP_END><![CDATA[192.168.3.201]]></IP_END>
          <USED_LEASES>3</USED_LEASES>
          <LEASES>
            <LEASE>
              <IP><![CDATA[192.168.3.2]]></IP>
              <MAC><![CDATA[02:00:c0:a8:03:02]]></MAC>
              <VM><![CDATA[33]]></VM>
            </LEASE>
            <LEASE>
              <IP><![CDATA[192.168.3.3]]></IP>
              <MAC><![CDATA[02:00:c0:a8:03:03]]></MAC>
              <VM><![CDATA[34]]></VM>
            </LEASE>
            <LEASE>
              <IP><![CDATA[192.168.3.4]]></IP>
              <MAC><![CDATA[02:00:c0:a8:03:04]]></MAC>
              <VM><![CDATA[2]]></VM>
            </LEASE>
          </LEASES>
        </AR>
      </AR_POOL>
    </VNET>

Regarding the cluster tip, that might be a valid solution; I’ll test it if/when I get this network issue solved.

In further testing, this appears to be libvirt related? Or at least ONE is passing something into the libvirt network that causes it.

I just wish I had a detailed breakdown of every single step that happens on the ONE backend; oned.log and such don’t have anything that’s cluing me in to the cause.
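For reference, these are the logs I’ve been tailing (standard locations on a default install, as far as I know):

    tail -f /var/log/one/oned.log         # core daemon log
    tail -f /var/log/one/<vm_id>.log      # per-VM log, where that VM's driver actions show up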

EDIT: Okay, so this looks to be coming from how ONE is trying to configure the bridge.
In the console you can see the frontend SSH into the host and run /sbin/ip link set virbr0 up followed by /sbin/ip link set enp6s0 master virbr0.

This is where I think the issue is coming from: virbr0 already exists and ONE is completely reconfiguring the bridge topology.

Furthermore, I have discovered the point at which this happens:

ip link set enp6s0 master virbr0

This causes the problem. For some reason, once this happens, connectivity via enp6s0 no longer works.
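What I see on the host right after the attach is roughly this (assuming enp6s0 is the uplink that was carrying 172.16.0.50):

    bridge link show            # enp6s0 now shows up with "master virbr0"
    ip -br addr show enp6s0     # the 172.16.0.50 address is gone
    ip route show               # the default route via 172.16.0.1 is gone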

Any suggestions for troubleshooting?

It seems that the behavior change you’re seeing is related to this commit.

The problem was that the PHYDEV attribute, defined in the documentation as:

Name of the physical network device that will be attached to the bridge (does not apply for dummy driver)

was not taken into account in version 5.12 because of a bug. For 6.0.0.1, the commit referenced above was added to bring the behavior back in line with the documentation (which was the previous and expected behavior).

As you’ve mentioned, when the PHYDEV (enp6s0 in your case) is plugged into the bridge, its network configuration won’t have any effect outside of it. So, in order to keep using a similar environment, you would need to:

  • Move the network configuration (i.e. the IP address) to the bridge.
  • Plug the physical interface into the bridge (if it’s not already done).
  • Set the keep_empty_bridge configuration attribute to the proper value, to make sure the bridge is not deleted when no VMs are using it, so the configuration is not lost (see the sketch below).
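A rough, non-persistent sketch of the first two steps, assuming enp6s0 is the interface currently carrying the host’s 172.16.0.50 address (you would make this persistent in your distribution’s network configuration):

    ip addr del 172.16.0.50/24 dev enp6s0    # remove the address from the physical NIC
    ip link set enp6s0 master virbr0         # plug the NIC into the bridge
    ip addr add 172.16.0.50/24 dev virbr0    # put the host address on the bridge instead
    ip route add default via 172.16.0.1      # restore the default route, now via the bridge

For keep_empty_bridge, if I recall correctly it is set in /var/lib/one/remotes/etc/vnm/OpenNebulaNetwork.conf on the frontend and then pushed to the hosts:

    :keep_empty_bridge: true
    # then sync the hosts: onehost sync --force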

Also, you could manually manage the bridge networking without OpenNebula interfering with it by using VN_MAD=dummy, but that won’t allow you to set security groups.
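For example, a minimal network template sketch for the dummy driver, reusing the values from your XML above (the name is just an example; security groups would no longer apply):

    cat > test3-dummy.tmpl <<'EOF'
    NAME            = "Test-3-dummy"
    VN_MAD          = "dummy"
    BRIDGE          = "virbr0"
    NETWORK_ADDRESS = "192.168.3.0"
    NETWORK_MASK    = "255.255.255.0"
    GATEWAY         = "192.168.3.1"
    DNS             = "4.2.2.2"
    AR = [ TYPE = "IP4", IP = "192.168.3.2", SIZE = "200" ]
    EOF
    onevnet create test3-dummy.tmpl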

I have solved this with the use of the VN_MAD=dummy driver. It works best in my case, as I rarely need security groups in the lab and I already have carrier-grade routers and switches.

A further note: separating the clusters doesn’t appear to work for IP segregation, because the template doesn’t select a network by default, and the only way around it is to have templates for each cluster, which would be slightly annoying to manage. If you have further advice on making this work more seamlessly, please advise.

Nice to hear that it helped!

For your networking use case, maybe the automatic network selection helps: Scheduling Policies — OpenNebula 6.0.1 documentation
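Roughly, the idea is to define the NIC in automatic mode in the VM template and let the scheduler pick a matching network. The TRAFFIC_TYPE filter below is just an example of a custom attribute you could add to each virtual network (syntax from memory, so double-check it against the docs):

    NIC = [
      NETWORK_MODE       = "auto",
      SCHED_REQUIREMENTS = "TRAFFIC_TYPE = \"lab\"" ]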