I am experiencing a strange problem with VXLANs, and my primary question is whether somebody else has seen this (I don’t think this is a problem inside ONe, but maybe in the kernel or NIC drivers):
My cluster hosts have two NICs, and I dedicated “eth1” to ONe virtual networks with VXLANs. So eth1 on each host has MTU 9000 and a private IPv4 address (all hosts in the same IPv4 prefix), and all hosts have their eth1 connected to the same L2 network (a VLAN on the same switch). I use VXLANs as ONe VNETs with PHYDEV=eth1. Broadcast and multicast traffic of the overlay networks is mapped to L3 multicast on the underlay eth1; no BGP EVPN or other software-based logic is used. The ethernet switch does not have IGMP snooping enabled.
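Since the overlay broadcast ends up as an underlay IPv4 multicast group, and that group in turn maps to an L2 multicast MAC address on eth1, here is a minimal Python sketch of the standard IPv4-to-Ethernet mapping (01:00:5e plus the low 23 bits of the group address). The function name multicast_mac and the example groups are only illustrative; they are not the groups used in my cluster.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Return the Ethernet MAC that an IPv4 multicast group maps to."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Only the low 23 bits of the group address survive the mapping.
    low23 = int(addr) & 0x7FFFFF
    mac = (0x01005E << 24) | low23
    # Format the 48-bit value as six colon-separated hex octets.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


if __name__ == "__main__":
    # Hypothetical example groups, not taken from a real deployment.
    for group in ("239.0.0.5", "239.128.0.5", "239.1.2.3"):
        print(group, "->", multicast_mac(group))
```

Because only 23 bits are kept, 32 different groups share each MAC address (239.0.0.5 and 239.128.0.5 in the example), so a NIC filter that matches on the MAC alone cannot tell them apart.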
Now when I create a new VNet/VXLAN and attach two VMs to it, they can ping each other. Sometimes it is a bit sluggish: they start replying to ping only after several seconds. So I think my setup is mostly correct. Then I created a new VNet for about 400 VMs and a virtual router. When I monitor the connectivity from the virtual router, some of the 400 VMs randomly stop replying to ping and recover several minutes later. For some of the VMs, the outage is much longer, like several hours. Usually, after a reboot or a power down/up cycle, they start replying to ping again.
It took me a few days to debug, but it sometimes helped to remove the VM MAC address from the VXLAN virtual bridge (one.vni) using the bridge fdb delete command and let the bridge learn the MAC address again. But for some VMs this did not work: I was not able to ping them from the gateway for several hours. According to tcpdump -i one.vni on the host where the virtual gateway was running, the router sent an ARP request for that VM as expected, but did not get a reply. tcpdump -i eth1 showed the ARP request leaving the host encapsulated in underlay network multicast, with the correct L2 multicast address. However, on the other host, where the VM being pinged was running, tcpdump -i eth1 did not show that multicast packet at all, even though some other multicast packets were coming through successfully.
Only after I ran tcpdump without the -p switch (I use -p by default) did the ping start to work, and of course the first captured packet was the missing ARP request encapsulated in underlay network multicast. So I started to suspect the multicast filters inside the NIC. But ifconfig eth1 allmulti did not help at all: after I deleted the newly learned ARP table entry from the virtual router, it could not learn it back. After ifconfig eth1 promisc, however, it worked again.
So now my cluster hosts have their eth1 cards in promisc mode, and all 400 VMs have been able to ping each other reliably, without connectivity outages, for the 6 days since I put the eth1 interfaces into promisc mode. I wonder where the problem can be: with allmulti not fixing the problem, I think the limit on the number of HW multicast filters inside the NIC can be ruled out. With promisc fixing the problem, the L2 switch setup can also be ruled out. Could it be that the kernel is setting the multicast filters incorrectly?
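One way to test that hypothesis would be to check, on the receiving host, whether the expected L2 multicast address is in the kernel's multicast list for eth1 at the moment the packets go missing. Below is a rough Python sketch that parses text in the format of /proc/net/dev_mcast. I am assuming the layout is five whitespace-separated fields per line (interface index, interface name, reference count, global use count, MAC as hex without separators); please verify that on your kernel. The function names and the SAMPLE text are made up for illustration and are not a capture from my hosts.

```python
def parse_dev_mcast(text: str) -> dict:
    """Map interface name -> set of multicast MACs (lowercase hex, no colons)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        _ifindex, ifname, _refcount, _global_use, mac = fields
        table.setdefault(ifname, set()).add(mac.lower())
    return table


def is_subscribed(text: str, ifname: str, mac: str) -> bool:
    """True if the MAC (with or without colons) is listed for the interface."""
    wanted = mac.replace(":", "").lower()
    return wanted in parse_dev_mcast(text).get(ifname, set())


# Illustrative input only, written by hand in the assumed format.
SAMPLE = """\
2    eth1            1     0     01005e000001
2    eth1            1     0     01005e000005
"""
```

On a real host the text would come from open("/proc/net/dev_mcast").read(). If the address is listed there while the packets still do not show up in tcpdump -p -i eth1, that would point more towards the NIC or its driver than towards the kernel's group membership bookkeeping.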
Does anybody here use VXLAN with underlay network multicast (i.e. without BGP-EVPN)? Does it work reliably for you? What else should I try in order to debug the problem?