VXLAN broadcast unreliable (fixed by ifconfig promisc, but not by allmulti)

Hi all,

I am experiencing a strange problem with VXLANs, and my primary question is whether somebody else has seen this (I don’t think this is a problem inside ONe, but maybe in the kernel or NIC drivers):

My cluster hosts have two NICs, and I have dedicated eth1 to ONe virtual networks with VXLANs. So eth1 on each host has MTU 9000 and a private IPv4 address (all hosts in the same IPv4 prefix), and all hosts have their eth1 connected to the same L2 network (a single VLAN on the same switch). I use VXLANs as ONe VNETs with PHYDEV=eth1. Broadcast/multicast of the overlay networks is mapped to L3 multicast on the underlay eth1; no BGP EVPN or other software-based logic is used. The Ethernet switch does not have IGMP snooping enabled.
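For reference, the kernel-level equivalent of such a VNET is roughly the following (a sketch only; the device names, VNI and multicast group are illustrative, not necessarily what ONe generates):

    # VXLAN device on top of eth1, BUM traffic mapped to an L3 multicast group
    ip link add vxlan94 type vxlan id 94 group 239.0.0.94 dev eth1 dstport 8472
    # bridge the VXLAN device together with the VM tap interfaces
    ip link add one.vni type bridge
    ip link set vxlan94 master one.vni
    ip link set vxlan94 up
    ip link set one.vni up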

Now when I create a new VNet/VXLAN and attach two VMs to it, they can ping each other. Sometimes it is a bit sluggish and they start replying to ping only after several seconds, so I think my setup is mostly correct. Then I created a new VNet for about 400 VMs and a virtual router. When I monitor the connectivity from the virtual router, some of the 400 VMs randomly stop replying to ping, and recover several minutes later. For some of the VMs the outage is much longer, like several hours. Usually after a reboot or a power down/up cycle they start pinging again.
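(My monitoring is nothing fancy, essentially a periodic sweep like the following from the virtual router; fping and the address range are just an example:)

    # print only the VMs which did not reply
    fping -q -c 1 -g 10.93.0.1 10.93.1.150 2>&1 | grep -v '= 1/1/'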

It took me a few days to debug. It sometimes helped to remove the VM MAC address from the VXLAN virtual bridge (one.vni) using the bridge fdb delete command, and let the bridge logic learn the MAC address again. But for some VMs this did not work, and I was not able to ping them from the gateway for several hours. According to tcpdump -i one.vni on the host where the virtual gateway was running, the router sent ARP requests for that VM as expected, but did not get a reply. tcpdump -i eth1 showed the ARP request being sent out from the host, encapsulated into underlay-network multicast with the correct L2 multicast address. However, on the other host, where the VM being pinged was running, tcpdump -i eth1 did not show that multicast packet at all, even though some other multicast packets were coming through successfully.
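For the record, this is what the debugging looked like (the MAC address is an example):

    # on the gateway's host: forget the VM and let the bridge re-learn it
    bridge fdb del 02:00:0a:5d:01:05 dev one.vni
    # on the gateway's host: the ARP request is visible on the overlay and the underlay
    tcpdump -p -i one.vni arp
    tcpdump -p -i eth1 ip multicast
    # on the VM's host: the same underlay tcpdump did not show the packet at all
    tcpdump -p -i eth1 ip multicast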

Only after I ran tcpdump without the -p switch (I use -p by default) did the ping start to work, and of course the first packet tcpdump captured was the missing ARP request encapsulated in underlay-network multicast. So I started to suspect the multicast filters inside the NIC. But ifconfig eth1 allmulti did not help at all: after I deleted the newly-learned ARP table entry from the virtual router, it could not learn it back. But after ifconfig eth1 promisc it worked again.
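To summarize (with the iproute2 equivalents in the comments):

    tcpdump -i eth1 ip multicast   # without -p, tcpdump puts eth1 into promisc mode -> ping recovers
    ifconfig eth1 allmulti         # ip link set eth1 allmulticast on -> does NOT help
    ifconfig eth1 promisc          # ip link set eth1 promisc on     -> helps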

So now my cluster hosts have their eth1 cards in promisc mode, and all 400 VMs have been pinging each other reliably, without connectivity outages, for the 6 days since I put the eth1s into promisc mode. I wonder where the problem can be. With allmulti not fixing the problem, I think a limit on the number of HW multicast filters inside the NIC can be ruled out. With promisc fixing the problem, the L2 switch setup can also be ruled out. Could it be that the kernel is setting the multicast filters incorrectly?
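If somebody wants to compare notes, this is what I look at to see which multicast filters the kernel thinks it has programmed (the ethtool counters depend on the driver exposing them):

    ip maddr show dev eth1           # multicast addresses joined on eth1
    cat /proc/net/igmp               # per-interface IGMP group memberships
    ethtool -S eth1 | grep -i mcast  # driver multicast counters, if available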

Does anybody here use VXLAN with underlay-network multicast (i.e. without BGP-EVPN)? Does it work reliably for you? What else should I try in order to debug the problem?

Thanks,

-Yenya

Hi @Yenya ,

Are you using a proper netmask? Note that a /24 can only address 254 hosts; you’ll need at least a /23 netmask to address up to 510 hosts.

On the other hand, if you run tcpdump with the -e option and the vlan filter, the VLAN-tagged traffic will be displayed including the tags (maybe you will find this useful for debugging).
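For example (interface name is illustrative):

    tcpdump -e -n -i eth1 vlan    # -e prints link-level headers, the vlan filter matches only 802.1Q-tagged frames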

Cheers.

Hi Ricardo,

yes, the overlay network uses a /16 prefix. And of course with a wrong netmask there would be VMs that consistently could not communicate with the virtual router at all, which is not what I observed.

There is no dot1q-tagged traffic on eth1. I mentioned VLANs only in the context that all eth1 ports are connected to a single VLAN on the switch (i.e. to a single L2 network). The traffic over the wire between eth1 and the switch is of course untagged.

Thanks,

-Yenya

Hi Jan,
I had similar issues and checked the rp_filter on the underlying interface (in this case vxb94).
Setting net.ipv4.conf.vxb94.rp_filter=2 made it work reliably (it was a multicast issue before, when rp_filter was at its default of 1).
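That is (the sysctl.d file name is just my choice, any name works):

    sysctl -w net.ipv4.conf.vxb94.rp_filter=2
    echo 'net.ipv4.conf.vxb94.rp_filter = 2' > /etc/sysctl.d/90-rp_filter.conf

rp_filter=2 is “loose” reverse-path filtering: the source address only has to be reachable via some interface, not necessarily the one the packet arrived on.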
Best, Michael

Michael, thanks for the hint. It is late evening here and not many students are working, so I dared to do “ifconfig eth1 -promisc” on all the physical hosts. After that, I deleted every entry from the ARP table of the virtual router with “arp -d”. According to my ping-based monitoring, many VMs then went unreachable (pinging from that virtual router). Slowly, some of them managed to get an ARP request through and went online again, but not many (hundreds went down, and about 10-20 came back online within several minutes). Then I did “echo 2 > /proc/sys/net/ipv4/conf/eth1/rp_filter” on all the physical hosts. A few more VMs came back online, but the majority were still offline. After several minutes I did “ifconfig eth1 promisc”, and all the running VMs went back online and became reachable by ping from the virtual router.
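In iproute2 terms, the whole experiment was roughly this (a sketch; ip neigh flush stands in for the per-address arp -d loop):

    ip link set eth1 promisc off               # on every physical host
    ip neigh flush all                         # on the virtual router: force re-ARPing
    sysctl -w net.ipv4.conf.eth1.rp_filter=2   # on every physical host -> only a few VMs recovered
    ip link set eth1 promisc on                # on every physical host -> everything recovered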

So, thanks for the hint, but my problem seems to be different.

-Yenya