Architecture: 3-node Hyperconverged Cluster (Compute + Storage on the same nodes)
Compute: OpenNebula (KVM/Libvirt) with HA enabled.
Storage: Ceph (2-way replication, NVMe backed).
Hardware: Supermicro servers
OS: RHEL 9.6
The Issue:
When a Jenkins VM runs a legacy build pipeline (a massive git clone followed by heavy make compilation), the host goes down shortly after the workload starts. It looks like the primary network interface crashes and resets, causing Ceph OSDs to flap, QEMU to lose block storage, and OpenNebula to mark the host as ERROR and evacuate the VM. The VM migrates to the next node, repeats the workload, and crashes that node as well.
Hardware monitoring shows that CPU, RAM, and NVMe utilization are all very low (iowait is ~0.00, disk util <1%).
The logs show:
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: NETDEV WATCHDOG: CPU: 472: transmit queue 61 timed out 5584 ms
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout: VSI_seid: 390, Q 61, NTC: 0x150, HWB: 0x150, NTU: 0x174, TAIL: 0x174, INT: 0x1
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout recovery level 1, txqueue 61
May 18 04:21:34 kernel: i40e 0000:11:00.0 enp17s0f0: NIC Link is Down
May 18 04:21:40 kernel: i40e 0000:11:00.0 enp17s0f0: NIC Link is Up, 10 Gbps Full Duplex
May 18 04:21:26 ceph-osd[12166]: osd.4 845 heartbeat_check: no reply from 10.96.xx.xx:6832 osd.11
May 19 02:27:55 virtqemud[8829]: internal error: QEMU unexpectedly closed the monitor (vm='one-115'): ... error connecting: Connection timed out
I'm not sure whether this issue is caused by Ceph/OpenNebula or whether some specific configuration is missing.
Any help on this issue would be greatly appreciated!
The most suspicious indication is the tx_timeout. Assuming the log snippet comes from the host that errored, it usually means some obstruction further down the line. I would recommend analysing the network: maybe port buffer starvation on a switch?
At first glance, the TX timeout looks like backpressure, not a fault in the host or its NIC.
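To see whether the timeouts cluster on one queue or spread across many, the watchdog lines can be tallied per interface and queue. This is only a rough sketch: the function name summarize_tx_timeouts is made up for illustration, and it assumes the kernel log lines have the same shape as the ones posted above.

```shell
# Count NETDEV WATCHDOG transmit timeouts per interface and queue.
# Reads kernel log lines on stdin. Fields are located relative to the
# "WATCHDOG:" and "queue" tokens, so an extra hostname column does not matter.
summarize_tx_timeouts() {
    awk '
        /NETDEV WATCHDOG/ && /transmit queue/ {
            iface = ""; q = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "WATCHDOG:" && i > 2) iface = $(i - 2)
                if ($i == "queue" && i < NF)    q = $(i + 1)
            }
            sub(/:$/, "", iface)          # "enp17s0f0:" -> "enp17s0f0"
            count[iface " queue " q]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}

# Usage (kernel messages from the current boot only):
#   journalctl -k | summarize_tx_timeouts
```

If the same queue shows up every time, that points more toward a local queue or IRQ problem; timeouts scattered over many queues fit the backpressure theory better.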
Thank you for the insight! This is a very interesting point and might explain the architectural root cause of the pressure on the NIC.
Just to give you an update, we were able to successfully stabilize the nodes and stop the cascading crashes by applying ethtool tuning to the i40e interface (ethtool -K enp17s0f0 tso off gso off and ethtool -G enp17s0f0 rx 4096 tx 4096). The hardware ring buffers were defaulting to 512, and the bursts of Ceph replication traffic from the Jenkins build were triggering a soft lockup and tx_timeout in the i40e driver. The tuning acts as a shock absorber and the nodes survive the pipeline now.
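One caveat: settings applied with ethtool at runtime do not survive a reboot. Since the host is managed by NetworkManager, the same values could probably be stored in the connection profile. The sketch below only prints the nmcli command so it can be reviewed first; the profile name slave-enp17s0f0 is taken from the nmcli output later in this thread, and it assumes the NetworkManager version on the host supports the ethtool.ring-* and ethtool.feature-* properties.

```shell
# Profile name as it appears in `nmcli con show` on the affected node.
CONN="slave-enp17s0f0"
# Ring size that stabilised the nodes; check `ethtool -g <iface>` for the
# hardware maximum before raising it.
RING_SIZE=4096

# Print (do not run) the nmcli command that would persist the tuning.
persist_cmd() {
    printf 'nmcli connection modify %s ethtool.ring-rx %s ethtool.ring-tx %s ethtool.feature-tso off ethtool.feature-gso off\n' \
        "$CONN" "$RING_SIZE" "$RING_SIZE"
}

persist_cmd
```

After running the printed command, the profile would need to be re-activated (for example with nmcli connection up) during a maintenance window, since that briefly interrupts the link.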
However, I want to ensure our network architecture isn’t fundamentally flawed and exacerbating the issue as you and Damian suggested.
Here is the output of our IP and bridge configurations on one of the affected worker nodes. Can you let me know if this looks like the “address out of the bridge” misconfiguration you suspected?
[root@sj-sv2-devop-80 ~]# ip -4 addr show enp17s0f0
[root@sj-sv2-devop-80 ~]#
[root@sj-sv2-devop-80 ~]# ip -4 addr show br0
13: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 10.96.32.31/23 brd 10.96.33.255 scope global noprefixroute br0
valid_lft forever preferred_lft forever
[root@sj-sv2-devop-80 ~]# bridge link
2: enp17s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100
2: enp17s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 hwmode VEPA
3: enp17s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 hwmode VEPA
16: one-104-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100
[root@sj-sv2-devop-80 ~]# nmcli con show
NAME             UUID                                  TYPE      DEVICE
br0              d03bc4fb-1804-4d80-a217-95a08aa3db81  bridge    br0
slave-enp17s0f0  aaf7715a-c7a9-4065-a1bf-09ddf1c62159  ethernet  enp17s0f0
lo               98a8a6c1-bdd7-48b5-8d16-61ad9d2fb780  loopback  lo
one-104-0        f83ecc20-6352-43d6-9ad8-559b252a3e87  tun       one-104-0
enp17s0f0        7e579761-89ee-4533-9f68-cba2a52dbb8f  ethernet  --
[root@sj-sv2-devop-80 ~]# ip link show enp17s0f0 | grep mtu
2: enp17s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
ARFS (ntuple-filters) Status:
[root@sj-sv2-devop-80 ~]# ethtool -k enp17s0f0 | grep ntuple
ntuple-filters: on
Please take a look and let me know if anything looks off to you.
Hey, I think that setting your MTU to 9000 (your switches must support it) will make a difference. It may not bring a great throughput improvement, but your servers and your network hardware will have to process fewer packets. As far as I remember, Ceph recommends jumbo frames by default.
As for window scaling and the rmem and wmem parameters, they do not look like they should be a big problem.
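To put a rough number on "fewer packets": the back-of-the-envelope calculation below estimates how many full-size frames per second a saturated 10 Gbps link carries at each MTU. It assumes 38 bytes of per-frame on-wire overhead (Ethernet header, FCS, preamble, inter-frame gap) and no VLAN tag; frames_per_sec is just an illustrative helper.

```shell
# Full-size frames per second at line rate for a given MTU.
# On-wire overhead per frame: 14 (Ethernet header) + 4 (FCS)
# + 8 (preamble) + 12 (inter-frame gap) = 38 bytes. No VLAN tag assumed.
frames_per_sec() {
    link_bps=$1
    mtu=$2
    echo $(( link_bps / 8 / (mtu + 38) ))
}

echo "MTU 1500: $(frames_per_sec 10000000000 1500) frames/s"
echo "MTU 9000: $(frames_per_sec 10000000000 9000) frames/s"
```

That works out to roughly 813k frames/s at MTU 1500 versus about 138k at MTU 9000, so close to a sixfold drop in the number of frames the NIC and driver must handle at the same throughput.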
That makes sense. Enabling Jumbo Frames to reduce the packet and interrupt load on the i40e driver seems like a reasonable next step alongside the ring buffer changes.
I’ll coordinate with our networking team to confirm whether Jumbo Frames are supported and enabled on the Top-of-Rack switches. Based on that discussion, we’ll evaluate updating the MTU to 9000 on the relevant interfaces and bridges.
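Once the switches and hosts are reconfigured, a don't-fragment ping between two nodes is a quick way to confirm that jumbo frames really pass end to end. A small sketch follows: the payload is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header, PEER_IP is a placeholder for another node's address, and the helper names are made up for illustration.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Print a Linux (iputils) ping command with the don't-fragment bit set.
# $1 = MTU to test, $2 = peer address (placeholder here).
jumbo_probe_cmd() {
    printf 'ping -M do -c 3 -s %s %s\n' "$(icmp_payload "$1")" "$2"
}

jumbo_probe_cmd 9000 PEER_IP
```

If the printed ping fails with a "message too long" style error or gets no replies while a normal ping works, some hop (NIC, bridge, tap device, or switch port) is still at a smaller MTU.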
Thanks again for taking the time to review the configuration and point us in the right direction. I appreciate the help.