OpenNebula nodes shut down when running a heavy legacy Jenkins build on the VMs

Environment:

  • Architecture: 3-node Hyperconverged Cluster (Compute + Storage on the same nodes)

  • Compute: OpenNebula (KVM/Libvirt) with HA enabled.

  • Storage: Ceph (2-way replication, NVMe backed).

  • Hardware: Supermicro servers

  • OS: RHEL 9.6

The Issue:

When a Jenkins VM runs a legacy build pipeline (a massive git clone followed by heavy compilation with make), the host goes down shortly after the build starts. It looks like the primary network interface crashes and resets, causing Ceph OSDs to flap, QEMU to lose block storage, and OpenNebula to mark the host as ERROR and evacuate the VM. The VM migrates to the next node, repeats the workload, and crashes that node as well.

Hardware monitoring shows that CPU, RAM, and NVMe utilization are all barely registering (iowait is ~0.00, disk util <1%).

The logs show:

May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: NETDEV WATCHDOG: CPU: 472: transmit queue 61 timed out 5584 ms
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout: VSI_seid: 390, Q 61, NTC: 0x150, HWB: 0x150, NTU: 0x174, TAIL: 0x174, INT: 0x1
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout recovery level 1, txqueue 61
May 18 04:21:34 kernel: i40e 0000:11:00.0 enp17s0f0: NIC Link is Down
May 18 04:21:40 kernel: i40e 0000:11:00.0 enp17s0f0: NIC Link is Up, 10 Gbps Full Duplex

May 18 04:21:26 ceph-osd[12166]: osd.4 845 heartbeat_check: no reply from 10.96.xx.xx:6832 osd.11
May 19 02:27:55 virtqemud[8829]: internal error: QEMU unexpectedly closed the monitor (vm='one-115'): ... error connecting: Connection timed out

I am not sure whether this issue is caused by Ceph or OpenNebula, or whether some specific configuration is missing.
Any help on this issue would be greatly appreciated!

Regards,

Laxman Singh Ahirwar

Hi Laxman,

The most suspicious indication is the tx_timeout. Assuming that the log snippet comes from the host that went into ERROR, it usually means there is some obstruction further down the line. I would recommend analyzing the network; maybe port buffer starvation on a switch?
At first glance the TX timeout looks like backpressure, not a fault in the host or its NIC.
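
One way to look for backpressure from the host side is to check the NIC's own counters. The following is a minimal sketch, assuming the driver exposes its statistics through ethtool -S; the function name and the name patterns (drop, err, pause, xoff, xon, timeout) are illustrative guesses, and the actual counter names vary by driver.

```shell
#!/usr/bin/env bash
# Sketch: keep only non-zero counters from "ethtool -S"-style input whose
# names hint at drops, errors or flow control. The patterns are assumptions,
# not an official list of i40e counter names.
nonzero_pressure_counters() {
    awk -F': *' '
        tolower($1) ~ /drop|err|pause|xoff|xon|timeout/ && $2 + 0 > 0 {
            gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 "=" $2
        }'
}

# Usage on a host (interface name taken from the logs in this thread):
#   ethtool -S enp17s0f0 | nonzero_pressure_counters
```

Running it before and after a build and comparing the two outputs would show which counters move during the burst. Growing flow-control counters would point at the switch side, while growing drop counters on the host would point at the ring buffers.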

Hello, Laxman, what is your networking configuration?

This sounds consistent with using a bridged network on a physical interface that has an address set up “out of the bridge”, i.e. on the physical interface itself rather than on the bridge.

Thanks

Hi Bruno,

Thank you for the insight! This is a very interesting point and might explain the architectural root cause of the pressure on the NIC.

Just to give you an update, we were able to successfully stabilize the nodes and stop the cascading crashes by applying ethtool tuning to the i40e interface (ethtool -K enp17s0f0 tso off gso off and ethtool -G enp17s0f0 rx 4096 tx 4096). The hardware ring buffers were defaulting to 512, and the bursts of Ceph replication traffic from the Jenkins build were triggering a soft lockup and tx_timeout in the i40e driver. The tuning acts as a shock absorber and the nodes survive the pipeline now.
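
Settings applied with ethtool -K and ethtool -G are not persistent, so they are lost on reboot or when the interface is re-activated. The following is a minimal sketch of one way to persist them through NetworkManager's ethtool.* connection properties, assuming the bridge port profile is the slave-enp17s0f0 connection shown in the nmcli output; the helper names run and persist_nic_tuning are made up for illustration. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: store the ethtool tuning in the NetworkManager profile of the
# bridge port so it is reapplied on every activation.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

persist_nic_tuning() {
    local profile="$1"   # NetworkManager connection name (assumption)
    local ring="$2"      # ring size for both RX and TX
    # Equivalent of: ethtool -K <dev> tso off gso off
    run nmcli connection modify "$profile" ethtool.feature-tso off ethtool.feature-gso off
    # Equivalent of: ethtool -G <dev> rx <ring> tx <ring>
    run nmcli connection modify "$profile" ethtool.ring-rx "$ring" ethtool.ring-tx "$ring"
    # Re-activate the profile so the stored settings take effect
    run nmcli connection up "$profile"
}

persist_nic_tuning slave-enp17s0f0 4096
```

Re-activating a bridge port briefly interrupts traffic on that node, so this is best done one node at a time with the VMs migrated away.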

However, I want to ensure our network architecture isn’t fundamentally flawed and exacerbating the issue as you and Damian suggested.

Here is the output of our IP and bridge configurations on one of the affected worker nodes. Can you let me know if this looks like the “address out of the bridge” misconfiguration you suspected?

[root@sj-sv2-devop-80 ~]# ip -4 addr show enp17s0f0
[root@sj-sv2-devop-80 ~]# 

[root@sj-sv2-devop-80 ~]# ip -4 addr show br0
13: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 10.96.32.31/23 brd 10.96.33.255 scope global noprefixroute br0
       valid_lft forever preferred_lft forever

[root@sj-sv2-devop-80 ~]# bridge link
2: enp17s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100 
2: enp17s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 hwmode VEPA 
3: enp17s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 hwmode VEPA 
16: one-104-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100 

[root@sj-sv2-devop-80 ~]# nmcli con show
NAME                UUID                                  TYPE      DEVICE
br0                 d03bc4fb-1804-4d80-a217-95a08aa3db81  bridge    br0
slave-enp17s0f0     aaf7715a-c7a9-4065-a1bf-09ddf1c62159  ethernet  enp17s0f0
lo                  98a8a6c1-bdd7-48b5-8d16-61ad9d2fb780  loopback  lo
one-104-0           f83ecc20-6352-43d6-9ad8-559b252a3e87  tun       one-104-0
enp17s0f0           7e579761-89ee-4533-9f68-cba2a52dbb8f  ethernet  --

Thanks again for your help in tracking this down!

Hello,

About my previous message: forget it. NetworkManager keeps that configuration correct.

The ring buffer size is probably the key factor, but there are some other network configuration parameters to check:

  • I assume that sysctl net.ipv4.tcp_window_scaling is already on, and that the MTU is as large as possible (9000 recommended).
  • Check the values of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem. Depending on the number of connections, larger values can increase memory usage.
  • If your network card supports ARFS, enabling it may help as well.

Cheers!

Hi Bruno,

Thanks for confirming the bridge setup! Good to know we are on solid ground there.

Based on your suggestions I checked our current baseline for those parameters, and the results are below:

TCP Scaling & Buffers:

[root@sj-sv2-devop-80 ~]# sysctl net.ipv4.tcp_window_scaling net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096        131072  6291456
net.ipv4.tcp_wmem = 4096        16384   4194304

Current MTU:

[root@sj-sv2-devop-80 ~]# ip link show enp17s0f0 | grep mtu
2: enp17s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000

ARFS (ntuple-filters) Status:

[root@sj-sv2-devop-80 ~]# ethtool -k enp17s0f0 | grep ntuple
ntuple-filters: on
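
Note that ntuple-filters: on is only one prerequisite for ARFS. According to the kernel's network scaling documentation, RFS also needs non-zero flow table sizes in rps_sock_flow_entries and in each RX queue's rps_flow_cnt, plus driver support; whether the i40e driver provides that support is not confirmed here. A minimal sketch for reading those values follows; the function name rfs_report and its optional second argument (a path prefix, used only for testing) are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: print the RFS table sizes that accelerated RFS depends on.
# A value of 0 means RFS is not configured for that table.
rfs_report() {
    local dev="$1"
    local root="${2:-}"   # optional path prefix, only useful for testing
    local f
    echo "rps_sock_flow_entries=$(cat "$root/proc/sys/net/core/rps_sock_flow_entries" 2>/dev/null || echo unknown)"
    for f in "$root/sys/class/net/$dev/queues"/rx-*/rps_flow_cnt; do
        [ -r "$f" ] || continue   # skip if the glob matched nothing
        echo "$(basename "$(dirname "$f")") rps_flow_cnt=$(cat "$f")"
    done
}

rfs_report enp17s0f0
```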

Please take a look and let me know if anything looks odd to you.

Cheers!

Hey, I think that setting your MTU to 9000 (your switches must support it) will make a difference. It may not bring a great throughput improvement, but your servers and your network hardware will have to process fewer packets. From what I remember, Ceph recommends jumbo frames by default.

As for the window scaling, rmem, and wmem parameters, they do not look like they should be a big problem.

Cheers!

Hi Bruno,

That makes sense. Enabling Jumbo Frames to reduce the packet and interrupt load on the i40e driver seems like a reasonable next step alongside the ring buffer changes.

I’ll coordinate with our networking team to confirm whether Jumbo Frames are supported and enabled on the Top-of-Rack switches. Based on that discussion, we’ll evaluate updating the MTU to 9000 on the relevant interfaces and bridges.
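
Once the MTU has been raised on the switches, the physical port, and the bridge, the path can be checked end to end with a ping that forbids fragmentation. The following is a minimal sketch, assuming IPv4 without IP options: the largest ICMP payload is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header, so 8972 bytes for an MTU of 9000. The function names and the PEER_STORAGE_IP placeholder are hypothetical; the script only prints the ping command.

```shell
#!/usr/bin/env bash
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU - 20 (IPv4 header without options) - 8 (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Print (not run) a ping that sets the Don't Fragment bit (-M do), so an
# undersized hop on the path makes the ping fail instead of fragmenting.
jumbo_ping_cmd() {
    local mtu="$1" peer="$2"
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$mtu") $peer"
}

# PEER_STORAGE_IP is a placeholder for another node's address on the Ceph network.
jumbo_ping_cmd 9000 PEER_STORAGE_IP
```

If that ping fails between two nodes while a payload of 1472 (the MTU 1500 equivalent) succeeds, some hop on the path is still limited to the smaller MTU.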

Thanks again for taking the time to review the configuration and point us in the right direction. I appreciate the help.

Best regards,

Laxman