Environment:
- Architecture: 3-node hyperconverged cluster (compute + storage on the same nodes)
- Compute: OpenNebula (KVM/Libvirt) with HA enabled
- Storage: Ceph (2-way replication, NVMe backed)
- Hardware: Supermicro servers
- OS: RHEL 9.6
The Issue:
When a Jenkins VM runs a legacy build pipeline (a massive git clone followed by heavy compilation with make), the host goes down shortly after the workload starts. It looks like the primary network interface crashes and resets, which causes the Ceph OSDs to flap, QEMU to lose its block storage, and OpenNebula to mark the host as ERROR and evacuate the VM. The VM migrates to the next node, repeats the workload, and crashes that node as well.
Hardware monitoring shows that CPU, RAM, and NVMe utilization are all very low (iowait is ~0.00, disk util <1%).
The logs show:
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: NETDEV WATCHDOG: CPU: 472: transmit queue 61 timed out 5584 ms
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout: VSI_seid: 390, Q 61, NTC: 0x150, HWB: 0x150, NTU: 0x174, TAIL: 0x174, INT: 0x1
May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout recovery level 1, txqueue 61
May 18 04:21:34 kernel: i40e 0000:11:00.0 enp17s0f0: NIC Link is Down
May 18 04:21:40 kernel: i40e 0000:11:00.0 enp17s0f0: NIC Link is Up, 10 Gbps Full Duplex
May 18 04:21:26 ceph-osd[12166]: osd.4 845 heartbeat_check: no reply from 10.96.xx.xx:6832 osd.11
May 19 02:27:55 virtqemud[8829]: internal error: QEMU unexpectedly closed the monitor (vm='one-115'): ... error connecting: Connection timed out
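To make the tx_timeout line above easier to compare across hosts and incidents, here is a minimal Python sketch that pulls the ring indices out of it. The field meanings are an assumption on my part (NTC as "next to clean", NTU as "next to use"), the names `parse_tx_timeout` and `pending` are made up for illustration, and the subtraction ignores ring wraparound.

```python
import re

# Field layout copied from the i40e tx_timeout line quoted in the logs above.
TX_TIMEOUT_RE = re.compile(
    r"tx_timeout: VSI_seid: (?P<vsi>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-fA-F]+), HWB: (?P<hwb>0x[0-9a-fA-F]+), "
    r"NTU: (?P<ntu>0x[0-9a-fA-F]+), TAIL: (?P<tail>0x[0-9a-fA-F]+)"
)


def parse_tx_timeout(line):
    """Return the queue number and ring indices from one log line, or None."""
    match = TX_TIMEOUT_RE.search(line)
    if match is None:
        return None
    info = {"vsi": int(match["vsi"]), "queue": int(match["queue"])}
    for key in ("ntc", "hwb", "ntu", "tail"):
        info[key] = int(match[key], 16)  # values are logged as hex
    # Assumption: descriptors between NTC and NTU were queued but not yet
    # cleaned. Wraparound of the ring is not handled in this sketch.
    info["pending"] = info["ntu"] - info["ntc"]
    return info


SAMPLE = (
    "May 18 04:21:10 kernel: i40e 0000:11:00.0 enp17s0f0: tx_timeout: "
    "VSI_seid: 390, Q 61, NTC: 0x150, HWB: 0x150, NTU: 0x174, "
    "TAIL: 0x174, INT: 0x1"
)

if __name__ == "__main__":
    print(parse_tx_timeout(SAMPLE))
```

For the line quoted above this reports queue 61 with 36 descriptors between NTC and NTU, which, if the assumption about the fields holds, would suggest the queue still had work posted when the watchdog fired.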
I'm not sure whether this issue is caused by Ceph or OpenNebula, or whether some specific configuration is missing.
Any help on this issue would be greatly appreciated!
Regards,
Laxman Singh Ahirwar