OneKE kube-apiservers down

Hello,

I have a bunch of independent clusters deployed, and all the master nodes currently have the same problem.
After deploying everything I needed on the clusters and letting them run for a couple of weeks without any problems, all the master nodes started printing the following during the scheduled restart of the rke2-server service:

INFO[0721] Container for etcd is running
INFO[0721] Container for kube-apiserver not found (no matching container found), retrying
INFO[0725] Waiting for API server to become available
INFO[0741] Container for etcd is running
INFO[0741] Container for kube-apiserver not found (no matching container found), retrying
INFO[0741] Waiting for API server to become available
INFO[0755] Waiting for API server to become available
.
.
.
INFO[0901] Container for etcd is running
INFO[0901] Container for kube-apiserver not found (no matching container found), retrying
INFO[0905] Waiting for API server to become available
FATA[0921] Failed to get request handlers from apiserver: timed out waiting for the condition, failed to get apiserver /readyz status: Get "https://127.0.0.1:6443/readyz": dial tcp 127.0.0.1:6443: connect: connection refused

This error makes managing the clusters impossible.

Any help would be appreciated.
Thanks


Versions of the related components:
rke2 version v1.27.2+rke2r1 (300a06dabe679c779970112a9cb48b289c17536c)
go version go1.20.4 X:boringcrypto

Hi @SysAdminHorror,

Could you double-check your resource usage, e.g. that your storage isn’t full and that the master VMs have at least 3 GB of RAM? :thinking:

Hello @mopala,

Everything here seems fine to me:

root@oneke-ip-10-100-100-12:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           297M   11M  287M   4% /run
/dev/vda1        20G   14G  6.1G  69% /
tmpfs           1.5G     0  1.5G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/vda15      105M  6.1M   99M   6% /boot/efi
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/a4505f79cf0979a6df9f353f38e34aafddb97c02e78d1a0958ef5b0c4d24992f/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/2b6d16ef348379e1fe3802f493e6c20928320a214783fe6c6b447338a85c17d3/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/02e7bd1592a7ee6aae429b53b7d1c904ee2ea0461cff8fb493c1a210ffd7bbd2/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/1017de7cfacbc51e474c847b556e385806f0638af8914745b0df4a20e12713d2/shm
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/2b6d16ef348379e1fe3802f493e6c20928320a214783fe6c6b447338a85c17d3/rootfs
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/1017de7cfacbc51e474c847b556e385806f0638af8914745b0df4a20e12713d2/rootfs
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/02e7bd1592a7ee6aae429b53b7d1c904ee2ea0461cff8fb493c1a210ffd7bbd2/rootfs
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a4505f79cf0979a6df9f353f38e34aafddb97c02e78d1a0958ef5b0c4d24992f/rootfs
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/80f57aea675e55ddbfea707f86aade78c9a7d1e5ff89b54ad5c8a24910db791e/shm
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/80f57aea675e55ddbfea707f86aade78c9a7d1e5ff89b54ad5c8a24910db791e/rootfs
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a6d34d9aee2ed58089e4608ed4d866571baab13baeb97646a197643fc00f408d/rootfs
tmpfs           2.9G     0  2.9G   0% /var/lib/kubelet/pods/8002c985-a8c4-422f-9a74-9d7fcfee2dda/volumes/kubernetes.io~secret/clustermesh-secrets
tmpfs           2.9G   12K  2.9G   1% /var/lib/kubelet/pods/0e04a51e-2d5c-4e61-b52e-8c9df180c199/volumes/kubernetes.io~projected/kube-api-access-tv9ps
tmpfs           2.9G   12K  2.9G   1% /var/lib/kubelet/pods/8002c985-a8c4-422f-9a74-9d7fcfee2dda/volumes/kubernetes.io~projected/kube-api-access-f8smf
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/e485e10ebbf19c6014612026198706b355abb7aa5d6a5c370d00d41854ba9345/shm
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/e485e10ebbf19c6014612026198706b355abb7aa5d6a5c370d00d41854ba9345/rootfs
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/d8e28c8a16475702341d7171a15657911aaec0d29e84a688e332928a09a5d3ff/shm
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/d8e28c8a16475702341d7171a15657911aaec0d29e84a688e332928a09a5d3ff/rootfs
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/ddaa083262af03ece8773a2bc92c076a05560ce2aa629ac6cdf5f4d54b92a903/rootfs
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/9d4cc9427a5face694f1c795a18a9a3950b181f3400b099c13479010a3c320e1/rootfs
tmpfs           2.9G   12K  2.9G   1% /var/lib/kubelet/pods/a1cb0cd6-306d-4aaa-8a2b-454da622681e/volumes/kubernetes.io~projected/kube-api-access-4qbq5
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/530582ae8360a1970c632d86a4a0ab498cb7a6d7b602f7b487e6efeabf703706/shm
overlay          20G   14G  6.1G  69% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/530582ae8360a1970c632d86a4a0ab498cb7a6d7b602f7b487e6efeabf703706/rootfs
tmpfs           297M     0  297M   0% /run/user/0
root@oneke-ip-10-100-100-12:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           2.9Gi       470Mi       330Mi        10Mi       2.1Gi       2.2Gi
Swap:             0B          0B          0B

In that case you could check in the VNF whether the HAProxy instance has all the backends configured in /etc/haproxy/haproxy.cfg. I suspect this may be the problem; in that case you may want to replace the VNF image with the one from the latest OneKE in the marketplace and replace all VNF instances. :thinking:
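
You can also ask HAProxy directly which servers it considers up through its runtime socket, for example (just a sketch, assuming the stats socket is enabled at /var/run/haproxy.sock; adjust the path to whatever your haproxy.cfg declares):

$ echo "show stat" | socat stdio /var/run/haproxy.sock | cut -d',' -f1,2,18 | column -s',' -t
# columns: proxy name, server name, status (UP/DOWN)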

This is the current configuration of the VNF HAProxy backends:

localhost:~# cat /etc/haproxy/haproxy.cfg
.
.
.
backend 31302e39352e38322e38303a39333435
    mode tcp
    balance roundrobin
    option tcp-check
backend 31302e39352e38322e38303a36343433
    mode tcp
    balance roundrobin
    option tcp-check
    server 31302e3130302e3130302e31323a36343433 10.100.100.12:6443 check observe layer4 error-limit 50 on-error mark-down
backend 31302e39352e38322e38303a343433
    mode tcp
    balance roundrobin
    option tcp-check
    server 31302e3130302e3130302e31393a3332343433 10.100.100.19:32443 check observe layer4 error-limit 50 on-error mark-down
    server 31302e3130302e3130302e31333a3332343433 10.100.100.13:32443 check observe layer4 error-limit 50 on-error mark-down
    server 31302e3130302e3130302e31363a3332343433 10.100.100.16:32443 check observe layer4 error-limit 50 on-error mark-down
backend 31302e39352e38322e38303a3830
    mode tcp
    balance roundrobin
    option tcp-check
    server 31302e3130302e3130302e31393a3332303830 10.100.100.19:32080 check observe layer4 error-limit 50 on-error mark-down
    server 31302e3130302e3130302e31333a3332303830 10.100.100.13:32080 check observe layer4 error-limit 50 on-error mark-down
    server 31302e3130302e3130302e31363a3332303830 10.100.100.16:32080 check observe layer4 error-limit 50 on-error mark-down

Is something misconfigured in this file? :thinking:

So this is the old VNF image for sure; we’ve replaced it completely with a new one (vr_balancing · OpenNebula/one-apps Wiki · GitHub). But if you have just one master and 3 nodes, then it seems to be correct. I guess you need to look for problems in the RKE2 logs themselves. :thinking:
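
For a start (just a sketch, assuming the default RKE2 locations), the service journal and the containerd/kubelet logs on a master are the usual places to look:

$ journalctl -u rke2-server --no-pager | tail -n 200
$ tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
$ tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log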

The first logs in this thread are from rke2-server, and I don’t see anything in them that rings a bell for me. The only thing I know is that the service is not able to get the kube-apiserver image, which is strange because the rest of the images needed to run the service are pulled without problems.
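
One thing I can still try is checking whether the kube-apiserver container ever gets created at all, something along these lines (just a sketch; <container-id> is a placeholder and I’m assuming the default RKE2 paths):

$ crictl ps -a --name kube-apiserver
# if an exited container shows up, check why it died:
$ crictl logs <container-id>
# the static pod manifest should also be present:
$ ls -l /var/lib/rancher/rke2/agent/pod-manifests/kube-apiserver.yaml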

Hi @SysAdminHorror,

That’s the image in a working cluster:

  kube-apiserver:
    Image:         index.docker.io/rancher/hardened-kubernetes:v1.27.2-rke2r1-build20230518

You can list images using crictl images on the master.

docker.io/rancher/hardened-kubernetes    v1.27.2-rke2r1-build20230518   77a5bb5822f66       217MB
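
If crictl isn’t on the PATH on the master, something like this should work (a sketch, assuming the default RKE2 install paths):

$ /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock images | grep hardened-kubernetes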

You can try pulling it manually, I guess :thinking:

$ crictl pull docker.io/rancher/hardened-kubernetes:v1.27.2-rke2r1-build20230518
Image is up to date for sha256:77a5bb5822f668bac88c0722c9fa1dd210efef9a5c9896c73cf37bd0859e87d2

But I don’t think that will help. I’d rather try to verify whether the LB is actually operational, and if it isn’t, I’d try replacing the VNF image with the latest one, which has a completely new and much simpler implementation.
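
A quick way to check the LB path is to hit the Kubernetes and RKE2 supervisor ports through the VNF address instead of 127.0.0.1 (a sketch; <vnf-ip> is a placeholder for your VNF/VIP address):

$ nc -vz <vnf-ip> 6443
$ nc -vz <vnf-ip> 9345
# compare with hitting the master directly:
$ curl -k https://10.100.100.12:6443/readyz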

Hi @mopala,

First of all, thank you for the help.

Given that this image is available on the machine, I’ll try to check the LB and the VNF.

root@oneke-ip-10-100-100-12:~# crictl images | grep build20230518
docker.io/rancher/hardened-kubernetes                                v1.27.2-rke2r1-build20230518               77a5bb5822f66       695MB
root@oneke-ip-10-100-100-12:~# crictl pull docker.io/rancher/hardened-kubernetes:v1.27.2-rke2r1-build20230518
Image is up to date for sha256:77a5bb5822f668bac88c0722c9fa1dd210efef9a5c9896c73cf37bd0859e87d2