I have a number of independent clusters deployed, and all of the master nodes currently have the same problem.
After deploying everything I needed on the clusters and letting them run for a couple of weeks without any issues, all the master nodes started logging the following during the scheduled restart of the rke2-server service:
INFO[0721] Container for etcd is running
INFO[0721] Container for kube-apiserver not found (no matching container found), retrying
INFO[0725] Waiting for API server to become available
INFO[0741] Container for etcd is running
INFO[0741] Container for kube-apiserver not found (no matching container found), retrying
INFO[0741] Waiting for API server to become available
INFO[0755] Waiting for API server to become available
.
.
.
INFO[0901] Container for etcd is running
INFO[0901] Container for kube-apiserver not found (no matching container found), retrying
INFO[0905] Waiting for API server to become available
FATA[0921] Failed to get request handlers from apiserver: timed out waiting for the condition, failed to get apiserver /readyz status: Get "https://127.0.0.1:6443/readyz": dial tcp 127.0.0.1:6443: connect: connection refused
This error makes managing the clusters impossible.
Any help would be appreciated,
Thanks
Versions of the related components:
rke2 version v1.27.2+rke2r1 (300a06dabe679c779970112a9cb48b289c17536c)
go version go1.20.4 X:boringcrypto
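For anyone hitting the same symptom, a couple of quick local checks on an affected master can show whether the kube-apiserver container is ever created at all (just a sketch; it assumes crictl is already pointed at RKE2's containerd, as in the commands later in this thread):
# Any kube-apiserver container, even an exited one?
$ crictl ps -a --name kube-apiserver
# Is anything answering on the local API endpoint?
$ curl -k https://127.0.0.1:6443/readyz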
In that case you could check in the VNF whether the HAProxy instance has all the backends configured in /etc/haproxy/haproxy.cfg. I suspect this may be the problem; if so, you may want to replace the VNF image with the latest OneKE one from the marketplace and replace all VNF instances.
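As a rough sketch of that check (standard HAProxy paths; the stats socket path is an assumption and should match whatever "stats socket" line the config actually declares):
# On the VNF: list the configured backends and their servers
$ grep -nE 'backend|server' /etc/haproxy/haproxy.cfg
# Validate the configuration file
$ haproxy -c -f /etc/haproxy/haproxy.cfg
# If a stats socket is enabled, dump the live backend/server state
$ echo "show stat" | socat stdio /run/haproxy/admin.sock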
So this is the old VNF image for sure; we've replaced it completely with a new one: vr_balancing · OpenNebula/one-apps Wiki · GitHub. But if you have just one master and 3 nodes, then it seems to be correct. I guess you need to look for problems in the RKE2 logs themselves.
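For reference, these are the usual places RKE2 writes logs on a server node (default data dir assumed):
# Supervisor (rke2-server) journal
$ journalctl -u rke2-server --no-pager | tail -n 200
# Kubelet log (RKE2 keeps it on disk rather than in the journal)
$ tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log
# Per-container logs of static pods, if any were ever created
$ ls /var/log/pods/ | grep kube-apiserver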
The first logs in this thread are from rke2-server; I don't see anything in them that rings a bell. The only thing I know is that the service is not able to get the kube-apiserver image, which is strange because the rest of the images needed to run the service are pulled without problems.
$ crictl pull docker.io/rancher/hardened-kubernetes:v1.27.2-rke2r1-build20230518
Image is up to date for sha256:77a5bb5822f668bac88c0722c9fa1dd210efef9a5c9896c73cf37bd0859e87d2
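Since the image is clearly present, the more telling question is whether the kube-apiserver static pod is ever created: rke2-server writes the manifest and the kubelet starts it, so both ends are worth checking (default RKE2 paths assumed):
# Was the static pod manifest generated?
$ ls -l /var/lib/rancher/rke2/agent/pod-manifests/
# Did the kubelet try (and fail) to start it?
$ grep -i kube-apiserver /var/lib/rancher/rke2/agent/logs/kubelet.log | tail -n 50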
But I don't think that will help. I'd rather try to verify whether the LB is actually operational, and if it isn't, I'd try replacing the VNF image with the latest one, which has a completely new and much simpler implementation.
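A quick way to verify that from one of the masters (the address below is a placeholder; use whatever server URL is configured in /etc/rancher/rke2/config.yaml):
# Can the node reach the API server port through the LB?
$ curl -vk https://<LB-VIP>:6443/readyz
# Is the RKE2 supervisor port reachable through the LB?
$ nc -zv <LB-VIP> 9345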
Given that this image is already available on the machine, I'll go check the LB and the VNF.
root@oneke-ip-10-100-100-12:~# crictl images | grep build20230518
docker.io/rancher/hardened-kubernetes v1.27.2-rke2r1-build20230518 77a5bb5822f66 695MB
root@oneke-ip-10-100-100-12:~# crictl pull docker.io/rancher/hardened-kubernetes:v1.27.2-rke2r1-build20230518
Image is up to date for sha256:77a5bb5822f668bac88c0722c9fa1dd210efef9a5c9896c73cf37bd0859e87d2