Replies: 5 comments 6 replies
-
I believe this is a duplicate of #6208. Please reopen if you don't see this resolved when using a July release.
-
Hi @brandond, thanks for looking into this. I still see the same issue in the latest release.
Instead of removing the etcd member, I filled up the disk on a server to reproduce the error. The difference is that:
Steps To Reproduce
Here, node3 (100.64.0.30) is the connected API server.
The client node
-
I don't have permission to reopen. Could you please reopen this issue?
-
I'm not sure that we're going to get a LOT better than that. Internal monitoring, statistics collection, and health checks only run periodically - generally about once a minute. If the server continues operating in a degraded state for a short period, the agent is not going to fail over immediately. It would probably be more productive to work on ensuring that your nodes have sufficient resources and you have alerting in place to warn you when disks are being filled. Another best practice commonly observed for performance purposes is to put the etcd datastore on a dedicated disk or partition, to isolate it from workload and image store disk IO.
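To make the alerting suggestion concrete, here is a minimal standalone sketch (not something RKE2 ships) that checks how full the filesystem backing the RKE2 datastore is; the path and the 85% threshold are assumptions you would adapt, for example to a dedicated etcd partition:

```go
// Linux-only disk-usage check that could feed whatever alerting you already
// run. Path and threshold are illustrative assumptions, not RKE2 defaults
// you must use.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	path := "/var/lib/rancher/rke2/server/db" // assumed datastore location; adjust to your layout

	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		fmt.Fprintf(os.Stderr, "statfs %s: %v\n", path, err)
		os.Exit(1)
	}

	usedPct := 100 * float64(st.Blocks-st.Bavail) / float64(st.Blocks)
	if usedPct > 85 {
		fmt.Printf("WARNING: %s is %.1f%% full\n", path, usedPct)
		os.Exit(2)
	}
	fmt.Printf("%s is %.1f%% full\n", path, usedPct)
}
```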
-
Hi @brandond, thanks for the quick response. The same error can happen when a disk fails and the filesystem immediately becomes read-only, so we cannot prevent this error from happening. When it does, one-third of the agent nodes become "NotReady". K8s has a config option for this: the kube-apiserver pod liveness probes define when an apiserver instance should be removed from service endpoints. Could the rke2 load balancer use this as well?
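To illustrate what is being asked for, here is a minimal sketch of an active health check; this is my assumption of what such a check could look like, not RKE2's actual load balancer code. It probes each kube-apiserver's /readyz endpoint and treats a timeout or non-200 response as unhealthy. The first two server URLs are placeholders (only node3's address appears in the repro), and depending on cluster configuration /readyz may require client credentials, which are omitted here:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func apiserverReady(baseURL string) bool {
	client := &http.Client{
		// Fail fast instead of waiting on a hung or degraded server.
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Skipping verification keeps the sketch short; a real check
			// would trust the cluster CA instead.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(baseURL + "/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	servers := []string{
		"https://100.64.0.10:6443", // placeholder
		"https://100.64.0.20:6443", // placeholder
		"https://100.64.0.30:6443", // node3 from the repro
	}
	for _, s := range servers {
		fmt.Printf("%s ready=%v\n", s, apiserverReady(s))
	}
}
```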
-
Environmental Info:
RKE2 Version:
v1.28.11+rke2r1
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
3 servers, 1 agent
Describe the bug:
The rke2 agent load balancer has no health check. When an API server becomes unhealthy but still accepts TCP connections, the connected agent keeps using the TCP connection to the unhealthy API server for some time (~1 minute). The agent node becomes "NotReady" during this time and eventually recovers automatically (after ~1 minute).
The desired behavior is that the agent switches to a healthy API server immediately after the connected API server becomes unhealthy, so the agent node never becomes "NotReady" during the switch.
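A minimal sketch of the failure mode described above, using the node3 address from this repro: a bare TCP dial to the connected server can still succeed while the apiserver behind it is unable to serve requests, so a TCP-level check alone cannot trigger failover.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "100.64.0.30:6443" // node3, the connected API server

	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		fmt.Println("TCP dial failed:", err)
		return
	}
	conn.Close()
	// A successful dial says nothing about whether the apiserver can
	// actually answer requests (e.g. when its etcd backend sits on a full
	// or read-only disk).
	fmt.Println("TCP dial succeeded; the apiserver may still be unhealthy")
}
```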
Steps To Reproduce:
Here, node3 (100.64.0.30) is the connected API server.
The etcd member 15e000cbe89e5629 is removed:
The API server on node3 becomes unhealthy immediately:
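For reference, the member-removal step is equivalent to `etcdctl member remove 15e000cbe89e5629`; a rough Go sketch of the same call via the etcd clientv3 API (endpoint assumed, TLS client credentials omitted for brevity) could look like:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Member ID from the repro above, given in hex.
	id, err := strconv.ParseUint("15e000cbe89e5629", 16, 64)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed endpoint; a real RKE2 etcd listens over TLS and requires the
	// etcd client certificates, which are omitted from this sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://100.64.0.30:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.MemberRemove(ctx, id); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("removed etcd member %x\n", id)
}
```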
Expected behavior:
The agent node (node4) should immediately reconnect to a healthy API server.
Actual behavior:
The agent node becomes "NotReady" for some time:
Additional context:
We found this issue in v1.28.6+rke2r1 when a server node had disk full errors. However, in v1.28.6+rke2r1, the agents were stuck in "NotReady" and could not recover until the unhealthy API server was killed.