Replies: 5 comments 6 replies
-
I believe this is a duplicate of #6208. Please reopen if you don't see this resolved when using a July release.
-
Hi @brandond, thanks for looking into this. I still see the same issue in the latest release.
Instead of removing the etcd member, I filled up the disk on a server to reproduce the error. The difference is that:
Steps To Reproduce
Here, node3 (100.64.0.30) is the connected API server.
The client node
-
I don't have permission to reopen. Could you please reopen this issue?
-
I'm not sure that we're going to get a LOT better than that. Internal monitoring, statistics collection, and health checks only run periodically - generally about once a minute. If the server continues operating in a degraded state for a short period, the agent is not going to fail over immediately. It would probably be more productive to work on ensuring that your nodes have sufficient resources and you have alerting in place to warn you when disks are being filled. Another best practice commonly observed for performance purposes is to put the etcd datastore on a dedicated disk or partition, to isolate it from workload and image store disk IO.
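To make the alerting suggestion concrete, here is a minimal standalone sketch (not something RKE2 ships) that checks how full the filesystem backing the RKE2 datastore is; the path and the 85% threshold are assumptions you would adapt, for example to a dedicated etcd partition:

```go
// Linux-only disk-usage check that could feed whatever alerting you already
// run. Path and threshold are illustrative assumptions, not RKE2 defaults
// you must use.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	path := "/var/lib/rancher/rke2/server/db" // assumed datastore location; adjust to your layout

	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		fmt.Fprintf(os.Stderr, "statfs %s: %v\n", path, err)
		os.Exit(1)
	}

	usedPct := 100 * float64(st.Blocks-st.Bavail) / float64(st.Blocks)
	if usedPct > 85 {
		fmt.Printf("WARNING: %s is %.1f%% full\n", path, usedPct)
		os.Exit(2)
	}
	fmt.Printf("%s is %.1f%% full\n", path, usedPct)
}
```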
-
Hi @brandond, thanks for the quick response. The same error can happen when a disk fails and the filesystem immediately becomes read-only, so we cannot prevent this error from happening. When it does, one-third of the agent nodes become "NotReady". K8s has a config option for this: the kube-apiserver pod liveness probes define when an apiserver instance should be removed from service endpoints. Could the rke2 load balancer use this as well?
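To illustrate what is being asked for, here is a minimal sketch of an active health check; this is my assumption of what such a check could look like, not RKE2's actual load balancer code. It probes each kube-apiserver's /readyz endpoint and treats a timeout or non-200 response as unhealthy. The first two server URLs are placeholders (only node3's address appears in the repro), and depending on cluster configuration /readyz may require client credentials, which are omitted here:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func apiserverReady(baseURL string) bool {
	client := &http.Client{
		// Fail fast instead of waiting on a hung or degraded server.
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Skipping verification keeps the sketch short; a real check
			// would trust the cluster CA instead.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(baseURL + "/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	servers := []string{
		"https://100.64.0.10:6443", // placeholder
		"https://100.64.0.20:6443", // placeholder
		"https://100.64.0.30:6443", // node3 from the repro
	}
	for _, s := range servers {
		fmt.Printf("%s ready=%v\n", s, apiserverReady(s))
	}
}
```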
-
Environmental Info:
RKE2 Version:
v1.28.11+rke2r1
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
3 servers, 1 agent
Describe the bug:
The rke2 agent load balancer has no health check. When an API server becomes unhealthy but still accepts TCP connections, the connected agent keeps using the TCP connection to the unhealthy API server for some time (~1 minute). The agent node becomes "NotReady" during this time and eventually recovers automatically (after ~1 minute).
The desired behavior is that the agent switches to a healthy API server immediately after the connected API server becomes unhealthy, so the agent node never becomes "NotReady" during the switch.
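A minimal sketch of the failure mode described above, using the node3 address from this repro: a bare TCP dial to the connected server can still succeed while the apiserver behind it is unable to serve requests, so a TCP-level check alone cannot trigger failover.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "100.64.0.30:6443" // node3, the connected API server

	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		fmt.Println("TCP dial failed:", err)
		return
	}
	conn.Close()
	// A successful dial says nothing about whether the apiserver can
	// actually answer requests (e.g. when its etcd backend sits on a full
	// or read-only disk).
	fmt.Println("TCP dial succeeded; the apiserver may still be unhealthy")
}
```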
Steps To Reproduce:
Here, node3 (100.64.0.30) is the connected API server.
The etcd member 15e000cbe89e5629 is removed:
The API server on node3 becomes unhealthy immediately:
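For reference, the member-removal step is equivalent to `etcdctl member remove 15e000cbe89e5629`; a rough Go sketch of the same call via the etcd clientv3 API (endpoint assumed, TLS client credentials omitted for brevity) could look like:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Member ID from the repro above, given in hex.
	id, err := strconv.ParseUint("15e000cbe89e5629", 16, 64)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed endpoint; a real RKE2 etcd listens over TLS and requires the
	// etcd client certificates, which are omitted from this sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://100.64.0.30:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.MemberRemove(ctx, id); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("removed etcd member %x\n", id)
}
```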
Expected behavior:
The agent node (node4) should immediately reconnect to a healthy API server.
Actual behavior:
The agent node becomes "NotReady" for some time:
Additional context:
We found this issue in v1.28.6+rke2r1 when a server node had disk full errors. However, in v1.28.6+rke2r1, the agents were stuck in "NotReady" and could not recover until the unhealthy API server was killed.