k3s in worker node(agent) get OOM if api server is not reachable for a while #11346

liyimeng · 2024-11-20T10:14:53Z

Environmental Info:
K3s Version: 1.28.9

Node(s) CPU architecture, OS, and Version:

amd64, ubuntu 22.04

Cluster Configuration:

3 servers, 1 agent
Describe the bug:

When all server nodes go offline, agent not keep trying to reach api server, but failed. This is expected. However, if servers are not back on time, k3s agent will consume a significant ram and endup in the OOM killed, putting the traffics on the node into totally in-functional.
Steps To Reproduce:

Installed K3s: Install a cluster with 3 server and 1 agent, make sure all work as expected.
Turn off all server nodes.
Wait for a while(depend on available RAM), k3s service memory usage on agent node is observed increasing all the time, and end up in a OOM killed.
To make the bug popup more quickly, you can use an agent node with load RAM.

Expected behavior:

Even api server is offline, agent node should keep running as it is, keep the pods on it running.
Because in upstream k8s design, pods will keep running and will restart if they crash but the API will not be available so it will not be possible to run anything new or change them.
Actual behavior:

k3s server use out the system memory, make node unstable, and k3s itself get oom killed, user load get removed after k3s get restarted.

Additional context / logs:
attach screenshot show k3s get killed with oom

brandond · 2024-11-20T17:36:07Z

Likely duplicate of [Release-1.28] - Loadbalancer may panic due to race condition when selecting a new server #10320.

v1.28.9 is several months old, please update.

liyimeng · 2024-11-22T11:59:07Z

@brandond some misunderstanding here. What I report here is oom in worker node after I fix the panic, with the similar way you have done in #10320.

I am not complaining about panic, but the memory leakage after fixing panic.

Please re-open.

brandond · 2024-11-22T17:10:56Z

Please upgrade to a more recent release and confirm if you still see the issue. v1.28 is end of life as of last month.

liyimeng · 2024-11-24T11:18:31Z

@brandond Got it, I'll do the upgrade first.

liyimeng · 2024-12-01T17:24:09Z

@brandond Bad news, with 1.31.2, memory leakage still exists. it is not drastically, but stably leaking. In the picture, is a node running for about 36 hours, memory usage is growing from about 1% to 24%.

liyimeng · 2024-12-04T11:07:05Z

@brandond Can we re-open this one?

brandond · 2024-12-04T19:41:48Z

Is this when the apiserver is available, or just under normal operation? ~900MB of memory doesn't seem particularly excessive. Note that golang's garbage collector won't aggressively free memory unless it has to, so what you're seeing isn't necessarily a leak or even unusual.

Please start the agent with --enable-pprof, and then grab a profile once the memory has increased:
kubectl get --server https://AGENT-IP:6443 --raw /debug/pprof/heap?seconds=120 > heap.out

liyimeng · 2024-12-04T20:42:32Z

This only happens when api server is offline(agent is not able to reach server). If I leave it in this situation, it will end up with an oom killed when time goes long enough.

While api server is online, the node run for month without any issue.

I will come back when I get the log collected. It will take a while, at least a couple of hours.

brandond · 2024-12-04T22:31:21Z

If you can, grab both total and a delta heap profiles

kubectl get --server https://AGENT-IP:6443 --raw /debug/pprof/heap > heap.out
kubectl get --server https://AGENT-IP:6443 --raw /debug/pprof/heap?seconds=120 > heap.delta.out

liyimeng · 2025-01-08T04:45:19Z

@brandond we continue testing with 1.31.4, problem seems gone 👯

brandond · 2025-01-08T08:13:59Z

🤷

github-project-automation bot added this to K3s Development Nov 20, 2024

github-project-automation bot moved this to New in K3s Development Nov 20, 2024

liyimeng mentioned this issue Nov 20, 2024

Rework loadbalancer server selection logic #11329

Merged

brandond closed this as completed Nov 20, 2024

github-project-automation bot moved this from New to Done Issue in K3s Development Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

k3s in worker node(agent) get OOM if api server is not reachable for a while #11346

k3s in worker node(agent) get OOM if api server is not reachable for a while #11346

liyimeng commented Nov 20, 2024 •

edited

Loading

brandond commented Nov 20, 2024 •

edited

Loading

liyimeng commented Nov 22, 2024 •

edited

Loading

brandond commented Nov 22, 2024

liyimeng commented Nov 24, 2024

liyimeng commented Dec 1, 2024 •

edited

Loading

liyimeng commented Dec 4, 2024

brandond commented Dec 4, 2024 •

edited

Loading

liyimeng commented Dec 4, 2024

brandond commented Dec 4, 2024 •

edited

Loading

liyimeng commented Jan 8, 2025

brandond commented Jan 8, 2025

k3s in worker node(agent) get OOM if api server is not reachable for a while #11346

k3s in worker node(agent) get OOM if api server is not reachable for a while #11346

Comments

liyimeng commented Nov 20, 2024 • edited Loading

brandond commented Nov 20, 2024 • edited Loading

liyimeng commented Nov 22, 2024 • edited Loading

brandond commented Nov 22, 2024

liyimeng commented Nov 24, 2024

liyimeng commented Dec 1, 2024 • edited Loading

liyimeng commented Dec 4, 2024

brandond commented Dec 4, 2024 • edited Loading

liyimeng commented Dec 4, 2024

brandond commented Dec 4, 2024 • edited Loading

liyimeng commented Jan 8, 2025

brandond commented Jan 8, 2025

liyimeng commented Nov 20, 2024 •

edited

Loading

brandond commented Nov 20, 2024 •

edited

Loading

liyimeng commented Nov 22, 2024 •

edited

Loading

liyimeng commented Dec 1, 2024 •

edited

Loading

brandond commented Dec 4, 2024 •

edited

Loading

brandond commented Dec 4, 2024 •

edited

Loading