k3s in worker node(agent) get OOM if api server is not reachable for a while #11346

Closed
liyimeng opened this issue Nov 20, 2024 · 11 comments

@liyimeng
Contributor

liyimeng commented Nov 20, 2024

Environmental Info:
K3s Version: 1.28.9

Node(s) CPU architecture, OS, and Version:

amd64, ubuntu 22.04

Cluster Configuration:

3 servers, 1 agent
Describe the bug:

When all server nodes go offline, the agent keeps trying to reach the API server and fails. This is expected. However, if the servers do not come back in time, the k3s agent consumes a significant amount of RAM and ends up OOM-killed, leaving the workloads on the node completely non-functional.
Steps To Reproduce:

  • Installed K3s: Install a cluster with 3 servers and 1 agent, and make sure everything works as expected.
  • Turn off all server nodes.
  • Wait for a while (how long depends on available RAM). Memory usage of the k3s service on the agent node increases continuously and ends in an OOM kill.
    To make the bug appear more quickly, use an agent node with low RAM.
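
One way to confirm the growth described in the steps above is to sample the agent's resident memory at a fixed interval. The sketch below is only an illustration: the function names `rss_kb_from_status` and `log_rss` are made up for this example, and it assumes a Linux node where `/proc/<pid>/status` exposes a `VmRSS` line. The `k3s-agent` unit name in the usage comment is also an assumption about the install.

```shell
#!/bin/sh
# Print the resident set size (in kB) found in a /proc/<pid>/status-style file.
rss_kb_from_status() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Append "<epoch-seconds> <rss-kB>" for the given PID to a log file every
# <interval> seconds, until the process exits.
log_rss() {
    pid=$1
    interval=$2
    out=$3
    while [ -r "/proc/$pid/status" ]; do
        printf '%s %s\n' "$(date +%s)" "$(rss_kb_from_status "/proc/$pid/status")" >> "$out"
        sleep "$interval"
    done
}

# Usage (assumes a systemd unit named k3s-agent; adjust to your setup):
#   log_rss "$(systemctl show -p MainPID --value k3s-agent)" 60 k3s-rss.log
```

A log that keeps climbing while the servers are down, and stays flat while they are up, would support the behaviour reported here.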

Expected behavior:

Even when the API server is offline, the agent node should keep running as it is and keep its pods running.
In the upstream k8s design, pods keep running and are restarted if they crash; only the API is unavailable, so nothing new can be run and existing workloads cannot be changed.
Actual behavior:

The k3s service uses up the system memory, making the node unstable, and k3s itself gets OOM-killed. User workloads are removed after k3s restarts.

Additional context / logs:
The attached screenshot shows k3s being OOM-killed.

image

@brandond
Member

brandond commented Nov 20, 2024

v1.28.9 is several months old, please update.

@github-project-automation github-project-automation bot moved this from New to Done Issue in K3s Development Nov 20, 2024
@liyimeng
Contributor Author

liyimeng commented Nov 22, 2024

@brandond There is some misunderstanding here. What I am reporting is an OOM on the worker node after I fixed the panic, in a way similar to what you did in #10320.

I am not complaining about the panic, but about the memory leak that remains after fixing it.

Please re-open.

@brandond
Member

Please upgrade to a more recent release and confirm if you still see the issue. v1.28 is end of life as of last month.

@liyimeng
Contributor Author

@brandond Got it, I'll do the upgrade first.

@liyimeng
Contributor Author

liyimeng commented Dec 1, 2024

@brandond Bad news: with 1.31.2, the memory leak still exists. It is not drastic, but it leaks steadily. The picture shows a node that has been running for about 36 hours; memory usage has grown from about 1% to 24%.

Screenshot 2024-12-01 at 18 20 03

@liyimeng
Contributor Author

liyimeng commented Dec 4, 2024

@brandond Can we re-open this one?

@brandond
Member

brandond commented Dec 4, 2024

Is this when the apiserver is unavailable, or just under normal operation? ~900MB of memory doesn't seem particularly excessive. Note that golang's garbage collector won't aggressively free memory unless it has to, so what you're seeing isn't necessarily a leak or even unusual.

Please start the agent with --enable-pprof, and then grab a profile once the memory has increased:
kubectl get --server https://AGENT-IP:6443 --raw '/debug/pprof/heap?seconds=120' > heap.out
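
To tell live heap apart from memory the Go runtime is merely holding on to, the text form of the heap endpoint may help: with `debug=1`, the output ends with the runtime's MemStats as `# Name = value` comment lines. The sketch below assumes the agent serves that text form on the same endpoint as the profile above; `memstat` is a hypothetical helper name, and `AGENT-IP` is a placeholder as in the command above.

```shell
#!/bin/sh
# Print one MemStats field (in bytes) from a text heap dump read on stdin.
# Expects lines of the form "# HeapInuse = 123456".
memstat() {
    awk -v key="$1" '$1 == "#" && $2 == key && $3 == "=" { print $4 }'
}

# Usage (assumption: the agent serves the debug=1 text form on this port):
#   kubectl get --server https://AGENT-IP:6443 --raw '/debug/pprof/heap?debug=1' > heap.txt
#   memstat HeapInuse    < heap.txt
#   memstat HeapIdle     < heap.txt
#   memstat HeapReleased < heap.txt
```

Roughly, `HeapIdle` minus `HeapReleased` is memory the runtime retains but is not using. If `HeapInuse` keeps growing across samples, that points more toward a leak than toward lazy garbage collection.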

@liyimeng
Contributor Author

liyimeng commented Dec 4, 2024

This only happens when the API server is offline (the agent cannot reach the server). If I leave it in this state long enough, it ends in an OOM kill.

While the API server is online, the node runs for months without any issue.

I will come back once I have collected the logs. It will take a while, at least a couple of hours.

@brandond
Member

brandond commented Dec 4, 2024

If you can, grab both a total and a delta heap profile:

kubectl get --server https://AGENT-IP:6443 --raw /debug/pprof/heap > heap.out
kubectl get --server https://AGENT-IP:6443 --raw '/debug/pprof/heap?seconds=120' > heap.delta.out
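
The two requests above could be wrapped in a small helper so that both profiles are captured the same way each time. This is only a sketch: `heap_path` and `collect_heap` are hypothetical names, the agent address is passed in as an argument, and the path is quoted so the shell does not treat `?` as a glob character.

```shell
#!/bin/sh
# Build the pprof heap path. With an argument, request a delta profile
# sampled over that many seconds; without one, request the total profile.
heap_path() {
    if [ -n "${1:-}" ]; then
        printf '/debug/pprof/heap?seconds=%s\n' "$1"
    else
        printf '/debug/pprof/heap\n'
    fi
}

# Fetch a total and then a delta heap profile from one agent.
collect_heap() {
    agent=$1
    seconds=${2:-120}
    kubectl get --server "https://$agent:6443" --raw "$(heap_path)" > heap.out &&
    kubectl get --server "https://$agent:6443" --raw "$(heap_path "$seconds")" > heap.delta.out
}

# Usage:
#   collect_heap AGENT-IP 120
# Inspect the results with, for example:
#   go tool pprof -top heap.out
#   go tool pprof -top heap.delta.out
```

The total profile shows what is live at that moment, while the delta profile shows what changed during the sampling window, which is usually the more direct hint at what is growing.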

@liyimeng
Contributor Author

liyimeng commented Jan 8, 2025

@brandond We continued testing with 1.31.4, and the problem seems to be gone 👯

@brandond
Member

brandond commented Jan 8, 2025

🤷
