k3s in worker node(agent) get OOM if api server is not reachable for a while #11346
Comments
v1.28.9 is several months old; please update.
Please upgrade to a more recent release and confirm if you still see the issue. v1.28 is end of life as of last month.
@brandond Got it, I'll do the upgrade first.
@brandond Bad news: with 1.31.2, the memory leak still exists. It is not drastic, but it leaks steadily. The picture shows a node that has been running for about 36 hours; memory usage has grown from about 1% to 24%.
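For a rough sense of scale, the figures above (about 1% to 24% over about 36 hours) can be turned into a growth rate and a time-to-exhaustion estimate. This is only a sketch: it assumes the growth is linear and that the process is killed at 100% of node memory, neither of which is established in this thread. The function names are made up for illustration.

```python
def growth_per_hour(start_pct, end_pct, hours):
    """Average growth in percentage points per hour (assumes linear growth)."""
    return (end_pct - start_pct) / hours


def hours_until(limit_pct, current_pct, rate_pct_per_hour):
    """Hours until usage reaches limit_pct at a constant rate.

    Returns None when usage is not growing.
    """
    if rate_pct_per_hour <= 0:
        return None
    return max(0.0, (limit_pct - current_pct) / rate_pct_per_hour)


# Figures reported in the comment above: ~1% -> ~24% in ~36 hours.
rate = growth_per_hour(1.0, 24.0, 36.0)
# Assumption: the kill happens at 100% of node memory; in practice the
# kernel OOM killer may act earlier, depending on other workloads.
remaining = hours_until(100.0, 24.0, rate)

print(f"growth: {rate:.2f} percentage points/hour")
print(f"estimated hours until 100%: {remaining:.0f}")
```

Under those assumptions the rate is roughly 0.64 percentage points per hour, which would leave about 119 more hours before memory is exhausted. A node with less RAM would get there sooner.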
@brandond Can we re-open this one?
Is this when the apiserver is unavailable, or just under normal operation? ~900MB of memory doesn't seem particularly excessive. Note that golang's garbage collector won't aggressively free memory unless it has to, so what you're seeing isn't necessarily a leak or even unusual. Please start the agent with
This only happens when the API server is offline (the agent is not able to reach the server). If I leave it in this state long enough, it ends up OOM-killed. While the API server is online, the node runs for months without any issue. I will come back once I have collected the logs. It will take a while, at least a couple of hours.
If you can, grab both total and delta heap profiles
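As a sketch of what "total" and "delta" mean here: Go's standard `net/http/pprof` handler serves the current heap profile at `/debug/pprof/heap`, and treats a `seconds` query parameter as a request for a delta profile over that interval. The helper below only builds the two URLs. The base URL is a placeholder, and whether and where the k3s agent exposes this endpoint is not stated in this thread, so treat both as assumptions.

```python
from urllib.parse import urlencode


def heap_profile_url(base_url, delta_seconds=None):
    """Build a net/http/pprof heap profile URL.

    Without delta_seconds this is the total (current) heap profile.
    With delta_seconds it asks the handler for a delta profile
    covering that many seconds.
    """
    url = base_url.rstrip("/") + "/debug/pprof/heap"
    if delta_seconds is not None:
        url += "?" + urlencode({"seconds": int(delta_seconds)})
    return url


# Hypothetical base URL; replace with wherever the pprof handler is served.
BASE = "http://localhost:6060"

total_url = heap_profile_url(BASE)
delta_url = heap_profile_url(BASE, delta_seconds=30)

print(total_url)
print(delta_url)
```

Taking the total profile shows what is currently held, while the delta profile shows what grew during the sampling window, which is usually the more useful view for a slow leak.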
@brandond We continued testing with 1.31.4, and the problem seems to be gone 👯
🤷 |
Environmental Info:
K3s Version: 1.28.9
Node(s) CPU architecture, OS, and Version:
amd64, ubuntu 22.04
Cluster Configuration:
3 servers, 1 agent
Describe the bug:
When all server nodes go offline, the agent keeps trying to reach the API server and fails. This is expected. However, if the servers do not come back in time, the k3s agent consumes a significant amount of RAM and ends up OOM-killed, leaving the workloads on the node completely non-functional.
Steps To Reproduce:
To make the bug show up more quickly, you can use an agent node with low RAM.
Expected behavior:
Even if the API server is offline, the agent node should keep running as it is and keep its pods running.
In the upstream k8s design, pods keep running and are restarted if they crash; because the API is unavailable, it is just not possible to run anything new or change them.
Actual behavior:
k3s uses up the system memory and makes the node unstable, then k3s itself gets OOM-killed, and user workloads are removed after k3s restarts.
Additional context / logs:
The attached screenshot shows k3s being killed by the OOM killer.