Containerd sudden restart stops pods from initializing #1716
Comments
It's odd that we're seeing |
It seems to be quite random: at times we might have 2/3 nodes hitting this issue within a span of 10 minutes, then maybe 1 node after 2 hours, and another node after 5 hours.
I'll check the healthy nodes and see whether I can find the same behaviour in the containerd journalctl output.
Did you find anything out?
In the end, I disabled the toolkit operator and it solved my issue.
I missed the mention of GPU operator initially -- that will cause a restart of |
What happened:
containerd sometimes stops responding, and systemd then restarts the containerd service. When this happens, containers that should start running are sometimes left stuck, and the kubelet receives the following error:
Mar 08 09:26:47 Error: error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer
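For anyone trying to catch the moment the socket goes away, a minimal sketch of a connectivity probe follows. It only checks whether something is listening on the Unix socket path taken from the error above; it does not speak the containerd gRPC API, so an "ok" result does not prove containerd is healthy. The function name `probe_unix_socket` and the returned status strings are made up for illustration.

```python
import socket

# Socket path as it appears in the kubelet error message above.
CONTAINERD_SOCK = "/run/containerd/containerd.sock"


def probe_unix_socket(path: str = CONTAINERD_SOCK, timeout: float = 2.0) -> str:
    """Return "ok" if a listener accepts a connection on `path`, else a short reason.

    This is a hypothetical helper: it tests connectivity only, not whether
    the service behind the socket answers requests correctly.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "ok"
    except FileNotFoundError:
        # The socket file does not exist (service not started, or path wrong).
        return "missing"
    except ConnectionRefusedError:
        # The socket file exists but nothing is listening on it.
        return "refused"
    except OSError as exc:
        # Covers timeouts, permission errors and anything else.
        return f"error: {exc.strerror or exc}"
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe_unix_socket())
```

Running this in a loop on an affected node (as a user allowed to open the socket) and logging the result with a timestamp might help line up the outage with the systemd restart.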
What you expected to happen:
The containers start properly.
How to reproduce it (as minimally and precisely as possible):
There are no known steps. It happens intermittently: containerd stops working and systemd restarts the service.
Anything else we need to know?:
I have set up gpu-operator-v23.9.1
I have looked through the available journalctl and pod logs, but found nothing that explains why containerd stops working and needs a restart.
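One way to narrow the log search is to pull every journal line that falls close to the error timestamp, across units. The sketch below is a rough helper for that, assuming the text comes from something like `journalctl -u containerd -u kubelet --no-pager` in the default short format, where each line starts with a stamp such as `Mar 08 09:26:47`. The function name `lines_near` and the `year` default are assumptions; the short format carries no year, so one has to be supplied.

```python
from datetime import datetime, timedelta

# journalctl's default short format has no year, so it is prefixed manually.
_FMT = "%Y %b %d %H:%M:%S"


def lines_near(journal_text: str, when: str, window_s: int = 60, year: int = 2024):
    """Return journal lines stamped within `window_s` seconds of `when`.

    `when` uses the same "Mon DD HH:MM:SS" form as the journal lines.
    `year` is an assumption needed only to build a full datetime.
    """
    target = datetime.strptime(f"{year} {when}", _FMT)
    window = timedelta(seconds=window_s)
    hits = []
    for line in journal_text.splitlines():
        stamp = line[:15]  # e.g. "Mar 08 09:26:47"
        try:
            ts = datetime.strptime(f"{year} {stamp}", _FMT)
        except ValueError:
            # Skip lines without a leading timestamp (boot markers, wrapped output).
            continue
        if abs(ts - target) <= window:
            hits.append(line)
    return hits
```

Calling `lines_near(text, "Mar 08 09:26:47", window_s=30)` on the saved output would return only the lines from the 30 seconds before and after the kubelet error, which may show what else on the node touched containerd at that moment.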
Environment:
5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux