Containerd sudden restart stops pods from initializing #1716

Closed · Nightcro opened this issue Mar 8, 2024 · 6 comments

Nightcro commented Mar 8, 2024

What happened:
Containerd sometimes stops responding and systemd restarts the containerd service. When this happens, containers that should be starting are sometimes stuck, and kubelet receives the following error:
Mar 08 09:26:47 Error: error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer

What you expected to happen:
The containers start properly

How to reproduce it (as minimally and precisely as possible):
It happens intermittently: containerd stops working and systemd restarts the service. Logs from an affected node:

Mar 08 09:26:39 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:39.009195045Z" level=info msg="StartContainer for \"d4e5a065aa2b82e9e55d99a1e18ebc21612b6a47b9d88a2b78c254dcb88e305f\" returns successfully"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.558871785Z" level=info msg="TaskExit event container_id:\"42559dbacef4a6284a559cd375e07034d5acced691d0c5571a24a5be16613d4f\" id:\"42559dbacef4a6284a559cd375e07034d5acced691d0c5571a24a5be16613d4f\" pid:7044 exited_at:{seconds:1709889983 nanos:110845885}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.559033340Z" level=info msg="Ensure that container 42559dbacef4a6284a559cd375e07034d5acced691d0c5571a24a5be16613d4f in task-service has been cleanup successfully"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.585191830Z" level=info msg="ImageCreate event name:\"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.587087543Z" level=info msg="stop pulling image nvcr.io/nvidia/k8s-device-plugin:v0.14.4: active requests=0, bytes read=122703857"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.589296373Z" level=info msg="ImageCreate event name:\"sha256:0745b508898e2aa68f29a3c7f21023d03feace165b2430bc2297d250e65009e0\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.593900328Z" level=info msg="ImageUpdate event name:\"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.597983400Z" level=info msg="ImageCreate event name:\"nvcr.io/nvidia/k8s-device-plugin@sha256:2388c1f792daf3e810a6b43cdf709047183b50f5ec3ed476fae6aa0a07e68acc\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.600021734Z" level=info msg="Pulled image \"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" with image id \"sha256:0745b508898e2aa68f29a3c7f21023d03feace165b2430bc2297d250e65009e0\", repo tag \"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\", repo digest \"nvcr.io/nv
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.600066334Z" level=info msg="PullImage \"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" returns image reference \"sha256:0745b508898e2aa68f29a3c7f21023d03feace165b2430bc2297d250e65009e0\""
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.602778083Z" level=info msg="CreateContainer within sandbox \"987aa7badab0e155ce1eedb86b585c612c500069fd3bf035b3ce56733381abf4\" for container &ContainerMetadata{Name:config-manager-init,Attempt:0,}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.607551842Z" level=info msg="shim reaped" error="<nil>" id=c40672482a90ec7bb1a4565f38e67d688f31c9b112a74666ca2bdf9d99b7b0fd namespace=k8s.io
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.630196168Z" level=info msg="CreateContainer within sandbox \"987aa7badab0e155ce1eedb86b585c612c500069fd3bf035b3ce56733381abf4\" for &ContainerMetadata{Name:config-manager-init,Attempt:0,} returns container id \"f671e12d5dfd26abca6b4d9c4e8a20edb9146aa6f203d
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.630610838Z" level=info msg="StartContainer for \"f671e12d5dfd26abca6b4d9c4e8a20edb9146aa6f203ddb12206c699ad0079a9\""
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.631072744Z" level=warning msg="\"io.containerd.runtime.v1.linux\" is deprecated since containerd v1.4 and will be removed in containerd v2.0, use \"io.containerd.runc.v2\" instead"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.631776013Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/d7f0c28ba727aff56afd7a9a678052a171b6115629aaca72791e5ddb575b984b" debug=false error="<nil>" id=f671e12d5dfd26abca6b4d9c4e8a20edb9146aa6f203ddb12206c699ad0079a9 name
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.922054030Z" level=info msg="PullImage \"nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04\""
Mar 08 09:26:41 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:41.559075071Z" level=info msg="TaskExit event container_id:\"6498264b55ff945fd6798c81e9eeb1f5b6f61532d54a2fbf53f872262ed00311\" id:\"6498264b55ff945fd6798c81e9eeb1f5b6f61532d54a2fbf53f872262ed00311\" pid:7032 exited_at:{seconds:1709889983 nanos:132245534}"
Mar 08 09:26:41 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:41.559250077Z" level=info msg="Ensure that container 6498264b55ff945fd6798c81e9eeb1f5b6f61532d54a2fbf53f872262ed00311 in task-service has been cleanup successfully"
Mar 08 09:26:41 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:41.935339729Z" level=info msg="CreateContainer within sandbox \"df0a703ed60bddda5dbd2294fc4a0fbc085ee8fe6545936e103baa31e2617743\" for container &ContainerMetadata{Name:config-manager-init,Attempt:0,}"
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.668431383Z" level=info msg="CreateContainer within sandbox \"df0a703ed60bddda5dbd2294fc4a0fbc085ee8fe6545936e103baa31e2617743\" for &ContainerMetadata{Name:config-manager-init,Attempt:0,} returns container id \"0013adbc3ff0a8571a7e4e66d06b141d7e014b8a096b8
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.668960325Z" level=info msg="StartContainer for \"0013adbc3ff0a8571a7e4e66d06b141d7e014b8a096b82ba5cfcb7a0a32a0ea7\""
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.669480408Z" level=warning msg="\"io.containerd.runtime.v1.linux\" is deprecated since containerd v1.4 and will be removed in containerd v2.0, use \"io.containerd.runc.v2\" instead"
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.670233245Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/1d217093a15d1d5d3f492f42d83dd9a75cb834ebb94fdcf69e47d62ecf3e3e40" debug=false error="<nil>" id=0013adbc3ff0a8571a7e4e66d06b141d7e014b8a096b82ba5cfcb7a0a32a0ea7 name
Mar 08 09:26:52 i-node.eu-west-1.compute.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Mar 08 09:26:52 i-node.eu-west-1.compute.internal systemd[1]: Stopped containerd container runtime.
Mar 08 09:26:52 i-node.eu-west-1.compute.internal systemd[1]: Starting containerd container runtime...
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52Z" level=warning msg="containerd config version `1` has been deprecated and will be removed in containerd v2.0, please switch to version `2`, see https://github.com/containerd/containerd/blob/main/docs/PLUGINS.md#version-header"
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52.429617534Z" level=info msg="starting containerd" revision=64b8a811b07ba6288238eefc14d898ee0b5b99ba version=1.7.11
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52.449569749Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52.449600938Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1

Anything else we need to know?:
I have set up gpu-operator v23.9.1.
I have looked through the available journalctl and pod logs, but found nothing relevant to why containerd stops working and needs a restart.
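
For reference, this is the kind of journal query that can surface what happened right before the restart; the time window and commands below are illustrative, not taken from the affected node:

```sh
# containerd unit logs around the incident (adjust the window to the restart time)
journalctl -u containerd --since "2024-03-08 09:20" --until "2024-03-08 09:30" -o short-precise

# kernel messages from the same window, to rule out an OOM kill of containerd
journalctl -k --since "2024-03-08 09:20" --until "2024-03-08 09:30" | grep -iE "out of memory|oom|killed process"

# how many times the unit has been (re)started since boot
journalctl -u containerd -b | grep -c "Starting containerd container runtime"
```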

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version: eks.1
  • Kubernetes version: 1.29
  • AMI Version: amazon-eks-gpu-node-1.29-v20240227
  • Kernel: 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information:
BASE_AMI_ID="ami-059705a71ed021143"
BUILD_TIME="Tue Feb 13 05:12:29 UTC 2024"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"
@cartermckinnon (Member) commented:

It's odd that we're seeing `holdoff time over, scheduling restart` from systemd but no logs from the actual containerd crash. How often does this happen?
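
For context, `holdoff time over, scheduling restart` is what systemd logs once the service has already exited and the `RestartSec` holdoff has elapsed, so the interesting part is whatever made containerd exit just before that. A quick way to inspect the unit's restart policy and the recorded exit status (a sketch; property names follow systemd's service interface):

```sh
# restart policy and holdoff interval of the containerd unit
systemctl show containerd -p Restart -p RestartUSec

# exit code/signal of the last main process, as recorded by systemd
systemctl show containerd -p ExecMainCode -p ExecMainStatus

# lines immediately preceding the scheduled restart in this boot's journal
journalctl -u containerd -b | grep -B 20 "scheduling restart"
```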

Nightcro commented Mar 8, 2024

It seems to be quite random: at times 2-3 nodes hit this issue within a span of 10 minutes, then maybe one node two hours later, and another node five hours after that. I have not been able to pinpoint what is wrong.
The fact that I cannot find anything in the logs seems really odd. At the moment we don't have any other workload to test against; these GPU nodes are the only ones that scale up and down for us. I am not sure whether it is a combination of the gpu-operator and these nodes, or whether it could be happening on other nodes as well.

Nightcro commented Mar 8, 2024

I'll check other healthy nodes and see whether I can find the same behaviour in their containerd journalctl logs.

@tl-alex-nicot commented:

Did you find anything out?

@Nightcro (Author) commented:

In the end, I disabled the toolkit operator and it solved my issue.
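
For anyone else hitting this: assuming the operator was installed via its Helm chart, disabling the container toolkit component is a values override along these lines (the release name and namespace below are placeholders):

```sh
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=false
```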

@cartermckinnon (Member) commented:

I missed the mention of the GPU operator initially -- it causes a restart of containerd after the operator modifies the containerd config to use the NVIDIA runtime (a runc shim/fork). The GPU operator isn't necessary or helpful if you use our GPU AMI variants 👍
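
For illustration, the toolkit's change is along these lines: it adds an NVIDIA runtime entry to /etc/containerd/config.toml (a version 2 config is sketched below; the node above was still on a deprecated version 1 config) and then restarts containerd so the new runtime is picked up. On the EKS GPU AMI variants the NVIDIA runtime is already set up, which is why the operator's rewrite, and the restart it triggers, add nothing there.

```toml
# Sketch of the runtime entry the NVIDIA container toolkit typically injects
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```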
