
containerd support not working #699

Closed
8enmann opened this issue Jul 16, 2021 · 14 comments

@8enmann

8enmann commented Jul 16, 2021

What happened:

  • Built a new AMI from HEAD with make 1.20, based on Containerd runtime support #698 by @ravisinha0506
  • Built another AMI on top of that using Packer and a custom build script following these gvisor instructions
  • Launched a node with the bootstrap arg --container-runtime containerd (see the sketch after this list)
  • Launched a simple python-slim busybox pod on the new node

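For reference, a minimal user-data sketch of the node launch above (the cluster name and node label are placeholders; /etc/eks/bootstrap.sh is the script shipped in the EKS optimized AMI):

#!/bin/bash
# User-data sketch: "my-cluster" and the runtime=containerd label are placeholders.
set -euo pipefail
/etc/eks/bootstrap.sh my-cluster \
  --container-runtime containerd \
  --kubelet-extra-args '--node-labels=runtime=containerd'
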
I manually patched 3 files on the node, which can be seen here.
Summary:

  1. changed --container-runtime-endpoint to "unix:///run/containerd/containerd.sock" (from the dockershim socket)
  2. added /etc/cni/net.d/10-aws.conflist copied from a dockerd node, since the directory was otherwise empty
  3. updated /etc/containerd/config.toml to point at the containerd socket and add the runsc plugin (sketched below)
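
Roughly, the patched /etc/containerd/config.toml ends up like the sketch below (illustrative, not the exact file; the runsc runtime_type assumes the gVisor containerd-shim-runsc-v1 from the gVisor instructions linked above):

# Sketch of step 3: write the patched containerd config and restart containerd.
cat <<'EOF' | sudo tee /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
EOF
sudo systemctl restart containerd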

After making these changes I restarted the kubelet and everything worked (the pod came up and was usable). However, when I baked these changes into the image, I got the following errors:

  Warning  FailedCreatePodSandBox  6m49s                  kubelet             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9527c747728a1a4a4189f242e8df82eefecd76a94142318af5023a2fc88974f5": failed to find plugin "aws-cni" in path [/opt/cni/bin]
  Warning  FailedCreatePodSandBox  6m8s                   kubelet             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "741035515d411c8d65b9654a345359dc5ea445e96bc33ebe91f85e61192c5b4a": add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

When I ssh into the node, /opt/cni/bin has all the binaries you'd expect, including aws-cni.

I ran /opt/cni/bin/aws-cni-support.sh and can make the output available privately upon request.

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): r5.24xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.1"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.20
  • AMI Version: built from af6a02d
  • Kernel (e.g. uname -a): Linux ip-10-0-192-37.ec2.internal 5.4.129-62.227.amzn2.x86_64 #1 SMP Wed Jul 7 00:08:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
@ravisinha0506
Contributor

ravisinha0506 commented Jul 19, 2021

Hi @8enmann,
Could you update the container runtime endpoint as shown below and let us know if you are still facing this issue:

--container-runtime-endpoint unix:///run/dockershim.sock

The problem could be related to the CNI plugin's default mounted socket path, which points to unix:///run/dockershim.sock.
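
In other words, the kubelet endpoint and the containerd gRPC address should reference the same socket that the aws-node DaemonSet mounts. A quick way to verify both (sketch; unit and config paths as shipped in the EKS optimized AMI, adjust if yours differ):

# Both of these should reference the same socket (/run/dockershim.sock on the stock AMI).
grep container-runtime-endpoint /etc/systemd/system/kubelet.service
grep -A1 '\[grpc\]' /etc/containerd/config.toml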

@8enmann
Author

8enmann commented Jul 20, 2021

Fixed!!
I had to add a new DaemonSet with the modified values:

      - mountPath: /var/run/cri.sock
        name: dockershim
  ...
      - hostPath:
          path: /run/containerd/containerd.sock

I also added a nodeSelector to my new DaemonSet that is the strict complement of the old one's, so that both node types can continue to coexist in the cluster. I added the new label to my ASG launch template so new nodes come up with it.
Complete aws-node-cni-cri DaemonSet YAML: https://gist.github.com/8enmann/39519f02ded2a2cce8e673f9905d7ef8
Updated AMI patch (no longer needs the conflist): https://gist.github.com/8enmann/2124361516a5c7709a81efdc63321cb5
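
The core of the change, as a sketch (the runtime=containerd label name is illustrative; the full manifests are in the gists above):

# Fragment of the new aws-node-cni-cri DaemonSet. It runs only on containerd nodes and
# mounts the containerd socket at the path the CNI plugin expects; the original aws-node
# DaemonSet keeps the complementary (NotIn) node affinity so the two never overlap.
cat <<'EOF' > aws-node-cni-cri-fragment.yaml
spec:
  template:
    spec:
      nodeSelector:
        runtime: containerd
      containers:
      - name: aws-node
        volumeMounts:
        - mountPath: /var/run/cri.sock
          name: dockershim
      volumes:
      - name: dockershim
        hostPath:
          path: /run/containerd/containerd.sock
EOF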

@ravisinha0506
Contributor

@8enmann Thanks for sharing this.
However, did you try using unix:///run/dockershim.sock as the container-runtime-endpoint in kubelet and as the grpc address in containerd's config.toml? With that approach, we could avoid deploying another DaemonSet with modified values.

@kpanic9

kpanic9 commented Jul 27, 2021

Hi @ravisinha0506 ,
I'm having a similar issue when starting the kubelet on an AMI with containerd. It seems kubelet is looking for /run/dockershim.sock and failing to start because that socket is not present.

grpc: addrConn.createTransport failed to connect to {/run/dockershim.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /run/dockershim.sock: connect: no such file or directory". Reconnecting...

I can confirm that the kubelet service and containerd config are configured this way. But do we need to use dockershim even after moving to containerd? The changes proposed by @8enmann solve the issue, so should they be included in the AMI?

Thanks,
Namesh

@TBBle

TBBle commented Jul 27, 2021

The suggestion from @ravisinha0506 is not to keep using dockershim, but to keep using the socket named dockershim.sock, so that the DaemonSet config doesn't need to differ between containerd and dockershim; only what's listening on the other end of that socket changes.

@ravisinha0506
Contributor

ravisinha0506 commented Jul 27, 2021

@kpanic9 We don't need to use dockershim with the latest AMI. We re-use the socket name dockershim.sock to avoid any changes in the DaemonSet to differentiate between the container runtimes running on the node.
Just pass the additional bootstrap argument --container-runtime containerd and that should start containerd on the node. Let us know if you are still facing any issues.

@kpanic9

kpanic9 commented Jul 28, 2021

Hi @TBBle and @ravisinha0506 ,

Thank you for your quick response on the issue.
To give an overview of what happened: I have an EKS cluster running version 1.20 in the ap-southeast-2 region. I took the EKS optimized AMI (ami-0718ef1c4a20afb11) provided by AWS, installed a few tools on it with our AMI baking process, and created a custom AMI. I then used this custom AMI to create new worker nodes and made sure --container-runtime containerd is passed to the EKS worker node bootstrap script in the user data. However, the new worker nodes didn't join the cluster, and I found that the kubelet process keeps crashing with the log message below.

grpc: addrConn.createTransport failed to connect to {/run/dockershim.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /run/dockershim.sock: connect: no such file or directory". Reconnecting...

Further investigation showed that kubelet.service is configured to use /run/dockershim.sock for talking to the container runtime, but that socket is not present on the worker node.

ExecStart=/usr/bin/kubelet --cloud-provider aws \
    --config /etc/kubernetes/kubelet/kubelet-config.json \
    --kubeconfig /var/lib/kubelet/kubeconfig \
    --container-runtime remote \
    --container-runtime-endpoint unix:///run/dockershim.sock \
    --network-plugin cni $KUBELET_ARGS $KUBELET_EXTRA_ARGS

Also, on the new EKS worker nodes the containerd config file is configured as below:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/dockershim.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

But if I replace /run/dockershim.sock in both config files and restart containerd and kubelet, the worker node joins the EKS cluster. Please let me know if I have done anything wrong, or what needs to change on my side so that workers running containerd can join the cluster without manual intervention.

Thanks,
Namesh

@ravisinha0506
Contributor

ravisinha0506 commented Jul 28, 2021

Hi @kpanic9,

  • Do you see containerd coming up on the node when kubelet fails? Could you share the output of systemctl status containerd?
  • Could you try using the EKS AMI directly to see if containerd nodes come up properly? Just pass --container-runtime containerd and that should be it.

@kpanic9

kpanic9 commented Jul 29, 2021

Hi @ravisinha0506 ,

containerd starts without any issues on the nodes that fail to join the cluster. Below is the output of systemctl status containerd on one of them.

● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-07-29 13:08:12 AEST; 1min 8s ago
     Docs: https://containerd.io
 Main PID: 3473 (containerd)
    Tasks: 15
   Memory: 62.0M
   CGroup: /system.slice/containerd.service
           └─3473 /usr/bin/containerd

Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.176876802+10:00" level=info msg="Start subscribing containerd event"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177408640+10:00" level=info msg="Start recovering state"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177823567+10:00" level=info msg="Start event monitor"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177979709+10:00" level=info msg="Start snapshots syncer"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177998939+10:00" level=info msg="Start cni network conf syncer"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178008740+10:00" level=info msg="Start streaming server"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178350055+10:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178455527+10:00" level=info msg=serving... address=/run/containerd/containerd.sock
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178643029+10:00" level=info msg="containerd successfully booted in 0.062660s"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal systemd[1]: Started containerd container runtime.

Also, I tried using the latest EKS AMI from AWS (ID: ami-0718ef1c4a20afb11), but it doesn't solve the issue; workers still fail to join the cluster with the same error message in the kubelet logs.

Thanks,

@TBBle

TBBle commented Jul 29, 2021

Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178455527+10:00" level=info msg=serving... address=/run/containerd/containerd.sock

This is probably the problem. The config.toml applied by --container-runtime containerd is supposed to change that to /run/dockershim.sock. And that setting is visible in your earlier post, so the file was copied correctly.

Is it possible that containerd was started before bootstrap.sh had run, and so was running without the updated config? If you restart containerd, does it then report the correct serving... address?
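
Something along these lines should show it (sketch; greps the same journal output you quoted above):

sudo systemctl restart containerd
sudo journalctl -u containerd --since "2 minutes ago" | grep 'msg=serving'
# Expected once bootstrap.sh's config is picked up: address=/run/dockershim.sock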

@ravisinha0506
Contributor

Looking at the containerd logs, it looks like it's using the default containerd socket path:

Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178455527+10:00" level=info msg=serving... address=/run/containerd/containerd.sock

This can happen if one of the tools you installed has a dependency on containerd being up and running. Could you check /var/log/cloud-init-output.log and see if the log timestamps indicate that containerd was launched prior to the bootstrap.sh invocation?
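
One way to check, as a sketch (compares containerd's start time against the config file's last write):

systemctl show containerd -p ActiveEnterTimestamp   # when containerd last (re)started
stat -c '%y  %n' /etc/containerd/config.toml        # when bootstrap.sh wrote the config
# If containerd started before the config was written, it is running with the default config.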

Also, could you please share the error log from the EKS optimized AMI with the --container-runtime containerd param, along with the AMI ID used?

@kpanic9

kpanic9 commented Aug 2, 2021

Hi @ravisinha0506 and @TBBle ,

You were both right: in the user data, containerd was being started before the bootstrap script was executed. I have updated the user data script for the worker nodes, and after that the containerd worker nodes start up without any issues.
Thank you very much for the assistance.

Thanks,

@TBBle

TBBle commented Aug 3, 2021

I don't know systemd off-hand, but maybe the systemctl start containerd needs to be systemctl restart containerd, so that if containerd is already running it restarts and picks up the config file change. I see this is already done with systemctl restart docker in the non-containerd code path.
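
i.e., something along these lines in the containerd code path (sketch, assuming the script currently uses start):

sudo systemctl daemon-reload
sudo systemctl enable containerd
sudo systemctl restart containerd   # restart instead of start, so an already-running containerd re-reads config.toml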

@ravisinha0506
Contributor

All the issues discussed in this thread should be fixed with the latest EKS AMI post v20211013 release. Closing this issue. Please feel free to create a new one or re-open if there are any other concerns.
