
containerd support not working #699

Closed
8enmann opened this issue Jul 16, 2021 · 14 comments

@8enmann

8enmann commented Jul 16, 2021

What happened:

  • Built a new AMI from HEAD with make 1.20, based on Containerd runtime support #698 by @ravisinha0506
  • Built another AMI on top of that using Packer and a custom build script following these gvisor instructions
  • Launched a node with the bootstrap arg --container-runtime containerd (see the sketch after this list)
  • Launched a simple python-slim busybox pod on the new node

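For reference, a minimal user-data sketch of the node launch above (the cluster name and node label are placeholders; /etc/eks/bootstrap.sh is the script shipped in the EKS optimized AMI):

#!/bin/bash
# User-data sketch: "my-cluster" and the runtime=containerd label are placeholders.
set -euo pipefail
/etc/eks/bootstrap.sh my-cluster \
  --container-runtime containerd \
  --kubelet-extra-args '--node-labels=runtime=containerd'
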
I manually patched 3 files on the node, which can be seen here.
Summary:

  1. changed --container-runtime-endpoint to "unix:///run/containerd/containerd.sock" (from the dockershim socket)
  2. added /etc/cni/net.d/10-aws.conflist copied from a dockerd node, since the directory was otherwise empty
  3. updated /etc/containerd/config.toml to point at the containerd socket and add the runsc plugin (sketched below)
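
Roughly, the patched /etc/containerd/config.toml ends up like the sketch below (illustrative, not the exact file; the runsc runtime_type assumes the gVisor containerd-shim-runsc-v1 from the gVisor instructions linked above):

# Sketch of step 3: write the patched containerd config and restart containerd.
cat <<'EOF' | sudo tee /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
EOF
sudo systemctl restart containerd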

After making these changes I restarted the kubelet and everything worked (the pod came up and was usable). However, when I baked these changes into the image, I got the following errors:

  Warning  FailedCreatePodSandBox  6m49s                  kubelet             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9527c747728a1a4a4189f242e8df82eefecd76a94142318af5023a2fc88974f5": failed to find plugin "aws-cni" in path [/opt/cni/bin]
  Warning  FailedCreatePodSandBox  6m8s                   kubelet             Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "741035515d411c8d65b9654a345359dc5ea445e96bc33ebe91f85e61192c5b4a": add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

When I ssh into the node, /opt/cni/bin has all the binaries you'd expect, including aws-cni.

I ran /opt/cni/bin/aws-cni-support.sh and can make the output available privately upon request.

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): r5.24xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.1"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.20
  • AMI Version: built from af6a02d
  • Kernel (e.g. uname -a): Linux ip-10-0-192-37.ec2.internal 5.4.129-62.227.amzn2.x86_64 #1 SMP Wed Jul 7 00:08:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
@ravisinha0506
Contributor

ravisinha0506 commented Jul 19, 2021

Hi @8enmann,
Could you update the container runtime endpoint as shown below and let us know if you are still facing this issue:

--container-runtime-endpoint unix:///run/dockershim.sock

The problem could be related to the CNI plugin's default mounted socket path, which points to unix:///run/dockershim.sock.
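
In other words, the kubelet endpoint and the containerd gRPC address should reference the same socket that the aws-node DaemonSet mounts. A quick way to verify both (sketch; unit and config paths as shipped in the EKS optimized AMI, adjust if yours differ):

# Both of these should reference the same socket (/run/dockershim.sock on the stock AMI).
grep container-runtime-endpoint /etc/systemd/system/kubelet.service
grep -A1 '\[grpc\]' /etc/containerd/config.toml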

@8enmann
Author

8enmann commented Jul 20, 2021

Fixed!!
I had to add a new DaemonSet with the modified values:

      - mountPath: /var/run/cri.sock
        name: dockershim
  ...
      - hostPath:
          path: /run/containerd/containerd.sock

I also added a nodeSelector to my new DaemonSet that is the strict complement of the old one's, so that both node types can continue to coexist in the cluster. I added the new label to my ASG launch template so new nodes come up with it.
Complete aws-node-cni-cri DaemonSet YAML: https://gist.github.com/8enmann/39519f02ded2a2cce8e673f9905d7ef8
Updated AMI patch (no longer needs the conflist): https://gist.github.com/8enmann/2124361516a5c7709a81efdc63321cb5
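
The core of the change, as a sketch (the runtime=containerd label name is illustrative; the full manifests are in the gists above):

# Fragment of the new aws-node-cni-cri DaemonSet. It runs only on containerd nodes and
# mounts the containerd socket at the path the CNI plugin expects; the original aws-node
# DaemonSet keeps the complementary (NotIn) node affinity so the two never overlap.
cat <<'EOF' > aws-node-cni-cri-fragment.yaml
spec:
  template:
    spec:
      nodeSelector:
        runtime: containerd
      containers:
      - name: aws-node
        volumeMounts:
        - mountPath: /var/run/cri.sock
          name: dockershim
      volumes:
      - name: dockershim
        hostPath:
          path: /run/containerd/containerd.sock
EOF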

@ravisinha0506
Contributor

@8enmann Thanks for sharing this.
However, did you try using unix:///run/dockershim.sock as the container-runtime-endpoint in kubelet and as the grpc address in containerd's config.toml? With that approach, we could avoid deploying another DaemonSet with modified values.

@kpanic9

kpanic9 commented Jul 27, 2021

Hi @ravisinha0506 ,
I'm having a similar issue when starting the kubelet on an AMI with containerd. It seems kubelet is looking for /run/dockershim.sock and failing to start because that socket is not present.

grpc: addrConn.createTransport failed to connect to {/run/dockershim.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /run/dockershim.sock: connect: no such file or directory". Reconnecting...

I can confirm that the kubelet service and containerd config are configured this way. But do we need to use dockershim even after moving to containerd? The changes proposed by @8enmann solve the issue, so should they be included in the AMI?

Thanks,
Namesh

@TBBle

TBBle commented Jul 27, 2021

The suggestion from @ravisinha0506 is not to keep using dockershim, but to keep using the socket named dockershim.sock, so that the DaemonSet config doesn't need to differ between containerd and dockershim; only what's listening on the other end of that socket changes.

@ravisinha0506
Contributor

ravisinha0506 commented Jul 27, 2021

@kpanic9 We don't need to use dockershim with the latest AMI. We re-use the socket name dockershim.sock to avoid any changes in the DaemonSet to differentiate between the container runtimes running on the node.
Just pass the additional bootstrap argument --container-runtime containerd and that should start containerd on the node. Let us know if you are still facing any issues.

@kpanic9

kpanic9 commented Jul 28, 2021

Hi @TBBle and @ravisinha0506 ,

Thank you for your quick response on the issue.
To give an overview of what happened: I have an EKS cluster running version 1.20 in the ap-southeast-2 region. I took the EKS optimized AMI (ami-0718ef1c4a20afb11) provided by AWS, installed a few tools on it with our AMI baking process, and created a custom AMI. I then used this custom AMI to create new worker nodes and made sure --container-runtime containerd is passed to the EKS worker node bootstrap script in the user data. However, the new worker nodes didn't join the cluster, and I found that the kubelet process keeps crashing with the log message below.

grpc: addrConn.createTransport failed to connect to {/run/dockershim.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /run/dockershim.sock: connect: no such file or directory". Reconnecting...

Further investigation showed that kubelet.service is configured to use /run/dockershim.sock for talking to the container runtime, but that socket is not present on the worker node.

ExecStart=/usr/bin/kubelet --cloud-provider aws \
    --config /etc/kubernetes/kubelet/kubelet-config.json \
    --kubeconfig /var/lib/kubelet/kubeconfig \
    --container-runtime remote \
    --container-runtime-endpoint unix:///run/dockershim.sock \
    --network-plugin cni $KUBELET_ARGS $KUBELET_EXTRA_ARGS

Also, on the new EKS worker nodes the containerd config file is configured as below:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/dockershim.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

But if I replace /run/dockershim.sock in both config files and restart containerd and kubelet, the worker node joins the EKS cluster. Please let me know if I have done anything wrong, or what needs to change on my side so that workers running containerd can join the cluster without manual intervention.

Thanks,
Namesh

@ravisinha0506
Contributor

ravisinha0506 commented Jul 28, 2021

Hi @kpanic9,

  • Do you see containerd coming up on the node when kubelet fails? Could you share the output of systemctl status containerd?
  • Could you try using the EKS AMI directly to see if containerd nodes come up properly? Just pass --container-runtime containerd and that should be it.

@kpanic9

kpanic9 commented Jul 29, 2021

Hi @ravisinha0506 ,

containerd starts without any issues on the nodes that fail to join the cluster. Below is the output of systemctl status containerd on one of them.

● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-07-29 13:08:12 AEST; 1min 8s ago
     Docs: https://containerd.io
 Main PID: 3473 (containerd)
    Tasks: 15
   Memory: 62.0M
   CGroup: /system.slice/containerd.service
           └─3473 /usr/bin/containerd

Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.176876802+10:00" level=info msg="Start subscribing containerd event"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177408640+10:00" level=info msg="Start recovering state"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177823567+10:00" level=info msg="Start event monitor"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177979709+10:00" level=info msg="Start snapshots syncer"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.177998939+10:00" level=info msg="Start cni network conf syncer"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178008740+10:00" level=info msg="Start streaming server"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178350055+10:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178455527+10:00" level=info msg=serving... address=/run/containerd/containerd.sock
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178643029+10:00" level=info msg="containerd successfully booted in 0.062660s"
Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal systemd[1]: Started containerd container runtime.

Also, I tried using the latest EKS AMI from AWS (ID: ami-0718ef1c4a20afb11), but it doesn't solve the issue; workers still fail to join the cluster with the same error message in the kubelet logs.

Thanks,

@TBBle

TBBle commented Jul 29, 2021

Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178455527+10:00" level=info msg=serving... address=/run/containerd/containerd.sock

This is probably the problem. The config.toml applied by --container-runtime containerd is supposed to change that to /run/dockershim.sock. And that setting is visible in your earlier post, so the file was copied correctly.

Is it possible that containerd was started before bootstrap.sh had run, and so was running without the updated config? If you restart containerd, does it then report the correct serving... address?
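
Something along these lines should show it (sketch; greps the same journal output you quoted above):

sudo systemctl restart containerd
sudo journalctl -u containerd --since "2 minutes ago" | grep 'msg=serving'
# Expected once bootstrap.sh's config is picked up: address=/run/dockershim.sock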

@ravisinha0506
Contributor

Looking at the containerd logs, it looks like it's using the default containerd socket path:

Jul 29 13:08:12 ip-192-168-73-148.ap-southeast-2.compute.internal containerd[3473]: time="2021-07-29T13:08:12.178455527+10:00" level=info msg=serving... address=/run/containerd/containerd.sock

This can happen if one of the tools you installed has a dependency on containerd being up and running. Could you check /var/log/cloud-init-output.log and see if the log timestamps indicate that containerd was launched prior to the bootstrap.sh invocation?
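
One way to check, as a sketch (compares containerd's start time against the config file's last write):

systemctl show containerd -p ActiveEnterTimestamp   # when containerd last (re)started
stat -c '%y  %n' /etc/containerd/config.toml        # when bootstrap.sh wrote the config
# If containerd started before the config was written, it is running with the default config.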

Also, could you please share the error log from the EKS optimized AMI with the --container-runtime containerd param, along with the AMI ID used?

@kpanic9

kpanic9 commented Aug 2, 2021

Hi @ravisinha0506 and @TBBle ,

You were both right: in the user data, containerd was being started before the bootstrap script was executed. I have updated the user data script for the worker nodes, and after that the containerd worker nodes start up without any issues.
Thank you very much for the assistance.

Thanks,

@TBBle

TBBle commented Aug 3, 2021

I don't know systemd off-hand, but maybe the systemctl start containerd needs to be systemctl restart containerd, so that if containerd is already running it restarts and picks up the config file change. I see this is already done with systemctl restart docker in the non-containerd code path.
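
i.e., something along these lines in the containerd code path (sketch, assuming the script currently uses start):

sudo systemctl daemon-reload
sudo systemctl enable containerd
sudo systemctl restart containerd   # restart instead of start, so an already-running containerd re-reads config.toml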

@ravisinha0506
Contributor

All the issues discussed in this thread should be fixed with the latest EKS AMI post v20211013 release. Closing this issue. Please feel free to create a new one or re-open if there are any other concerns.
