Setting up containerd with eksctl (with preBootstrapCommands) #3572

RobertLucian · 2021-04-13T22:46:35Z

I've already gone through most of the tickets that touch on the following approach, but couldn't find anything helpful.

High level

eksctl version: 0.40.0
cni: 1.7.2

We're looking to use containerd as the container runtime on our clusters. The way we thought of doing that is by adding the following commands to the preBootstrapCommands section:

preBootstrapCommands:
 - yum install containerd -y
 - truncate -s-1 /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf
 - echo -n ' --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock' >> /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf
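
To make the effect of the last two commands concrete, here is a minimal sketch that applies the same truncate-then-append trick to a throwaway file instead of the real drop-in. The function name `append_kubelet_flags` and the sample file contents are illustrative assumptions; only the `truncate` and `echo -n` calls come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the final newline of a file and append extra
# kubelet flags to what was its last line.
append_kubelet_flags() {
  local file="$1"
  local flags="$2"
  # Drop the last byte (the trailing newline), as `truncate -s-1` does.
  truncate -s -1 "$file"
  # Append the flags without adding a newline of our own.
  echo -n "$flags" >> "$file"
}

# Demo on a temporary file holding the tail of an ExecStart line (sample data).
demo_file="$(mktemp)"
printf '  --kubeconfig=/etc/eksctl/kubeconfig.yaml \\\n  --config=/etc/eksctl/kubelet.yaml\n' > "$demo_file"
append_kubelet_flags "$demo_file" ' --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock'
last_line="$(tail -n 1 "$demo_file")"
echo "$last_line"
```

Note that this only appends: any earlier `--container-runtime=docker` flag stays in the file, so the approach relies on the later flag winning, and it assumes the file's last byte is a newline.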

Spinning up a new cluster ends up timing out because the workers of that node group never become ready. I can see the nodes in kubectl get nodes, but they are marked as NotReady. Describing them at least shows containerd as the runtime. This is the error message from the describe output:

KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Looking further at the pods, this is the list that I get (notice that the aws-node and coredns pods are not ready):

(work) ubuntu@ip-122-41-64-11:~/github/cortex$ kubectl get pods -A
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   aws-node-298mn             0/1     Running   0          20m
kube-system   aws-node-59ftb             0/1     Running   0          20m
kube-system   coredns-556765db45-ffcqq   0/1     Pending   0          27m
kube-system   coredns-556765db45-vrkqw   0/1     Pending   0          27m
kube-system   kube-proxy-2zmsf           1/1     Running   0          20m
kube-system   kube-proxy-drrkp           1/1     Running   0          20m

Inspecting aws-node-298mn further gets me these logs:

{"level":"info","ts":"2021-04-13T22:16:02.009Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-04-13T22:16:02.022Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-04-13T22:16:02.023Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}

Describing aws-node-298mn gets me this error:

Warning FailedCreatePodSandBox 37s (x17 over 4m4s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "60925fddf3daa50de73d593f885102ab3cc707b0e6839eef4c44a363316fa166": add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

Individual node inspection

SSHing into a node at random, I see that the containerd and kubelet services are active. Doing a systemctl cat kubelet gets me:

# /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=docker.service iptables-restore.service
Requires=docker.service

[Service]
ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5
ExecStart=/usr/bin/kubelet --cloud-provider aws \
    --config /etc/kubernetes/kubelet/kubelet-config.json \
    --kubeconfig /var/lib/kubelet/kubeconfig \
    --container-runtime docker \
    --network-plugin cni $KUBELET_ARGS $KUBELET_EXTRA_ARGS \
    --container-runtime=remote \
    --container-runtime-endpoint=unix:///run/containerd/containerd.sock
Restart=always
RestartSec=5
KillMode=process

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf
# eksctl-specific systemd drop-in unit for kubelet, for Amazon Linux 2 (AL2)

[Service]
# Local metadata parameters: REGION, AWS_DEFAULT_REGION
EnvironmentFile=/etc/eksctl/metadata.env
# Global and static parameters: CLUSTER_DNS, NODE_LABELS, NODE_TAINTS
EnvironmentFile=/etc/eksctl/kubelet.env
# Local non-static parameters: NODE_IP, INSTANCE_ID
EnvironmentFile=/etc/eksctl/kubelet.local.env

ExecStart=
ExecStart=/usr/bin/kubelet \
  --node-ip=${NODE_IP} \
  --node-labels=${NODE_LABELS},alpha.eksctl.io/instance-id=${INSTANCE_ID} \
  --max-pods=${MAX_PODS} \
  --register-node=true --register-with-taints=${NODE_TAINTS} \
  --cloud-provider=aws \
  --container-runtime=docker \
  --network-plugin=cni \
  --cni-bin-dir=/opt/cni/bin \
  --cni-conf-dir=/etc/cni/net.d \
  --pod-infra-container-image=${AWS_EKS_ECR_ACCOUNT}.dkr.ecr.${AWS_DEFAULT_REGION}.${AWS_SERVICES_DOMAIN}/eks/pause:3.3-eksbuild.1 \
  --kubeconfig=/etc/eksctl/kubeconfig.yaml \
  --config=/etc/eksctl/kubelet.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock

Looking at the logs of the kubelet service using journalctl -lfu kubelet gets me these errors:

Apr 13 21:18:27 ip-192-168-75-3.us-west-2.compute.internal kubelet[13797]: E0413 21:18:27.346548   13797 kubelet.go:2195] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Apr 13 21:18:28 ip-192-168-75-3.us-west-2.compute.internal kubelet[13797]: E0413 21:18:28.728896   13797 remote_runtime.go:351] ExecSync 16805590a6c7b8ab81730cfc2583beb9a9681a66b9b565a28d1b73b1af806fbe '/app/grpc-health-probe -addr=:50051' from runtime service failed: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded
Apr 13 21:18:29 ip-192-168-75-3.us-west-2.compute.internal kubelet[13797]: E0413 21:18:29.052316   13797 remote_runtime.go:351]

And then looking at /etc/cni/net.d I see that the directory is empty, which I think shouldn't be the case.
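
A small check like the following can make the "is the CNI config there yet?" question scriptable on a node. It is only a sketch: the helper name `cni_config_present` and the `CNI_CONF_DIR` override are made up for illustration, and the list of file extensions is an assumption about what the runtime treats as a CNI network config.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the directory holds at least one CNI
# network config file (*.conf, *.conflist or *.json; assumed extensions).
cni_config_present() {
  local dir="$1"
  local f
  for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
    # An unmatched glob stays literal, so check that the path really exists.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Defaults to the --cni-conf-dir value used by the kubelet unit above.
if cni_config_present "${CNI_CONF_DIR:-/etc/cni/net.d}"; then
  echo "CNI config found"
else
  echo "no CNI config yet"
fi
```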

Interesting reveal

If I don't add those commands to the preBootstrapCommands section, then the cluster will get provisioned successfully. SSHing into each instance and making the same exact modifications as those in the preBootstrapCommands section, followed by systemctl daemon-reload && systemctl restart kubelet will give me a fully functional cluster that's using containerd as the runtime.

Does anyone know what's wrong with this picture? Help on this would be greatly appreciated!


Callisto13 · 2021-04-14T08:26:29Z

Hi @RobertLucian thanks for asking and providing lots of good detail.

This does seem like a matter of ordering and what-exists-where-at-what-time, although other vague threads on the interwebz do indicate other "things" 🤷‍♀️ . I have not tried to do this personally so have no idea 😄 .

I will hopefully have time to play around with this soon, I am interested to see what is going on. Anyone else is free to jump in here!

In the meantime, according to this AWS thread on making containerd a legit option on AL2/Ubuntu EKS nodes, it seems Bottlerocket was added for this purpose, so if you are interested you could try that. The conversation around making containerd easily usable on AL2/Ubuntu has kind of stalled, so I don't know what the plan is there.

RobertLucian · 2021-04-14T18:11:42Z

Hi @Callisto13 thanks a lot for coming back with a reply! This is much appreciated!

This does seem like a matter of ordering and what-exists-where-at-what-time, although other vague threads on the interwebz do indicate other "things" 🤷‍♀️

With my limited context, yes I'd say it does feel like that would be the case.

Hmm, yes, Bottlerocket might be an alternative. Do you know if there are any disadvantages to using Bottlerocket as opposed to going with the AL2/Ubuntu EKS-optimized AMI images? Are GPUs/Inferentia nodes (and workloads) supported? And do you know if their AMIs are available in all regions?

Thanks for your response again! And I'll be waiting for an update from you. Let me know if there's anything I can do here.

Callisto13 · 2021-04-21T08:27:56Z

Do you know if there are any disadvantages to using Bottlerocket as opposed to going with the AL2/Ubuntu EKS-optimized AMI images?

Off the top of my head I am not sure, @aclevername did you come across anything while you were working on that volume thing?

Are GPUs/Inferentia nodes (and workloads) supported?

From the looks of things this is something they are still working out 😞 .

And do you know if their AMIs are available in all regions?

The list of supported regions can be seen here, coverage seems to be pretty decent.

I'll start running some hacky things on this today and see what I come up with 👍 . Can you confirm what AMI or instance type you are using?

Callisto13 · 2021-04-21T16:52:36Z

Initial notes after some quick poking around:

Confirmed that this happens:

If I don't add those commands to the preBootstrapCommands section, then the cluster will get provisioned successfully.
SSHing into each instance and making the same exact modifications as those in the preBootstrapCommands section,
followed by systemctl daemon-reload && systemctl restart kubelet will give me a fully functional cluster that's using
containerd as the runtime.

(The yum install containerd is not necessary since it is already there as docker uses it internally.)

kubectl describe node <node> says Container Runtime Version: containerd://1.4.4, and from what I can see the kubelet is fine.
The weird thing is that when I rebuilt eksctl to just have the --container-runtime=remote and --container-runtime-endpoint flags in the 10-eksclt.al2.conf from the start, kubectl describe node <node> again says Container Runtime Version: containerd://1.4.4, but the kubelet fails with the same errors as in the description:
```
Exec <foo> '/app/grpc-health-probe -addr=:50051' from runtime service failed: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error:  cni plugin not initialized
```
So I would guess there is maybe something happening with the docker setup which is no longer done. Either that or the success above is a false positive, and if left long enough it would fall over.
There is pretty much nothing on google about people swapping the runtime of their EKS nodes, beyond a thread asking AWS to support containerd for EKS, and following other guides on how to swap. I had a chat with someone on the EKS team, and they are not currently aware of anyone swapping runtimes on AL2/Ubuntu nodes. They also mentioned that there is some discussion about dedicated containerd AMIs in the works.
One last thing to note is that docker is being deprecated in favour of containerd from k8s 1.20. This should be out in EKS in a couple of weeks, so I am curious to see what runtime they have going on there.

github-actions · 2021-05-22T01:56:15Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

cw-sakamoto · 2021-05-24T08:59:33Z

We need to change the aws-cni setting for containerd. Do you have this setting?
https://github.com/aws/amazon-vpc-cni-k8s#container-runtime

After making this change, it worked with the preBootstrapCommands below.
(eksctl 0.51.0)

preBootstrapCommands:
  - "mkdir -p /etc/containerd"
  - "containerd config default > /tmp/containerd-config.toml"
  - "cp /tmp/containerd-config.toml /etc/containerd/config.toml"
  - "systemctl restart containerd"
  - "sed -i -e 's#--container-runtime\ docker#--container-runtime\ remote --container-runtime-endpoint\ unix:///run/containerd/containerd.sock#' /etc/systemd/system/kubelet.service"
  - "systemctl daemon-reload"
  - "systemctl restart kubelet.service"
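
As a rough illustration of what the `sed` step in that list does, the sketch below runs an equivalent substitution against a temporary file rather than the real unit. The helper name `use_containerd_runtime` and the two sample lines are assumptions for the demo; the flags and socket path are the ones from the command above. It assumes GNU `sed` (for `-i` without a suffix).

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the docker runtime flag in a kubelet unit file
# so that kubelet talks to containerd over its CRI socket.
use_containerd_runtime() {
  local unit_file="$1"
  sed -i -e 's#--container-runtime docker#--container-runtime remote --container-runtime-endpoint unix:///run/containerd/containerd.sock#' "$unit_file"
}

# Demo on a temporary file holding two sample ExecStart lines.
unit_copy="$(mktemp)"
printf '    --container-runtime docker \\\n    --network-plugin cni\n' > "$unit_copy"
use_containerd_runtime "$unit_copy"
cat "$unit_copy"
```

Because this replaces the docker flag instead of appending a second runtime flag, running it twice is harmless: the pattern no longer matches after the first pass.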

% kubectl get node -o wide
NAME                                               STATUS   ROLES    AGE   VERSION              INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-11-153-145.ap-northeast-1.compute.internal   Ready    <none>   23m   v1.19.6-eks-49a6c0   10.11.153.145   <none>        Amazon Linux 2   5.4.117-58.216.amzn2.x86_64   containerd://1.4.1
ip-10-11-184-250.ap-northeast-1.compute.internal   Ready    <none>   16m   v1.19.6-eks-49a6c0   10.11.184.250   <none>        Amazon Linux 2   5.4.117-58.216.amzn2.x86_64   containerd://1.4.1
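
The same check can be scripted instead of read off the CONTAINER-RUNTIME column. This is a sketch under assumptions: the helper name `nodes_not_on_containerd` is made up, the sample node names and docker version are invented test data, and the commented `kubectl` call needs access to a live cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "<node><TAB><runtime>" lines on stdin and print
# the nodes whose runtime does not start with "containerd://".
nodes_not_on_containerd() {
  awk -F '\t' '$2 !~ /^containerd:\/\// { print $1 }'
}

# Intended use against a live cluster (not run here):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' \
#     | nodes_not_on_containerd

# Offline demo with invented sample data.
sample_result="$(printf 'node-a\tcontainerd://1.4.1\nnode-b\tdocker://19.3.13\n' | nodes_not_on_containerd)"
echo "$sample_result"
```

An empty result means every listed node reports containerd.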

The configuration file of kubelet service has changed since 0.47.0.
FYI: https://github.com/weaveworks/eksctl/releases/tag/0.47.0

Callisto13 · 2021-05-24T16:13:31Z

Thanks @cw-sakamoto !

More general info: containerd will be the default runtime in EKS from 1.21, which should be available in July.

cw-sakamoto · 2021-06-06T02:52:25Z

https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd-systemd

I also figured out a set of preBootstrapCommands that works by changing the cgroup driver to systemd.
I install crictl as well, as it is useful.

  preBootstrapCommands:
  - 'mkdir -p /etc/containerd'
  - 'containerd config default | sed "/containerd.runtimes.runc.options/a SystemdCgroup = true" > /etc/containerd/config.toml'
  - 'systemctl restart containerd'
  - 'cat /etc/kubernetes/kubelet/kubelet-config.json | jq ".cgroupDriver |= \"systemd\"" > /tmp/kubelet-config.json'
  - 'mv /tmp/kubelet-config.json /etc/kubernetes/kubelet/kubelet-config.json'
  - 'sed -i -e "s#--container-runtime\ docker#--container-runtime\ remote --container-runtime-endpoint\ unix:///run/containerd/containerd.sock#" /etc/systemd/system/kubelet.service'
  - 'systemctl daemon-reload'
  - 'systemctl restart kubelet.service'
  - 'VERSION="v1.20.0";wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-$VERSION-linux-amd64.tar.gz -O /tmp/crictl-linux-amd64.tar.gz'
  - 'tar zxvf /tmp/crictl-linux-amd64.tar.gz -C /usr/local/bin'
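
To show what the `sed` append in the containerd step does, here is a minimal sketch that runs the same expression over a small TOML fragment on stdin. The helper name `enable_systemd_cgroup` and the sample fragment are assumptions for illustration, not the actual output of `containerd config default`, and the one-line `a` form assumes GNU `sed`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add "SystemdCgroup = true" right after the runc options
# table header in a containerd config read from stdin (GNU sed syntax).
enable_systemd_cgroup() {
  sed '/containerd.runtimes.runc.options/a SystemdCgroup = true'
}

# Sample fragment shaped like the relevant part of a containerd config.
sample_toml='[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]'
patched_toml="$(printf '%s\n' "$sample_toml" | enable_systemd_cgroup)"
echo "$patched_toml"
```

One thing worth checking before reusing this: if the default config of your containerd version already contains a `SystemdCgroup` line under that header, appending a second one would probably leave the file invalid, so editing the existing value would be the safer route.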

github-actions · 2021-07-07T01:46:22Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2021-07-13T01:47:25Z

This issue was closed because it has been stalled for 5 days with no activity.

RobertLucian added the kind/help Request for help label Apr 13, 2021

RobertLucian mentioned this issue Apr 14, 2021

Allow the specification of backup image registry hosts cortexlabs/cortex#1995

Open

rothgar mentioned this issue Apr 27, 2021

Container runtime bootstrap awslabs/amazon-eks-ami#656

Closed

github-actions bot added the stale label May 22, 2021

Callisto13 removed the stale label May 24, 2021

github-actions bot added the stale label Jul 7, 2021

github-actions bot closed this as completed Jul 13, 2021
