Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modifying the containerd config file at runtime is unreliable #1628

Closed
tico24 opened this issue Feb 6, 2024 · 21 comments
Closed

Modifying the containerd config file at runtime is unreliable #1628

tico24 opened this issue Feb 6, 2024 · 21 comments

Comments

@tico24
Copy link

tico24 commented Feb 6, 2024

What happened:
When providing a basic userdata configuration to the EKS node on startup, the node fails to start and join the cluster. If I remove the userdata, it works.

In versions prior to amazon-eks-node-1.29-v20240202, this userdata worked flawlessly.

What you expected to happen: I would expect breaking changes to be documented.

How to reproduce it (as minimally and precisely as possible):

Here is my userdata:

Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Transfer-Encoding: 7bit
Content-Type: text/x-shellscript
Mime-Version: 1.0

#!/bin/bash
cat /etc/containerd/config.toml | sed -E 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g' > /etc/containerd/config-foo.toml

set -ex
cat <<-EOF > /etc/profile.d/bootstrap.sh
export CONTAINERD_CONFIG_FILE=/etc/containerd/config-foo.toml
EOF
sed -i '/^set -o errexit/a\\nsource /etc/profile.d/bootstrap.sh' /etc/eks/bootstrap.sh

--//--

This is applied to an EKS node on startup. The node no longer connects to the EKS cluster.
EKS 1.29, although similar reports in this issue on EKS 1.28 - the AMI is the common demoninator here.

Anything else we need to know?:

Environment:

  • AWS Region: us-east-2
  • Instance Type(s): many variable types
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.1"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.29"
  • AMI Version: amazon-eks-node-1.29-v20240202
  • Kernel (e.g. uname -a): Linux ip-10-205-184-145.us-east-2.compute.internal 5.10.205-195.807.amzn2.aarch64 Template is missing source_ami_id in the variables section #1 SMP Tue Jan 16 18:29:00 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-086edf6f7e8d01cff"
BUILD_TIME="Fri Feb  2 16:46:37 UTC 2024"
BUILD_KERNEL="5.10.205-195.807.amzn2.aarch64"
ARCH="aarch64"

(x64 also affected)

@tico24 tico24 changed the title Can no longer add userdata since amazon-eks-node-1.29-v20240129 Can no longer add userdata since amazon-eks-node-1.29-v20240202 Feb 6, 2024
@tico24
Copy link
Author

tico24 commented Feb 6, 2024

Another user has posted some logs here: spegel-org/spegel#217 (comment)

@cartermckinnon
Copy link
Member

cartermckinnon commented Feb 6, 2024

From those logs:

> Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
> Failed to start kubelet.service: Unit not found.

That is the output from these two lines:

systemctl enable kubelet
systemctl start kubelet

Because systemctl enable kubelet succeeded, we know that /etc/systemd/system/kubelet.service exists.

So this error:

> Failed to start kubelet.service: Unit not found.

Is probably referring to one of the dependent services, my guess would be sandbox-image.service. That is created within this block:

if ! cmp -s /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml; then
sudo cp -v /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml
sudo cp -v /etc/eks/containerd/sandbox-image.service /etc/systemd/system/sandbox-image.service
sudo chown root:root /etc/systemd/system/sandbox-image.service
systemctl daemon-reload
systemctl enable containerd sandbox-image
systemctl restart sandbox-image containerd
fi

When you create your custom containerd config file, that sed isn't changing anything:

> cat /etc/containerd/config.toml | sed -E 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g' > /etc/containerd/config-foo.toml

# diff is empty
> diff /etc/containerd/config.toml /etc/containerd/config-foo.toml 

And your custom containerd config file overwrites /etc/eks/containerd-config.toml:

if [[ -n "${CONTAINERD_CONFIG_FILE}" ]]; then
sudo cp -v "${CONTAINERD_CONFIG_FILE}" /etc/eks/containerd/containerd-config.toml
fi

So this comparison will never succeed:

if ! cmp -s /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml; then

Which agrees with that log file.


The creation of the sandbox-image.service unit could probably be moved outside that block; but I think you would encounter the same issue on previous AMI releases.

Is there a specific AMI release this worked on?

@tico24
Copy link
Author

tico24 commented Feb 6, 2024

Is there a specific AMI release this worked on?

Anything prior to this one. It has been working for months without issue.

To give a specific answer, I have clusters running 1.29.0-20240117 without an issue.

@cartermckinnon
Copy link
Member

cartermckinnon commented Feb 6, 2024

Hm okay, I think I see the problem. We build the official AMI's with the cache_container_images=true option, which previously included the sandbox image. That caused issues with kubelet image garbage collection on 1.29, and it was removed. As a result, we no longer modify the default /etc/containerd/config.toml at build time: 41dfa22

So the line you're looking for (discard_unpacked_images = true) isn't there.

@cartermckinnon
Copy link
Member

I'll see what we can do to unblock your use case, but in general I don't think we can guarantee that the sed will always produce the desired result. The --containerd-config-file option is intended for the user to define their entire containerd config, it's not well-suited for injecting a single value into the final config.

@cartermckinnon cartermckinnon changed the title Can no longer add userdata since amazon-eks-node-1.29-v20240202 Modifying the containerd config file at runtime is unreliable Feb 6, 2024
@tico24
Copy link
Author

tico24 commented Feb 6, 2024

Thanks for this. I can probably inject the entire config file in the userdata if that's the recommended approach.

@danielloader
Copy link

I know there's the risk of configuration sprawl but a flag/argument that could be passed to the bootstrap script to keep the container layers (and thus reverting the behaviour to how it was before the AMI defaulted to discarding them end of last year) would be good too.

@dims
Copy link
Member

dims commented Feb 6, 2024

@cartermckinnon @danielloader @tico24 we could look at using imports option in config.toml and let folks drop their own snippets in another directory?

https://github.com/containerd/containerd/blob/main/docs/man/containerd-config.toml.5.md#complete-configuration

sed and other tools will be fragile by definition!

@phillebaba
Copy link

I think that using imports would be a great idea. Talos does a pretty good job using imports to customize Containerd configuration.

https://www.talos.dev/v1.6/talos-guides/configuration/containerd/

Importing from a directory would allow overrides to easily be written with minimal impact on existing solutions.

@cartermckinnon
Copy link
Member

flag/argument that could be passed to the bootstrap script to keep the container layers

@danielloader What’s the context here?

@danielloader
Copy link

danielloader commented Feb 6, 2024

flag/argument that could be passed to the bootstrap script to keep the container layers

@danielloader What’s the context here?

Using spegel to do intra node container image sharing. It depends on the layers being retained after pull.

spegel-org/spegel#217

@cartermckinnon
Copy link
Member

cartermckinnon commented Feb 6, 2024

Interesting, thanks!

I think #1630 would cover that case, and my understanding of the merge behavior is that whatever is defined in the imports will override what EKS writes to /etc/containerd/config.toml, which should prevent surprises if/when the EKS defaults change.

@phillebaba
Copy link

@cartermckinnon yes you are right with your assumption that imports override existing configuration. I will add this to the documentation in Spegel as soon as we get a new AMI release with #1630, that way hopefully we wont get other users raising similar issues.

@Raj-Popat
Copy link

Hey guys any-update on the #1630?

@phillebaba
Copy link

This can now be closed and #1630 is merged and part of a release.

@zaidmnsr
Copy link

I updated the config.toml using below command as eks bootstrap script is using this file as the base file.

sed -i 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g' /etc/eks/containerd/containerd-config.toml

Once you update this, you can verify this using below command from your worker node.

containerd config dump

@cartermckinnon
Copy link
Member

@zaidmnsr please check the thread above, we explicitly recommend against that approach, because it's very fragile

@zaidmnsr
Copy link

@cartermckinnon I agree with you that my approach is fragile.

The userdata provided at the top of the thread as problem statement will fail as default file in eks ami /etc/containerd/config.toml is all commented . So eventually using sed on that and explicitly setting conatinerd file (export CONTAINERD_CONFIG_FILE=/etc/containerd/config-foo.toml )will fail in joining nodes as all contents are hashed.

In the eks bootstrap script, it overwrites the file from location /etc/eks/containerd/containerd-config.toml.

So, I decided to go with that approach on directly updating /etc/eks/containerd/containerd-config.toml .

@phillebaba
Copy link

phillebaba commented Aug 13, 2024

@zaidmnsr here is an example of how you can override values without using sed. It is a better option because you are not relying on the Containerd config in the AMI to stay the same.

https://github.com/spegel-org/spegel/blob/main/docs/COMPATIBILITY.md#eks

Any toml file that you place in the directory /etc/containerd/config.d/ will be picked up as an override.

@cartermckinnon
Copy link
Member

Our docs touch on this as well: https://awslabs.github.io/amazon-eks-ami/usage/al2/#customizing-containerd-config

@vyrwu
Copy link

vyrwu commented Sep 2, 2024

As containerd patches entire dictionaries, rather than merging keys, we are also opting to go with editing the EKS template config as part of our pre-bootstrap user-data:

sed -i 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g'  /etc/eks/containerd/containerd-config.toml

While still fragile, this is the best solution we've got until merging of imported dicts is supported: containerd/containerd#5837. We cannot rely on customizing-containerd-config as there are tools (like nvidia-device-plugin) which modify the containerd configuration as well.

If merging keys is required, I found success following containerd/containerd#5837 (comment) (note that here we're patching the /etc/eks/containerd/containerd-config.toml, not /etc/containerd/config.toml).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants