Slower image builds because of an inconsistent containerd UUID in buildkit #412
Comments
I think I figured this out. This is actually a pretty interesting issue. I went down the rabbit hole of systemd service ordering, but no matter what I did, nothing worked except restarting buildkit after boot was complete. Turns out, that's because we restart containerd manually in our provisioning script that makes containerd use the persistent data disk. This restart changes the containerd server UUID. The issue is that we also need to restart buildkit.
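To make the ordering concrete, here is a minimal sketch of restarting containerd first and BuildKit second, so that BuildKit reconnects and picks up the new server UUID. The unit names `containerd.service` and `buildkit.service` and the helper `restart_in_order` are assumptions for illustration, not taken from Finch's provisioning script.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[List[str]], None]


def _systemctl(cmd: List[str]) -> None:
    # Run the command and raise if it exits with a non-zero status.
    subprocess.run(cmd, check=True)


def restart_in_order(
    # Assumed unit names; adjust to whatever the VM actually uses.
    units: Sequence[str] = ("containerd.service", "buildkit.service"),
    run: Runner = _systemctl,
) -> List[List[str]]:
    """Restart each unit in the given order and return the commands issued."""
    issued: List[List[str]] = []
    for unit in units:
        cmd = ["systemctl", "restart", unit]
        run(cmd)
        issued.append(cmd)
    return issued
```

Passing a different `run` callable (for example, a list's `append`) lets the ordering be checked without touching systemd.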
Issue #, if available: Fixes #412

*Description of changes:*
- Previously, only containerd was restarted after configuring it to use the data on the persistent disk. This changes the UUID of the server worker. BuildKit also needs to be restarted to use the proper UUID. See issue for why this is important.

*Testing done:*
- Local testing

- [x] I've reviewed the guidance in CONTRIBUTING.md

#### License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
…#767)

Issue #, if available: re-verified #412

- Through extensive e2e test debugging, I noticed that the soci and stargz snapshotters weren't persisting data as expected. After debugging, I found some context in these two PRs:
  - awslabs/soci-snapshotter#881
  - containerd/stargz-snapshotter#1526

Unfortunately, neither of them is deployed yet, so I've implemented a hacky workaround for now. After this change, an image/container can be pulled/run, the VM can be restarted, and then the container can be started again.

*Description of changes:*
- Redo how BuildKit/Stargz/SOCI are related to containerd using [systemd's `PartOf`](https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#PartOf=)
  - this ensures that all of these services are restarted when containerd is restarted; the lack of this has caused errors in the past
- Create some missing directories that might throw errors in cloud-init
- Ensure that `SIGTERM` is used to kill the snapshotter services for now

*Testing done:*
- manual testing

- [x] I've reviewed the guidance in CONTRIBUTING.md

#### License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Describe the bug
Buildkit supports multiple output types. By default in Finch, images get built using buildkit's "docker" output type even though nerdctl's default is the "image" type.
When using the "docker" type, Buildkit will create a tarball (in Docker image format) and then nerdctl will load that tarball into the containerd image store. Removing the tarball steps will provide a significant performance improvement.
vs
The reason the "docker" type is chosen is that nerdctl falls back to "docker" if the containerd buildkit worker's parameters do not match. By default in Finch, that is the case: the buildkit worker's containerd UUID label does not match containerd's UUID.
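The fallback described above can be sketched roughly as follows. This is an illustration of the decision only, not nerdctl's actual code: the function `choose_output_type` is hypothetical, and the label key is an assumption that should be checked against the BuildKit version in use.

```python
from typing import Mapping

# Assumed key of the BuildKit worker label that records containerd's UUID.
CONTAINERD_UUID_LABEL = "org.mobyproject.buildkit.worker.containerd.uuid"


def choose_output_type(worker_labels: Mapping[str, str], containerd_uuid: str) -> str:
    """Pick "image" when the worker's UUID label matches containerd, else "docker"."""
    if worker_labels.get(CONTAINERD_UUID_LABEL) == containerd_uuid:
        # The worker writes to the same containerd instance, so the image
        # can go straight into the image store.
        return "image"
    # Stale or missing label: fall back to the tarball-based "docker" output.
    return "docker"
```

With a stale label left over from before the containerd restart, the comparison fails and the slower "docker" path is taken, which matches the behavior reported here.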
Steps to reproduce
After getting a shell into the Finch VM:
After a quick restart of the buildkit daemon, it fixes itself and the UUIDs match:
And we can build to the "image" type by default! :)
But if we restart the Finch VM (finch vm stop / finch vm start), it goes back to the wrong UUID again and needs a buildkit daemon restart!!! :( I wonder if it's a start-order thing, but I'm unable to find the part of the code in Finch / Lima / Buildkit that could control that.