Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slower image builds because of an inconsistent containerd UUID in buildkit #412

Closed
ollypom opened this issue May 22, 2023 · 1 comment · Fixed by #461
Closed

Slower image builds because of an inconsistent containerd UUID in buildkit #412

ollypom opened this issue May 22, 2023 · 1 comment · Fixed by #461
Labels
bug Something isn't working

Comments

@ollypom
Copy link
Contributor

ollypom commented May 22, 2023

Describe the bug
Buildkit supports multiple output types. By default in Finch, images get built using buildkit's "docker" output type even though nerdctl's default is the "image" type.

When using the "docker" type, Buildkit will create a tarball (to a docker image format) and then nerdctl will load that tarball into the containerd image store. Removing the tarball steps will provide a significant performance improvements.

$ finch build -t demo .
[+] Building 2.3s (9/10)                                                                                                                                                                
<snip>
 => exporting to docker image format                                                                                                                                               0.7s
 => => exporting layers                                                                                                                                                            0.0s
 => => exporting manifest sha256:c7a206f2dc1ef877aa2d259177383d21f9a6fd5ac9b60b95b4b68f109d81e0d6                                                                                  0.0s
 => => exporting config sha256:91d3769ed7b5b3c3bed9c6d8bb6518ae2a8e9c8ad8a0af70ae98c8935582f477                                                                                    0.0s
 => => sending tarball   

vs

$ finch build -t demo --output type=image,unpack=true .
[+] Building 1.5s (10/10) FINISHED                                                                                                                                                      
<snip>
 => exporting to image                                                                                                                                                             0.0s
 => => exporting layers                                                                                                                                                            0.0s
 => => exporting manifest sha256:c7a206f2dc1ef877aa2d259177383d21f9a6fd5ac9b60b95b4b68f109d81e0d6                                                                                  0.0s
 => => exporting config sha256:91d3769ed7b5b3c3bed9c6d8bb6518ae2a8e9c8ad8a0af70ae98c8935582f477                                                                                    0.0s
 => => naming to docker.io/library/demo:latest                                                                                                                                     0.0s
 => => unpacking to docker.io/library/demo:latest   

The reason why the "docker" type is chosen, is because nerdctl falls back to "docker" if a contained buildkit workers parameters do not match. And by default in Finch, that is the case. The buildkit worker's containerd UUID label does not match containerds UUID..

Steps to reproduce

After getting a shell into the Finch VM:

$ sudo ctr info
{
    "server": {
        "uuid": "613bbaa1-c064-497c-8917-1169147b2c2f",
        "pid": 1225,
        "pidns": 4026531836
    }
}

$ sudo buildctl debug workers --format "{{json .}}" | jq -r '. | .[].labels'
{
  "org.mobyproject.buildkit.worker.containerd.namespace": "finch",
  "org.mobyproject.buildkit.worker.containerd.uuid": "83906608-6cc8-4a5e-b277-35ed70709712",
  "org.mobyproject.buildkit.worker.executor": "containerd",
  "org.mobyproject.buildkit.worker.hostname": "lima-finch",
  "org.mobyproject.buildkit.worker.network": "host",
  "org.mobyproject.buildkit.worker.selinux.enabled": "false",
  "org.mobyproject.buildkit.worker.snapshotter": "overlayfs"
}

After a quick restart of the buildkid daemon, it fixes itself and the UUIDs match:

$ sudo systemctl restart buildkit
$ sudo buildctl debug workers --format "{{json .}}" | jq -r '. | .[].labels'
{
  "org.mobyproject.buildkit.worker.containerd.namespace": "finch",
  "org.mobyproject.buildkit.worker.containerd.uuid": "613bbaa1-c064-497c-8917-1169147b2c2f",
  "org.mobyproject.buildkit.worker.executor": "containerd",
  "org.mobyproject.buildkit.worker.hostname": "lima-finch",
  "org.mobyproject.buildkit.worker.network": "host",
  "org.mobyproject.buildkit.worker.selinux.enabled": "false",
  "org.mobyproject.buildkit.worker.snapshotter": "overlayfs"
}

And we can build to the "image" type by default! :)

$ finch build -t test .
<snip>
 => exporting to image                                                                                                                                                                                                            0.5s
 => => exporting layers                                                                                                                                                                                                           0.4s
 => => exporting manifest sha256:6ac780490dde7e816a55a3c4e4b59117765ad41336583cd95e1bfcfd4c9561b3                                                                                                                                 0.0s
 => => exporting config sha256:fdfc1e6cb0727b08ef324fe6717e42d05649d689eb3070227af41977dac59da0                                                                                                                                   0.0s
 => => naming to docker.io/library/test:latest                                                                                                                                                                                    0.0s
 => => unpacking to docker.io/library/test:latest 

But if we restart the Finch VM (finch vm stop / finch vm start), it goes back to the wrong UUID again and needs a buildkit daemon restart!!! :( I wonder if its a start order thing but I'm unable to find the part of code in Finch / Lima / Buildkit that could control that.

@ollypom ollypom added the bug Something isn't working label May 22, 2023
@pendo324
Copy link
Member

I think I figured this out. This is actually a pretty interesting issue. I went down the rabbit hole of systemd service ordering, but no matter what I did, nothing worked except restarting buildkit after boot was complete.

Turns out, that's because we restart containerd manually in our provisioning script that makes containerd user the persistent data disk. This restart changes the containerd server UUID. The issue is that we also need to restart buildkit.

pendo324 added a commit that referenced this issue Aug 10, 2023
Issue #, if available: Fixes #412 

*Description of changes:*
- Previously, only containerd was restarted after configuring it to use
the data on the persistent disk. This changes the UUID of the server
worker. BuildKit also needs to be restarted to use the proper UUID. See
issue for why this is important.

*Testing done:*
- Local testing


- [x] I've reviewed the guidance in CONTRIBUTING.md


#### License Acceptance

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
pendo324 added a commit that referenced this issue Jan 29, 2024
…#767)

Issue #, if available: re-verified #412
- Through extensive e2e test debugging, I noticed that soci and stargz
snapshotters weren't persisting data as expected. After debugging, I
found some context in these two PRs:
  - awslabs/soci-snapshotter#881
  - containerd/stargz-snapshotter#1526
Unfortunately, neither of them are deployed yet, so I've implemented a
hacky workaround for now. After this change, an image/container can be
pull/run, the VM can be restarted, and then the container can be
re-started again.

*Description of changes:*
- Redo how BuildKit/Stargz/SOCI are related to containerd using
[systemd's `PartOf`

](https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#PartOf=)
- this ensures that all of these services are restarted when containerd
is restarted, which the lack of has caused errors in the past
- Create some missing directories that might throw errors in cloud-init
- Ensure that `SIGTERM` is used to kill the snapshotter services for now

*Testing done:*
- manual testing



- [x] I've reviewed the guidance in CONTRIBUTING.md


#### License Acceptance

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants