Slower image builds because of an inconsistent containerd UUID in buildkit #412

ollypom · 2023-05-22T13:57:11Z

Describe the bug
Buildkit supports multiple output types. By default in Finch, images get built using buildkit's "docker" output type even though nerdctl's default is the "image" type.

When using the "docker" type, Buildkit will create a tarball (to a docker image format) and then nerdctl will load that tarball into the containerd image store. Removing the tarball steps will provide a significant performance improvements.

$ finch build -t demo .
[+] Building 2.3s (9/10)                                                                                                                                                                
<snip>
 => exporting to docker image format                                                                                                                                               0.7s
 => => exporting layers                                                                                                                                                            0.0s
 => => exporting manifest sha256:c7a206f2dc1ef877aa2d259177383d21f9a6fd5ac9b60b95b4b68f109d81e0d6                                                                                  0.0s
 => => exporting config sha256:91d3769ed7b5b3c3bed9c6d8bb6518ae2a8e9c8ad8a0af70ae98c8935582f477                                                                                    0.0s
 => => sending tarball

vs

$ finch build -t demo --output type=image,unpack=true .
[+] Building 1.5s (10/10) FINISHED                                                                                                                                                      
<snip>
 => exporting to image                                                                                                                                                             0.0s
 => => exporting layers                                                                                                                                                            0.0s
 => => exporting manifest sha256:c7a206f2dc1ef877aa2d259177383d21f9a6fd5ac9b60b95b4b68f109d81e0d6                                                                                  0.0s
 => => exporting config sha256:91d3769ed7b5b3c3bed9c6d8bb6518ae2a8e9c8ad8a0af70ae98c8935582f477                                                                                    0.0s
 => => naming to docker.io/library/demo:latest                                                                                                                                     0.0s
 => => unpacking to docker.io/library/demo:latest

The reason why the "docker" type is chosen, is because nerdctl falls back to "docker" if a contained buildkit workers parameters do not match. And by default in Finch, that is the case. The buildkit worker's containerd UUID label does not match containerds UUID..

Steps to reproduce

After getting a shell into the Finch VM:

$ sudo ctr info
{
    "server": {
        "uuid": "613bbaa1-c064-497c-8917-1169147b2c2f",
        "pid": 1225,
        "pidns": 4026531836
    }
}

$ sudo buildctl debug workers --format "{{json .}}" | jq -r '. | .[].labels'
{
  "org.mobyproject.buildkit.worker.containerd.namespace": "finch",
  "org.mobyproject.buildkit.worker.containerd.uuid": "83906608-6cc8-4a5e-b277-35ed70709712",
  "org.mobyproject.buildkit.worker.executor": "containerd",
  "org.mobyproject.buildkit.worker.hostname": "lima-finch",
  "org.mobyproject.buildkit.worker.network": "host",
  "org.mobyproject.buildkit.worker.selinux.enabled": "false",
  "org.mobyproject.buildkit.worker.snapshotter": "overlayfs"
}

After a quick restart of the buildkid daemon, it fixes itself and the UUIDs match:

$ sudo systemctl restart buildkit
$ sudo buildctl debug workers --format "{{json .}}" | jq -r '. | .[].labels'
{
  "org.mobyproject.buildkit.worker.containerd.namespace": "finch",
  "org.mobyproject.buildkit.worker.containerd.uuid": "613bbaa1-c064-497c-8917-1169147b2c2f",
  "org.mobyproject.buildkit.worker.executor": "containerd",
  "org.mobyproject.buildkit.worker.hostname": "lima-finch",
  "org.mobyproject.buildkit.worker.network": "host",
  "org.mobyproject.buildkit.worker.selinux.enabled": "false",
  "org.mobyproject.buildkit.worker.snapshotter": "overlayfs"
}

And we can build to the "image" type by default! :)

$ finch build -t test .
<snip>
 => exporting to image                                                                                                                                                                                                            0.5s
 => => exporting layers                                                                                                                                                                                                           0.4s
 => => exporting manifest sha256:6ac780490dde7e816a55a3c4e4b59117765ad41336583cd95e1bfcfd4c9561b3                                                                                                                                 0.0s
 => => exporting config sha256:fdfc1e6cb0727b08ef324fe6717e42d05649d689eb3070227af41977dac59da0                                                                                                                                   0.0s
 => => naming to docker.io/library/test:latest                                                                                                                                                                                    0.0s
 => => unpacking to docker.io/library/test:latest

But if we restart the Finch VM (finch vm stop / finch vm start), it goes back to the wrong UUID again and needs a buildkit daemon restart!!! :( I wonder if its a start order thing but I'm unable to find the part of code in Finch / Lima / Buildkit that could control that.

The text was updated successfully, but these errors were encountered:

pendo324 · 2023-06-26T23:32:51Z

I think I figured this out. This is actually a pretty interesting issue. I went down the rabbit hole of systemd service ordering, but no matter what I did, nothing worked except restarting buildkit after boot was complete.

Turns out, that's because we restart containerd manually in our provisioning script that makes containerd user the persistent data disk. This restart changes the containerd server UUID. The issue is that we also need to restart buildkit.

Issue #, if available: Fixes #412 *Description of changes:* - Previously, only containerd was restarted after configuring it to use the data on the persistent disk. This changes the UUID of the server worker. BuildKit also needs to be restarted to use the proper UUID. See issue for why this is important. *Testing done:* - Local testing - [x] I've reviewed the guidance in CONTRIBUTING.md #### License Acceptance By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Signed-off-by: Justin Alvarez <alvajus@amazon.com>

…#767) Issue #, if available: re-verified #412 - Through extensive e2e test debugging, I noticed that soci and stargz snapshotters weren't persisting data as expected. After debugging, I found some context in these two PRs: - awslabs/soci-snapshotter#881 - containerd/stargz-snapshotter#1526 Unfortunately, neither of them are deployed yet, so I've implemented a hacky workaround for now. After this change, an image/container can be pull/run, the VM can be restarted, and then the container can be re-started again. *Description of changes:* - Redo how BuildKit/Stargz/SOCI are related to containerd using [systemd's `PartOf` ](https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#PartOf=) - this ensures that all of these services are restarted when containerd is restarted, which the lack of has caused errors in the past - Create some missing directories that might throw errors in cloud-init - Ensure that `SIGTERM` is used to kill the snapshotter services for now *Testing done:* - manual testing - [x] I've reviewed the guidance in CONTRIBUTING.md #### License Acceptance By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Signed-off-by: Justin Alvarez <alvajus@amazon.com>

ollypom added the bug Something isn't working label May 22, 2023

pendo324 mentioned this issue Jun 26, 2023

fix: Restart buildkit after containerd when provisioning #461

Merged

1 task

pendo324 closed this as completed in #461 Aug 10, 2023

pendo324 mentioned this issue Jan 19, 2024

fix: properly reload persistent snapshotter data and restart services #767

Merged

1 task

bestickley mentioned this issue Mar 26, 2024

Speed up "sending tarball" step for PythonFunction CDK Construct #871

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slower image builds because of an inconsistent containerd UUID in buildkit #412

Slower image builds because of an inconsistent containerd UUID in buildkit #412

ollypom commented May 22, 2023

pendo324 commented Jun 26, 2023

Slower image builds because of an inconsistent containerd UUID in buildkit #412

Slower image builds because of an inconsistent containerd UUID in buildkit #412

Comments

ollypom commented May 22, 2023

pendo324 commented Jun 26, 2023