Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "Builder: oci-import" support #61

Merged
merged 6 commits into from
Dec 17, 2022

Conversation

tianon
Copy link
Member

@tianon tianon commented Dec 3, 2022

Closes #51

In the case of base images (debian, alpine, ubuntu, etc), using a Dockerfile as our method of ingestion doesn't really buy us very much. It made sense at the time it was implemented ("all Dockerfile, all the time"), but at this point they're all some variation on FROM scratch \n ADD foo.tar.xz / \n CMD ["/bin/some-shell"], and cannot reasonably be "rebuilt" when their base image changes (which is one of the key functions of the official images) since they are the base images in question.

Functionally, consuming a tarball in this way isn't that much different from consuming a raw tarball that's part of, say, an OCI image layout (https://github.com/opencontainers/image-spec/blob/v1.0.2/image-layout.md) -- it's some tarball plus some metadata about what to do with it.

For less trivial images, there's a significant difference (and I'm not proposing to use this for anything beyond simple one-layer base images), but for a single layer this would be basically identical.

As a more specific use case, the Debian rootfs.tar.xz files are currently 100% reproducible. Unfortunately, some of that gets lost when it gets imported into Docker, and thus it takes some additional effort to get from the Docker-generated rootfs back to the original debuerreotype-generated file.

This adds the ability to consume an OCI image directly, to go even further and have a 100% fully reproducible image digest as well, which makes it easier to trace a given published image back to the reproducible source generated by the upstream tooling (especially if a given image is also pushed by the maintainer elsewhere).

Here's an example oci-debian file I was using for testing this:

Maintainers: Foo (@bar)
GitRepo: https://github.com/tianon/docker-debian-artifacts.git
GitFetch: refs/heads/oci-arm32v5
Architectures: arm32v5
GitCommit: d6ac440e7760b6b16e3d3da6f2b56736b9c10065
Builder: oci-import
File: index.json

Tags: bullseye, bullseye-20221114, 11.5, 11, latest
Directory: bullseye/oci

Tags: bullseye-slim, bullseye-20221114-slim, 11.5-slim, 11-slim
Directory: bullseye/slim/oci

@tianon tianon force-pushed the oci-import branch 2 times, most recently from 5d57b56 to e99238e Compare December 6, 2022 19:13
@tianon
Copy link
Member Author

tianon commented Dec 6, 2022

A file the shows off how I discovered that symlinks weren't working:

Maintainers: Foo (@bar)
GitRepo: https://github.com/tianon/docker-debian-artifacts.git
GitFetch: refs/heads/oci-arm32v5
Architectures: arm32v5
GitCommit: f9093e68545e964f67bb3b516017d8ec2033ec71
Builder: oci-import
File: index.json

Tags: bullseye, bullseye-20221114, 11.5, 11, latest
Directory: bullseye/oci

(everything under https://github.com/tianon/docker-debian-artifacts/tree/f9093e68545e964f67bb3b516017d8ec2033ec71/bullseye/oci/blobs/sha256 is symlinks to more usefully-named files so that browsing the directory via GitHub is more user friendly)

Comment on lines 125 to 134

for _, tag := range r.Tags(namespace, uniq, entry) {
fmt.Printf("Tagging %s\n", tag)
if !dryRun {
err := dockerTag(cacheTag, tag)
if err != nil {
return cli.NewMultiError(fmt.Errorf(`failed tagging %q as %q`, cacheTag, tag), err)
}
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this part of docker build has the side effect that if we skip docker build (because bashbrew/cache:xxx exists and is up-to-date) it might not get all the tags it's supposed to have. 😩

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see three primary options here:

  1. revert this part of the change and go back to docker build + docker tag in a loop
  2. rely on docker build cache exclusively
    • noting that we added this indirection specifically because that cache was horrifyingly slow at the large scale our builders typically run
  3. when our bashbrew cache exists already, only then do the docker tag loop
    • on Windows, it would be faster and more reliable to docker inspect each one before we docker tag it (and docker tag often fails, so it'd be useful to have it auto-retry once or twice, but only on Windows 😭)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we also could:

  1. do a basic docker build that we pass just FROM bashbrew/cache:xxx to abuse docker build as a multi-docker tag 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(To expound, a legit case where this gets hit is that we push a build, it works, it finishes, and then we add a new tag or change a tag and this now will fail pushing.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assumption: docker build -t '' -t '' ... is faster than multiple docker tag, especially for Windows.

I think using docker build to get multiple tags at once is nice. It could speed up builds (and maybe help windows fail less?). I'm not sure about using a single line FROM Dockerfile; it feels bad, like something that could be inadvertently broken by buildkit in the future.

As far as number two, we know that classic docker build cache is slow, so we can revisit this possibility once we use buildkit by default and we can try relying on its cache (I'll start testing it).

On windows we should do the inspect to see if that specific tag is already correct. 🙈

So I guess my conclusion is to probably use the hacky single line FROM Dockerfile to do a multi-tag. If it breaks in the future we can hopefully just rely on buildkit build cache (or maybe convince docker tag to take multiple targets).

@codecov-commenter
Copy link

codecov-commenter commented Dec 10, 2022

Codecov Report

Merging #61 (449eb48) into master (18db6c5) will decrease coverage by 7.66%.
The diff coverage is 23.15%.

@@            Coverage Diff             @@
##           master      #61      +/-   ##
==========================================
- Coverage   80.77%   73.10%   -7.67%     
==========================================
  Files           6        7       +1     
  Lines         619      714      +95     
==========================================
+ Hits          500      522      +22     
- Misses         94      162      +68     
- Partials       25       30       +5     
Impacted Files Coverage Δ
pkg/gitfs/fs.go 23.15% <23.15%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tianon tianon force-pushed the oci-import branch 2 times, most recently from 035f938 to b2b8385 Compare December 12, 2022 23:16
@tianon
Copy link
Member Author

tianon commented Dec 12, 2022

(last push was just a rebase)

cmd/bashbrew/docker.go Outdated Show resolved Hide resolved
In the case of base images (`debian`, `alpine`, `ubuntu`, etc), using a `Dockerfile` as our method of ingestion doesn't really buy us very much.  It made sense at the time it was implemented ("all `Dockerfile`, all the time"), but at this point they're all some variation on `FROM scratch \n ADD foo.tar.xz / \n CMD ["/bin/some-shell"]`, and cannot reasonably be "rebuilt" when their base image changes (which is one of the key functions of the official images) since they _are_ the base images in question.

Functionally, consuming a tarball in this way isn't _that_ much different from consuming a raw tarball that's part of, say, an OCI image layout (https://github.com/opencontainers/image-spec/blob/v1.0.2/image-layout.md) -- it's some tarball plus some metadata about what to do with it.

For less trivial images, there's a significant difference (and I'm not proposing to use this for anything beyond simple one-layer base images), but for a single layer this would be basically identical.

As a more specific use case, the Debian `rootfs.tar.xz` files are currently [100% reproducible](https://github.com/debuerreotype/debuerreotype).  Unfortunately, some of that gets lost when it gets imported into Docker, and thus it takes some additional effort to get from the Docker-generated rootfs back to the original debuerreotype-generated file.

This adds the ability to consume an OCI image directly, to go even further and have a 100% fully reproducible image digest as well, which makes it easier to trace a given published image back to the reproducible source generated by the upstream tooling (especially if a given image is also pushed by the maintainer elsewhere).

Here's an example `oci-debian` file I was using for testing this:

    Maintainers: Foo (@bar)
    GitRepo: https://github.com/tianon/docker-debian-artifacts.git
    GitFetch: refs/heads/oci-arm32v5
    Architectures: arm32v5
    GitCommit: d6ac440e7760b6b16e3d3da6f2b56736b9c10065
    Builder: oci-import
    File: index.json

    Tags: bullseye, bullseye-20221114, 11.5, 11, latest
    Directory: bullseye/oci

    Tags: bullseye-slim, bullseye-20221114-slim, 11.5-slim, 11-slim
    Directory: bullseye/slim/oci
Comment on lines +61 to +63
if strings.HasPrefix(target, "../") {
return "", fmt.Errorf("unsupported: %q is a relative symlink outside the tree (%q)", f.Name, target)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably should be above the path.Join? And maybe a check on path.Clean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be below the path.Join because a relative symlink to a directory outside of its directory but still within the Git repository is valid and safe (it's only leaving the Git repository that isn't safe and would actually error later so this is still a best-effort check). I'll add some comments to make that more clear.

That being said, path.Clean is a good call-out - that would make this check's implementation safer. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Import OCI (base) images as-is
3 participants