
Allow leaf-node image in dependency graph to be built/tested/published on its own #186

Closed
mthalman opened this issue May 22, 2019 · 8 comments · Fixed by #277

Comments

@mthalman
Member

mthalman commented May 22, 2019

Currently, all the images in a dependency graph must be built, tested, and published as a set. For example, this means we can't make a change to the SDK image without also building the runtime-deps image, which means the runtime-deps image will also get published even though it hasn't necessarily changed. Similarly, the SDK image can't be tested without also building the ASP.NET image because the tests require it. This prevents us from making more fine-grained releases and forces us to publish images that haven't even changed.

This should be fixed by allowing a leaf-node image to be built on its own. Any dependencies it has should be retrieved from the publicly released images in MCR.

@mthalman mthalman self-assigned this May 22, 2019
@mthalman mthalman changed the title Allow leaf-node image in dependency graph to be built/tested/published Allow leaf-node image in dependency graph to be built/tested/published on its own May 22, 2019
@mthalman
Member Author

This is not fixed. For example, you cannot build only https://github.com/dotnet/dotnet-docker/blob/master/3.0/sdk/alpine3.9/amd64/Dockerfile in an official build. It will attempt to reference the runtime image from the staging location instead of from MCR.

@mthalman mthalman reopened this Sep 23, 2019
@mthalman
Member Author

mthalman commented Nov 5, 2019

The changes needed to implement this in Image Builder require a complicated command interface. The ability to build a partial dependency graph would be better done using Docker's build cache. We'll be deferring to that feature to get this functionality instead.

@mthalman mthalman closed this as completed Nov 5, 2019
@mthalman
Member Author

Reopening this because I feel this is important to have working as part of dotnet/dotnet-docker#2122. Without these changes, a release of 3.1 would require republishing 5.0 and vice versa.

@mthalman mthalman reopened this Jul 31, 2020
@mthalman mthalman removed the triaged label Jul 31, 2020
@mthalman
Member Author

mthalman commented Jul 31, 2020

Options

Option 1: Use Image Info Data

This option makes use of the image info data that is available from prior builds to determine whether a Dockerfile needs to be built. Specifically, the Dockerfile commit SHA and base image digest values contained in this data indicate whether the output of the build would produce a different image compared to the previously built version of the image.

Example:
A build to produce a new release of .NET Core 3.1 Docker images is being executed. It starts with the runtime-deps Dockerfile, let's say runtime-deps:alpine3.12. Image Builder looks up the image info data for that Dockerfile from when it was last built and compares the Dockerfile commit SHA from the image info data to the current SHA of the Dockerfile being built. Let's say they're the same, meaning the Dockerfile has not changed since it was last built.

The next step is to check whether the base image has changed. Image Builder queries Docker Hub to determine the digest of the current alpine:3.12 image and compares it to the base image digest stored in the image info data. Let's say they're the same as well. Because both the Dockerfile commit SHA and the base image digest match what they were when the image was last built, there's no need to build runtime-deps:alpine3.12 again.

The build then moves on to the runtime Dockerfile. Since this is a servicing release of .NET Core 3.1, that Dockerfile has changed since it was last built; the commit SHAs differ, so the runtime Dockerfile is built. And so on with the other repos.
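A minimal sketch of that check in Python (the image info schema with commitSha/baseImageDigest fields and the injected helper callables are assumptions for illustration, not Image Builder's actual API):

    def needs_build(dockerfile_path, base_image, image_info,
                    get_commit_sha, get_base_digest):
        """Return True if the build would produce a different image than last time."""
        record = image_info.get(dockerfile_path)
        if record is None:
            return True  # never built before
        if get_commit_sha(dockerfile_path) != record["commitSha"]:
            return True  # the Dockerfile changed since it was last built
        if get_base_digest(base_image) != record["baseImageDigest"]:
            return True  # the base image (e.g. alpine:3.12) was updated on Docker Hub
        return False  # cache hit: the previously published image can be reused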

Pros:

  • Universal solution that can work with Linux or Windows.
  • Pulls only the images that are needed. Avoids pulling an image in cases where a diff is detected.
  • Simplicity in not needing to know whether an image has changed at queue time. We can queue up a build to build everything and only the stuff that changed would be published.
  • Independent of the builder engine being used.

Cons:

  • Complicates the build infrastructure.
  • May end up producing new layers that could have been reused from a previously built image. In some scenarios, layer-by-layer caching could allow individual layers to be reused from a cached image even if the entire cached image didn't match the one being built. But this approach doesn't have that level of granularity; it only matches the entire image as a whole.

Implementation cost: large

Option 2: Use External Cache Source Feature

External cache sources allow the builder to reuse layers that were generated from previous builds of an image stored in a registry.

Implementation cost: small

Option 2a: BuildKit builder engine

Usage:

$ export DOCKER_BUILDKIT=1
$ docker build -t myname/myapp --build-arg BUILDKIT_INLINE_CACHE=1 .
$ docker push myname/myapp
# on another machine
$ docker build --cache-from myname/myapp .

Pros:

  • Simplicity in not needing to know whether an image has changed. The builder figures that out. We can queue up a build to build everything and only the stuff that changed would be published.
  • Provides layer-by-layer caching.
  • Incremental pulling of cached layers.
  • Supports caching of layers from intermediate build stages.

Cons:

  • BuildKit is not supported on Windows: Unable to use Buildkit with Windows containers moby/buildkit#616. There has been recent activity on enabling this, however.
  • Minor: Requires cache metadata to exist on published images so we first need to publish an image that has this metadata before it can be used as a cache source.
  • Cache misses result in layers being pulled that are unused.

Option 2b: Docker builder engine

Usage:

$ docker push myname/myapp
# on another machine
$ docker build --cache-from myname/myapp .

Pros:

  • Simplicity in not needing to know whether an image has changed. The builder figures that out. We can queue up a build to build everything and only the stuff that changed would be published.
  • Provides layer-by-layer caching.
  • Enabled by default on build agents.

Cons:

  • Doesn't support Windows images.
  • Requires pulling the cache source image before building.
  • Does not cache layers from intermediate build stages. This results in wasted compute to rebuild the intermediate stages that might end up resulting in a cache hit in the final stage anyway. A workaround for this problem would be to publish the results of the intermediate stages in a private registry that could be referenced as a cache source (see the sketch after this list).
  • Cache misses result in layers being pulled that are unused.
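For illustration, a sketch of that workaround with the classic builder, driven from Python (the stage name "build", the private registry host, and the tags are hypothetical):

    import subprocess

    def docker(*args):
        subprocess.run(["docker", *args], check=True)

    # Build and publish the intermediate stage to a private registry so that
    # later builds can reference it as a cache source.
    docker("build", "--target", "build",
           "-t", "registry.contoso.com/cache/myapp:build", ".")
    docker("push", "registry.contoso.com/cache/myapp:build")

    # Subsequent builds reference both the intermediate-stage image and the
    # final image as cache sources, avoiding a rebuild of the intermediate stages.
    docker("build",
           "--cache-from", "registry.contoso.com/cache/myapp:build",
           "--cache-from", "myname/myapp",
           "-t", "myname/myapp", ".")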

Option 3: Partial Build Graph Support

This option allows builds to be executed that explicitly define which portions of the Docker image graph are to be built.

Examples:

  • A new version of .NET Core 3.1 is being released and we know that runtime-deps has not changed. In that case, we can execute a build that only builds runtime, aspnet, and sdk.
  • A new version of the .NET Core 3.1 SDK is being released with no change to the runtimes. In that case, we can execute a build that only builds sdk.

While conceptually this is a simple process, a clean implementation actually requires a significant change to the infrastructure in order to express this kind of partial graph.
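To make the idea concrete, here's a minimal sketch of resolving a partial build graph (the graph contents and the function are illustrative, not a proposed interface):

    # image -> its base image within the product graph (None = external base image)
    GRAPH = {
        "runtime-deps": None,
        "runtime": "runtime-deps",
        "aspnet": "runtime",
        "sdk": None,  # no base dependency on runtime/aspnet within the graph
    }

    def resolve(requested):
        """Split a partial build into images to build and bases to pull from MCR."""
        pull_from_mcr = {
            GRAPH[image] for image in requested
            if GRAPH[image] is not None and GRAPH[image] not in requested
        }
        return set(requested), pull_from_mcr

    # Example 1 above: runtime-deps is unchanged, so runtime, aspnet, and sdk
    # are built while their runtime-deps base is pulled from mcr.microsoft.com.
    print(resolve({"runtime", "aspnet", "sdk"}))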

Pros:

  • Avoids even attempting to build portions of the graph that aren't needed, saving on build agent usage.
  • Pulls only the images that are needed.
  • Independent of the builder engine being used.

Cons:

  • Requires knowing which portions of the graph need to be built. This is prone to human error.
  • Complicates the infrastructure to be able to express this kind of partial graph in a build request.
  • Only works for manually queued builds. Wouldn't apply very well to nightly builds as a result.

Implementation cost: large

Package Differences

In all of these options, there's one aspect that's missing: detecting changes to installed packages. One of the benefits of rebuilding runtime-deps for each release of .NET Core is that it ensures the latest package versions are available in that image. None of the options presented above can detect this without actually rebuilding the layer that installs the packages, which defeats the whole purpose being sought here.

Basically, this is an orthogonal issue that would be made worse by implementing these changes. There's a separate issue for tracking the package update problem: dotnet/dotnet-docker#1455.

Proposed Solution

I'm proposing that we go with option 1. Compared to option 3, it certainly seems like a better option because the detection of diffs is automated, it would have a cleaner UX, and the implementation cost would probably be a wash between the two options. For options 2a and 2b, the lack of Windows support is problematic. Even though progress is being made to add support to Windows, that's still a ways off because of dependencies needed in Windows containers. Option 1 seems to have the least downside.

@MichaelSimons
Member

Nice proposal write-up @mthalman. I have a couple questions/comments.

  1. I understood what you meant by the second con of the first option but others may not. The point that is not clear is that this option "caches" an entire image and does not work for individual layers. When an entire image cannot be reused, there are scenarios where the beginning layers of an image could be shared/reused.
  2. What part of the 2a option is in experimental mode? BuildKit itself is not experimental. It is just not the default builder. There is some risk switching to use BuildKit in general. I know there is not 100% parity in functionality but when I last looked at the feature disparity list, nothing jumped out that would affect our usage.
  3. Option 3 Con 1 - to me it would be useful to call out this is prone to human error.
  4. Option 3 - a con this option has relative to the others is related to overall benefit. The other options will provide general value outside of this specific issue. For instance, they will be extremely valuable to the nightly builds. The majority of the Dockerfiles and base images are not changing each time nightly is built today. Having a caching mechanism in place has the potential to have a pretty sizeable impact on build performance, both in terms of resources used and the time to build.

@mthalman
Member Author

mthalman commented Aug 3, 2020

I've clarified the cons as you pointed out. You're right about BuildKit not being experimental; I misread the documentation on that.

@MichaelSimons
Member

@mthalman - Can you provide a more detailed breakdown on the cost of option 1 and if there are options within it?

@mthalman
Member Author

mthalman commented Aug 10, 2020

Option 1 could be implemented in 3 phases. The first phase would provide the complete level of caching that we're looking to obtain from this option but it would have some inefficiency due to potential wasted time spent doing unnecessary testing. The second phase would optimize the testing to only test what is needed. A third phase would optimize the build to trim any build jobs that would result in not producing any new images.

Proposed Design

Phase 1

Build

The build command in Image Builder should look up the status of each Dockerfile it needs to build by comparing the digest of the most recent base image and the commit SHA of the Dockerfile against the values stored in the image info file. If either of those values differ, the Dockerfile is built. Otherwise, the associated image tag is pulled and tagged as if it were locally built. Subsequent dependent Dockerfiles that are built will then be based on the locally built image or the cached image that was pulled. All images are pushed to the staging repo whether they are built locally or not.

The build command also outputs an image info file. If a cache hit occurs, that Dockerfile/image is not added to the image info file. Only new images that are produced by the build are to be included in the image info file.
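A sketch of this flow, with the operations injected as callables since they are stand-ins rather than Image Builder's actual API:

    def run_build(dockerfiles, needs_build, docker_build, pull_and_tag, push_to_staging):
        """Process Dockerfiles in dependency order, reusing published images on cache hits."""
        image_info = {}  # records only the images produced by this build
        for df in dockerfiles:  # ordered so base images precede their dependents
            if needs_build(df):
                tag = docker_build(df)  # dependents resolve to this locally built tag
                image_info[df] = tag    # cache hits are intentionally omitted
            else:
                tag = pull_and_tag(df)  # pull the published image and tag it
                                        # as if it were locally built
            push_to_staging(tag)        # pushed whether built locally or not
        return image_info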

Test

This phase proposes no changes to the amount of testing that is done. This means that any images that were the result of a cache hit are still tested. This amounts to wasted testing effort since those images were tested when they were originally published. But in the interest of getting a working caching solution up and running quickly, this seems like a reasonable compromise in the short term. Phase 2 details how to resolve this testing inefficiency.

When tests are run by the build infra, the PullImages option is used, causing the tests to pull any required images. The issue is that for any cached images, it will attempt to pull from the mcr.microsoft.com location instead of the staging location because the cached image will not be included in the image info. This is problematic because there's no guarantee that the version of the image pulled from mcr.microsoft.com is the same version that exists in the staging location (i.e. the version that any dependent images were based on).

The reason this logic pulls from mcr.microsoft.com is to handle scenarios like a build that solely updates sdk:3.1-buster: the tests require the runtime and aspnet images, but those were not built by the build, so they must be pulled from mcr.microsoft.com. The difference with cached images is the Dockerfile dependency. The sdk:3.1-buster Dockerfile has no dependency on runtime/aspnet, so pulling from mcr.microsoft.com is necessary. But if a Dockerfile has a dependency on a base image that was the result of a cache hit and is not in the image info, logic needs to specially handle that case and pull from the staging repo instead of mcr.microsoft.com.
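A sketch of that pull-source decision (the names are assumptions for illustration):

    def pull_source(image, image_info, dockerfile_base_images):
        """Decide where the tests should pull an image from."""
        if image in image_info:
            return "staging"  # built by this build
        if image in dockerfile_base_images:
            # A base of a Dockerfile in this build that is absent from the image
            # info: a cache hit whose staged copy the dependents were built on,
            # so it must come from staging.
            return "staging"
        # Not a Dockerfile dependency (e.g. runtime/aspnet for an sdk-only
        # build): pull the released image from mcr.microsoft.com.
        return "mcr.microsoft.com"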

Publish

The publish stage should continue to work without any changes since it will only be processing the images that are included in the image info file.

Estimate

2-3 days

Phase 2

This phase of work is just about resolving the issue of testing images that do not need to be tested. Phase 1 allowed for previously published images to be pulled and reused but the testing stage would continue to test all images. This is inefficient because images had already been tested when they were originally published.

One of the goals in resolving the issue is to determine what needs to be tested when the test matrix is generated, in order to avoid spinning up test jobs that may result in no images being tested. To do that, the matrix generation logic requires the image info that was output by the build. The image info indicates which images were built; comparing it against the dependency graph of the Dockerfiles determines which images are actually cached versions.

By providing to the generateBuildMatrix command a set of values that map Dockerfile path prefixes to test categories, the logic can output the set of test categories that should be passed to the test script.

For example, consider a call to the generateBuildMatrix command with the following subset of parameters:

--dockerfile-test-category 'runtime-deps=src/runtime-deps/*' --dockerfile-test-category 'runtime=src/runtime/*' --dockerfile-test-category 'aspnet=src/aspnet/*' --dockerfile-test-category 'sdk=src/sdk/*'

And let's say that the only new image produced by the build was sdk. The test matrix generation uses the image info to determine that only the sdk image was built by this build. The matrix output that it generates then specifies a test-categories field indicating which categories should be passed to and executed by the test script, in this case "sdk".
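A sketch of that category derivation (the mapping mirrors the --dockerfile-test-category values above; the matching logic is an assumption):

    from fnmatch import fnmatch

    CATEGORY_MAP = {
        "runtime-deps": "src/runtime-deps/*",
        "runtime": "src/runtime/*",
        "aspnet": "src/aspnet/*",
        "sdk": "src/sdk/*",
    }

    def test_categories(built_dockerfile_paths):
        """Map the Dockerfiles built by this build to the test categories to run."""
        return {
            category
            for category, pattern in CATEGORY_MAP.items()
            if any(fnmatch(path, pattern) for path in built_dockerfile_paths)
        }

    # Only sdk was built, so only the "sdk" category is passed to the test script.
    print(test_categories(["src/sdk/5.0/focal/arm64v8/Dockerfile"]))  # {'sdk'}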

Example representation of a leg of the test matrix:

    5.0-focal:
      legName: linuxarm64v85.0-focal
      osType: linux
      architecture: arm64
      osVersions: --os-version focal
      dotnetVersion: 5.0
      osVariant: focal
      test-categories: sdk

If a test leg results in there being no test categories due to all of the images being the result of cache hits, that test leg is excluded entirely from the matrix. This prevents the execution of test jobs that end up not needing to test anything.

At this point, the test logic executes as it always has since it already supports the ability to test subsets of images based on test categories.

Estimate

2 days

Phase 3

This phase would optimize the build for cases where a particular build job would end up not producing any new images because all the images it would have produced resulted in cache hits. To do this, the same logic that the build command uses to determine whether an image needs to be built would be executed as part of the build matrix generation. When that logic is run over the entire set of Dockerfiles, any build legs that do not have any Dockerfiles to be built can simply be trimmed from the matrix. This avoids executing build jobs that would end up not building any new images. This is only intended as an optimization for trimming out such jobs; the build command would still be responsible for calculating on its own which images need to be built.
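A sketch of the trimming, assuming each build leg carries the list of Dockerfiles it would process (the leg shape is an assumption):

    def trim_build_matrix(legs, needs_build):
        """Drop build legs in which every Dockerfile would be a cache hit."""
        return [
            leg for leg in legs
            if any(needs_build(df) for df in leg["dockerfiles"])
        ]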

Estimate

1 day
