
cache-from working unreliably #1388

Closed
EugenMayer opened this issue Mar 2, 2020 · 20 comments · Fixed by #1568 or moby/moby#41234

Comments

@EugenMayer

EugenMayer commented Mar 2, 2020

I'm building an image foo/app with the following setup:

  • The image gets the tags :build-latest, :build-latest-master, and :build-nr-<buildID>

I build using:

CACHE_FROM=foo/app:build-latest-master
TARGET_IMG=foo/app:build-nr-$CI_BUILDNR

docker build --label "ci.buildNumber=$CI_BUILDNR" --label "vcs.branch=$VCS_BRANCH" --label "vcs.commithash=$VCS_COMMITHASH" --build-arg BUILDKIT_INLINE_CACHE=1 --cache-from $CACHE_FROM -f Dockerfile -t $TARGET_IMG .

docker tag $TARGET_IMG foo/app:build-latest-master
docker tag $TARGET_IMG foo/app:build-latest

docker push $TARGET_IMG
docker push foo/app:build-latest
docker push foo/app:build-latest-master

Oddly, every time the cache is used, the produced image does not include the right layers, i.e. not the ones that were just cached; it looks like some very old or truncated layer cache is being applied.

The second odd thing is that the cache is only used on every second build. In other words, every other build my image is correct but not cached at all, and in the builds in between the cache is used but the image is outdated.

Could it be that the layer cache is never updated in the registry at all, even though I run docker push foo/app:build-latest-master, because the inline cache is only attached to foo/app:build-nr-$CI_BUILDNR and cannot be "tagged and pushed using a different tag name"?

Let me know if you need anything else or more specifics to understand the issue.
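One way to test that theory would be to apply the cache tag during the build itself, instead of retagging afterwards, so that whatever inline cache metadata exists is produced directly under the tag that --cache-from later reads. A sketch based on the script above; the build number is illustrative, and the docker calls are guarded so the snippet is safe to run on a machine without Docker:

```shell
# Sketch: tag the cache image at build time instead of via `docker tag` later.
# Tag names and the build number (42) are illustrative.
CACHE_FROM=foo/app:build-latest-master
TARGET_IMG=foo/app:build-nr-42

if command -v docker >/dev/null 2>&1; then
  DOCKER_BUILDKIT=1 docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from "$CACHE_FROM" \
    -t "$TARGET_IMG" -t "$CACHE_FROM" \
    -f Dockerfile .
  docker push "$TARGET_IMG"
  docker push "$CACHE_FROM"
fi
```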

@thaJeztah
Member

Could you provide a minimal Dockerfile to reproduce the issue?

@EugenMayer
Author

EugenMayer commented Mar 9, 2020

Sorry for letting you wait; I'm not going to take the extra step of creating an example repo for this right now. We debugged this, and there is definitely an issue with the cache: removing BuildKit and using the legacy build fixed the cache issue right away. All we changed was:

  • remove --build-arg BUILDKIT_INLINE_CACHE=1
  • remove Build env BUILDKIT=1
  • add an extra docker pull $CACHE_FROM, since the legacy builder needs the cache image locally

And everything works as designed and expected: no cache layer issues, no fuzziness, and no "suddenly outdated container".

If it's of any interest: we are building on Azure (ubuntu-latest) using the exact steps shown above.
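For reference, the workaround described above boils down to something like the following sketch. The variable values are illustrative, DOCKER_BUILDKIT is the standard env var for toggling BuildKit, and the docker calls are guarded so the snippet can run where Docker is absent:

```shell
# Sketch of the workaround: legacy builder + explicit pull of the cache image.
CACHE_FROM=foo/app:build-latest-master
TARGET_IMG=foo/app:build-nr-42

export DOCKER_BUILDKIT=0                # disable BuildKit, use the legacy builder

if command -v docker >/dev/null 2>&1; then
  docker pull "$CACHE_FROM" || true     # legacy builder needs the cache image locally
  docker build --cache-from "$CACHE_FROM" -t "$TARGET_IMG" -f Dockerfile .
fi
```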

@epk

epk commented Mar 23, 2020

+1 on this @thaJeztah

I enabled DOCKER_BUILDKIT=1 on our image builder agents at Shopify.

We have a monorepo containing several artifacts, and all of them use this Dockerfile:

FROM golang:1.13.7

WORKDIR /go/src/github.com/Shopify/repo
COPY . .

ARG APP_SHA
RUN GOFLAGS=-mod=vendor GOOS=linux CGO_ENABLED=0 \
  go build -trimpath -ldflags "-X github.com/Shopify/repo.Version=${APP_SHA}" -o /go/bin/ ./cmd/binaryA

FROM ubuntu:bb17c28d885454c5e59f5b09dc2a0771c4d53339
RUN apt-get update && apt-get install -y --no-install-recommends \
  ca-certificates \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY --from=0 /go/bin/binaryA ./bin/binaryA
COPY binaryA/config/* ./config/

ENTRYPOINT ["/app/bin/binaryA"]

This is what happened:

image:binaryA was built by an agent foo. Then the agent foo took on building image:binaryB, which is the same except that these two lines differ:

RUN GOFLAGS=-mod=vendor GOOS=linux CGO_ENABLED=0 \
  go build -trimpath -ldflags "-X github.com/Shopify/repo.Version=${APP_SHA}" -o /go/bin/ ./cmd/binaryB
COPY --from=0 /go/bin/binaryB ./bin/binaryB
COPY binaryB/config/* ./config/

With --cache-from on, the build for binaryB did this:

[2020-03-23T21:04:11Z]  => CACHED [stage-1 3/5] WORKDIR /app                                      2.3s
[2020-03-23T21:04:11Z]  => => pulling sha256:eb0108cefe768746a45255273f9af291c560cc5bb7d095edeb8  0.3s
[2020-03-23T21:04:11Z]  => => pulling sha256:400459df132d9a62d13d3b8b05455292d3d9e981804617ffad9  0.2s
[2020-03-23T21:04:11Z]  => [stage-1 4/5] COPY --from=0 /go/bin/binaryA ./bin/binaryA          0.2s
[2020-03-23T21:04:11Z]  => [stage-1 5/5] COPY binaryA/config/* ./config/                        0.2s

This caused a minor incident; I had to disable --cache-from to fix it, but could not look into it further.

@EugenMayer
Author

@epk I disabled BuildKit and am using docker pull + --cache-from. This is slower, since the BuildKit build is usually faster, but not using --cache-from at all is a far worse build-speed hit.

@epk

epk commented Mar 24, 2020

> @epk I disabled BuildKit and am using docker pull + --cache-from. This is slower, since the BuildKit build is usually faster, but not using --cache-from at all is a far worse build-speed hit.

In our case the multi-stage Dockerfiles don't really benefit from --cache-from, as the first stage needs to be built every time and the second image is a common base image already present on most agents.

@tonistiigi
Member

@epk I don't understand your example. This seems exactly as expected: both images use the same code to install ca-certificates, so they will share the cache for it. The cache is only shared up to the ca-certificates installation, and COPY is invoked again because the binary differs between the images.

@epk

epk commented Mar 24, 2020

> @epk I don't understand your example. This seems exactly as expected: both images use the same code to install ca-certificates, so they will share the cache for it. The cache is only shared up to the ca-certificates installation, and COPY is invoked again because the binary differs between the images.

The build for binaryB is copying binaryA:

[2020-03-23T21:04:11Z]  => [stage-1 4/5] COPY --from=0 /go/bin/binaryA ./bin/binaryA          0.2s
[2020-03-23T21:04:11Z]  => [stage-1 5/5] COPY binaryA/config/* ./config/                        0.2s

What's worth pointing out here is that these are static Dockerfiles, except for the APP_SHA arg.

@tonistiigi
Member

tonistiigi commented Mar 24, 2020

@epk hmm, this doesn't make much sense to me and needs a runnable reproduction. The remote cache was not applied to the COPY --from=0 /go/bin/binaryA ./bin/binaryA line. It executed again, and if binaryA didn't even exist in your build as you claim, it should have just failed.

A string like COPY --from=0 /go/bin/binaryA ./bin/binaryA isn't even stored in the remote cache, so it definitely had to come from the fresh build.

@tonistiigi
Member

@epk How do I run it? I think the issue you are hitting is #1368, which is unrelated to --cache-from.

@epk

epk commented Mar 30, 2020

> @epk How do I run it? I think the issue you are hitting is #1368, which is unrelated to --cache-from.

You are probably right about #1368; I encountered yet another repo where using BuildKit + --cache-from was resulting in missing files.

I will try and get a reproduction.

@thaJeztah changed the title from "cache-from working unrelyably" to "cache-from working unreliably" Mar 30, 2020
@gaganpreet

I spent quite some time trying to figure out why my BuildKit images didn't have the files I copied, and I reproduced the issue here: https://github.com/gaganpreet/buildkit-wrong-image/. It does not seem related to #1368, since only one Docker image is being built here.

There's a GitHub Actions workflow in the repository which runs test.py within the built Docker image.

First run (no cache):

#4 importing cache manifest from gcr.io/***/buildkit-test:m...
#4 ERROR: gcr.io/***/buildkit-test:master not found

Successfully executes test.py at the end.

Second run (with cache):

#12 importing cache manifest from gcr.io/***/buildkit-test:m...
#12 DONE 1.6s

The cached build fails at the last step:

python: can't open file 'test.py': [Errno 2] No such file or directory
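The two runs can be condensed into a sketch like the one below. The image name is a placeholder (the real one is redacted in the logs above; the actual repro lives in the linked repository), and the docker calls are guarded so the snippet is harmless without Docker:

```shell
# Sketch of the two-run reproduction. IMG is a hypothetical placeholder name.
IMG=registry.example.com/buildkit-test:master

if command -v docker >/dev/null 2>&1; then
  # Run 1: cache import fails (tag not in the registry yet), image is correct.
  DOCKER_BUILDKIT=1 docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 --cache-from "$IMG" -t "$IMG" .
  docker push "$IMG"

  # Run 2: cache import succeeds, but the resulting image is missing test.py.
  DOCKER_BUILDKIT=1 docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 --cache-from "$IMG" -t "$IMG" .
  docker run --rm "$IMG" python test.py
fi
```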

@AshDevFr

AshDevFr commented Jul 14, 2020

I have also been encountering the issue where --cache-from causes missing files.

In the CI
I build an image (image_1) using BuildKit without any cache, then push it to the registry. It contains all the files.

If I rerun the same CI job, it finds the previous image and rebuilds the same image using image_1 as a cache (with --cache-from image_1).
It then pushes the new image (image_2), and I only get Layer already exists, which is expected since it's only using cache.

On my machine
When I download both images, image_1 is 2.2 GB, for example, and image_2 is 1.7 GB.
When I open both images, I can see that image_1, built without any cache, contains all the files. image_2, built using image_1 as a cache, is missing a lot of files that were added to the image either by a COPY statement or with a curl.

Example of the size after each build:
Image 1 (no cache)

digest: sha256:4a5274b2454671...79d07e1545798dfcf size: 7652
 $ docker image inspect --format='{{.Size}}' "${IMAGE_NAME}"
 2214346802
 $ docker image inspect --format='{{.VirtualSize}}' "${IMAGE_NAME}"
 2214346802
 $ docker image inspect --format='{{.RootFS.Layers}}' "${IMAGE_NAME}" | wc -w
 35

Image 2 (cache from image_1)

digest: sha256:fc97cc769a1bda4...467df9d9c819c994f size: 6601
 $ docker image inspect --format='{{.Size}}' "${IMAGE_NAME}"
 1833746632
 $ docker image inspect --format='{{.VirtualSize}}' "${IMAGE_NAME}"
 1833746632
 $ docker image inspect --format='{{.RootFS.Layers}}' "${IMAGE_NAME}" | wc -w
 30
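To see exactly which layers the cached build dropped, the two layer digest lists can be compared. The diff_layers helper below is plain shell and works on any two digest files; the docker commands that would produce them are guarded, and the image names follow the example above:

```shell
# diff_layers prints digests present in the first list but missing from the second.
diff_layers() {
  sort "$1" > "$1.sorted"
  sort "$2" > "$2.sorted"
  comm -23 "$1.sorted" "$2.sorted"
}

if command -v docker >/dev/null 2>&1; then
  # Dump one layer digest per line for each image, then compare.
  docker image inspect \
    --format '{{range .RootFS.Layers}}{{println .}}{{end}}' image_1 > /tmp/layers_1
  docker image inspect \
    --format '{{range .RootFS.Layers}}{{println .}}{{end}}' image_2 > /tmp/layers_2
  diff_layers /tmp/layers_1 /tmp/layers_2   # layers lost in the cached build
fi
```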

@tonistiigi
Member

@gaganpreet can you switch your repro to a public repository? I can't see where the cache is being loaded from or verify its contents.

@gaganpreet

gaganpreet commented Jul 15, 2020

@tonistiigi I now reproduced the issue in a public repository on Dockerhub.

I also added a bash script in my code repository, detailing the exact set of steps I ran to produce the two images.

I suspect the issue has something to do with the build args, since I tried to reproduce it by reducing the steps in the Dockerfile, but wasn't able to.

Edit: I also redirected the build output while running the script to a file.

@tonistiigi
Member

Fix in #1568; please verify if you have hit this.

@AshDevFr

Hi @tonistiigi, I don't know if you still need more examples, but I was able to reproduce our issue in this repo: https://gitlab.com/AshDevFr/buildkit_exp

Really simple example with fake files in it.

This pipeline ran when no cache existed yet: https://gitlab.com/AshDevFr/buildkit_exp/-/pipelines/167184239
This one is the exact same code, but the cache exists from the previous pipeline: https://gitlab.com/AshDevFr/buildkit_exp/-/pipelines/167184782

@tonistiigi
Member

@AshDevFr Please check with the patch above and report back if it didn't fix the issue. The fix is based on the repro from @gaganpreet.

thaJeztah added a commit to thaJeztah/docker that referenced this issue Jul 16, 2020
full diff: moby/buildkit@dc6afa0...4cb720e

- contenthash: ignore system and security xattrs in calculation
    - fixes moby/buildkit#1330 COPY cache not re-used depending on SELinux environment
    - fixes moby#39003 (comment)
- contenthash: allow security.capability in cache checksum
- inline cache: fix handling of duplicate blobs
    - fixes moby/buildkit#1388 cache-from working unreliably

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Jul 17, 2020
full diff: moby/buildkit@dc6afa0...4cb720e

- contenthash: ignore system and security xattrs in calculation
    - fixes moby/buildkit#1330 COPY cache not re-used depending on SELinux environment
    - fixes moby/moby#39003 (comment)
- contenthash: allow security.capability in cache checksum
- inline cache: fix handling of duplicate blobs
    - fixes moby/buildkit#1388 cache-from working unreliably

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 23d47bd12eaeeb93bbc4e9e80020c811e9eb2980
Component: engine
@ypadlyak

ypadlyak commented Apr 9, 2021

@AshDevFr Have you been able to work around it, other than disabling Buildkit? We are fighting it today :(

@loyoan1

loyoan1 commented Oct 14, 2021

I use Drone CI and also encountered the problem. Is there a fix?

@thaJeztah
Member

> I use Drone CI and also encountered the problem. Is there a fix?

This specific case was fixed; if you encounter an issue, please open a new ticket instead with details and exact steps to reproduce.
