[v0.13] fail to export image when using rewrite-timestamp=true #4793

Closed
emalihin opened this issue Mar 21, 2024 · 18 comments · Fixed by #5008

@emalihin commented Mar 21, 2024

Hello,

I've been running BuildKit 0.13.0-beta1 for a while as a docker-container driver, to get ECR registry cache integration and timestamp rewriting for reproducible builds. This worked well for a couple of months.

Today I tried upgrading BuildKit to the stable 0.13.0 and 0.13.1 releases with the same docker-container setup, and started encountering these errors:

...
#18 [10/10] COPY test.conf /etc/test.conf
#18 CACHED
#19 exporting to image
#19 exporting layers done
#19 rewriting layers with source-date-epoch 1 (1970-01-01 00:00:01 +0000 UTC)
------
 > exporting to image:
------
ERROR: failed to solve: content digest sha256:7b722c1070cdf5188f1f9e43b8413157f8dfb2b4fe84db3c03cb492379a42fcc: not found

For some reason, jobs that use this cached layer fail to export it.
Once I delete the cache from ECR, the first build passes, but subsequent builds fail.
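(For reference, "deleting the cache" here means removing the cache tag from ECR; roughly like this, with repository and tag names as placeholders matching the command below:

aws ecr batch-delete-image \
  --repository-name image-name \
  --image-ids imageTag=branch-name-cache
)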

Everything works again once I revert to using BuildKit 0.13.0-beta1, with new and existing caches.
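Reverting just means recreating the docker-container builder pinned to the beta image; a sketch, reusing the builder name from the command below:

docker buildx rm buildkit-6f9724f17cbb483a5f1e || true
docker buildx create \
  --name buildkit-6f9724f17cbb483a5f1e \
  --driver docker-container \
  --driver-opt image=moby/buildkit:v0.13.0-beta1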

This is the command I use:

docker buildx build \
--build-arg SOURCE_DATE_EPOCH=1 \
--builder buildkit-6f9724f17cbb483a5f1e \
--provenance=false \
--sbom=false \
--cache-from type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name-cache \
--cache-from type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:main-cache \
--cache-to mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name \
--output type=registry,rewrite-timestamp=true \
--tag 0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:tmp.4521fd86 .

Edit: this is similar to another issue, but I'm not using COPY --link.

@tonistiigi changed the title from "BuildKit 0.13.0 and 0.13.1 fail to export image after rewriting layers epoch" to "[v0.13] fail to export image when using rewrite-timestamp=true" on Mar 21, 2024
@tonistiigi added this to the v0.13.0 milestone on Mar 21, 2024
@tonistiigi modified the milestones: v0.13.0, v0.13.1 on Mar 21, 2024
@AkihiroSuda (Member)

Please attach a (minimal yet self-contained) reproducer.

@emalihin (Author)

Admittedly I cannot reproduce this in a minimal self-contained environment. Closing the issue until I figure out what's up.

@massimeddu-sj commented Mar 27, 2024

We have a very similar issue with v0.13.1, using this command:

buildctl --debug build --progress plain --frontend dockerfile.v0 \
              --local context=. --local dockerfile=clients/web/ \
              --export-cache mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1,ignore-error=true,compression=estargz,rewrite-timestamp=true \
              --import-cache type=registry,unpack=true,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:latest-cache-v0.13.1 \
              --import-cache type=registry,unpack=true,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1 \
              --opt build-arg:ARCH=amd64 \
              --opt build-arg:REGISTRY=internal-registry.ZZZZZ.net:5000/library/ \
              --opt build-arg:SOURCE_DATE_EPOCH=1707129707 \
              --opt target=deploy \
              --opt filename=./Storybook.Dockerfile \
              -o type=image,"name=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:change-c06b70bf98322eb59b6f58d86ea67ba4,XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:dev-TAG,XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:dev-TAG",push=true,compression=estargz,rewrite-timestamp=true

Stack trace:

google.golang.org/grpc.getChainUnaryInvoker.func1
13:05:20  	/src/vendor/google.golang.org/grpc/clientconn.go:519
13:05:20  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryClientInterceptor.func1
13:05:20  	/src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:110
13:05:20  github.com/moby/buildkit/client.New.filterInterceptor.func5
13:05:20  	/src/client/client.go:387
13:05:20  google.golang.org/grpc.DialContext.chainUnaryClientInterceptors.func3
13:05:20  	/src/vendor/google.golang.org/grpc/clientconn.go:507
13:05:20  google.golang.org/grpc.(*ClientConn).Invoke
13:05:20  	/src/vendor/google.golang.org/grpc/call.go:35
13:05:20  github.com/moby/buildkit/api/services/control.(*controlClient).Solve
13:05:20  	/src/api/services/control/control.pb.go:2234
13:05:20  github.com/moby/buildkit/client.(*Client).solve.func2
13:05:20  	/src/client/solve.go:274
13:05:20  golang.org/x/sync/errgroup.(*Group).Go.func1
13:05:20  	/src/vendor/golang.org/x/sync/errgroup/errgroup.go:75
13:05:20  runtime.goexit
13:05:20  	/usr/local/go/src/runtime/asm_amd64.s:1650
13:05:20  
13:05:20  github.com/moby/buildkit/client.(*Client).solve.func2
13:05:20  	/src/client/solve.go:290
13:05:20  golang.org/x/sync/errgroup.(*Group).Go.func1
13:05:20  	/src/vendor/golang.org/x/sync/errgroup/errgroup.go:75
13:05:20  runtime.goexit
13:05:20  	/usr/local/go/src/runtime/asm_amd64.s:1650

@emalihin did you find a solution for your problem?

@emalihin (Author) commented Mar 27, 2024

I've rolled back to 0.13.0-beta1 for now. Are both of your --import-cache refs already populated? I might have missed this step when trying to reproduce in an isolated environment.

@massimeddu-sj commented Mar 27, 2024

> I've rolled back to 0.13.0-beta1 for now. Are both of your --import-cache refs already populated? I might have missed this step when trying to reproduce in an isolated environment.

I see. In this specific example, latest-cache-v0.13.1 is populated while TAG-v0.13.1 is not, but I was able to reproduce the issue with both caches populated as well.

Actually, I noticed this started happening after upgrading from moby/buildkit:v0.13.0-beta3 to v0.13.0 (I also see it with moby/buildkit:v0.13.1), so I'm going to revert to v0.13.0-beta3 and see if that solves the issue.

My guess is that the cache somehow gets "poisoned": once a cache hits the issue, I'll keep seeing it until I manually delete that cache. I also build multiple images in parallel, and some images hit the issue while others don't, but I'm not able to find any pattern in the behavior.

@AkihiroSuda (Member)

Might be a regression in #4663 (v0.13.0-rc2)?

@emalihin (Author) commented Mar 27, 2024

> My guess is that the cache somehow gets "poisoned": once a cache hits the issue, I'll keep seeing it until I manually delete that cache.

In my testing of 0.13.0 and 0.13.1, the first cache write worked, but the first and subsequent cache reads failed. However, when I went back to 0.13.0-beta1, cache reads worked, even with caches written by 0.13.0 and 0.13.1.
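For what it's worth, one way to check whether the failing digest is actually referenced by the cache image is something like this (ref and digest prefix taken from my earlier command and error):

docker buildx imagetools inspect \
  0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name-cache \
  --raw | grep 7b722c1070cd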

@massimeddu-sj

> Might be a regression in #4663 (v0.13.0-rc2)?

I'm testing v0.13.0-rc1 and currently I'm not seeing the issue (but previously I noticed it could take a while to show up, so I'm not 100% sure that rc1 doesn't have this issue).

> My guess is that the cache somehow gets "poisoned": once a cache hits the issue, I'll keep seeing it until I manually delete that cache.

> In my testing of 0.13.0 and 0.13.1, the first cache write worked, but the first and subsequent cache reads failed. However, when I went back to 0.13.0-beta1, cache reads worked, even with caches written by 0.13.0 and 0.13.1.

I didn't check that. Our CI generates different cache tags when we change the BuildKit version, so I'm not able to test this scenario.

@AkihiroSuda (Member)

This PR may fix your issue too

@massimeddu-sj

> This PR may fix your issue too

Happy to test it as soon as a build is available. Thank you.

@Manbeardo

I found a way to repro it on 0.13.2. Dangling (unreferenced) stages appear to cause problems when using rewrite-timestamp=true.

Example Dockerfile:

FROM debian:bookworm as base

FROM debian:bookworm-slim

RUN echo "foo"

Might need to reopen this or file a new issue?

@AkihiroSuda reopened this on May 11, 2024
@AkihiroSuda modified the milestones: v0.13.1, v0.13.3 on May 11, 2024
@AkihiroSuda (Member)

@Manbeardo Thanks for reporting, but I can't repro the issue with BuildKit v0.13.2 on Ubuntu 24.04:

sudo buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --output type=oci,name=build-0,dest=/tmp/build-0.tar,dir=false,rewrite-timestamp=true --no-cache

@AkihiroSuda removed this from the v0.13.3 milestone on May 11, 2024
@Manbeardo commented May 13, 2024

Here's a script to more precisely repro what I've seen (using a docker-container buildx driver at v0.13.2):

#!/usr/bin/env bash
set -xeuo pipefail

cd "$(mktemp -d)"

cat >Dockerfile <<'EOF'
FROM debian:bookworm-slim AS base
FROM debian:bookworm
EOF

OUTPUT_IMAGE="<IMAGE FROM PRIVATE REGISTRY>"
CACHE_IMAGE="<DIFFERENT IMAGE FROM PRIVATE REGISTRY>"

docker buildx build \
    --pull \
    --output "name=$OUTPUT_IMAGE,push=true,rewrite-timestamp=true,type=image" \
    --cache-to "ref=$CACHE_IMAGE,image-manifest=true,mode=max,oci-mediatypes=true,type=registry" \
    --platform linux/arm64/v8,linux/amd64 \
    .

That yields this error:

------
 > exporting to image:
------
ERROR: failed to solve: failed to push [REDACTED PRIVATE IMAGE NAME]: content digest sha256:60bdaf986dbe787297bb85c9f6a28d13ea7b9608b95206ef7ce6cdea50cd5505: not found

This can probably be reduced to a more minimal repro, but this was the first config that reproduced it as I gradually added configuration from my full build pipeline. My build pipeline had a 100% failure rate when running on my AWS CI builders, but appears to fail inconsistently (race condition?) when I build on my MacBook.

Either way, removing the unused stage from the Dockerfile eliminates the error.
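In other words, a Dockerfile with the dangling stage dropped exports fine with the same flags:

cat >Dockerfile <<'EOF'
FROM debian:bookworm
EOF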

@thaJeztah (Member)

Could that be related to containerd/containerd#10187, @AkihiroSuda?

@AkihiroSuda (Member)

Minimal repro:

sudo buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --no-cache --output type=image,name=localhost:5000/foo,push=true,rewrite-timestamp=true

(Needs push=true && rewrite-timestamp=true)
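For the localhost:5000 target, a throwaway local registry works, e.g.:

docker run -d --name registry -p 5000:5000 registry:2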

I also see an occasional panic in the daemon log

WARN[2024-05-15T02:44:50+09:00] rewrite-timestamp is specified, but no source-date-epoch was found  spanID=cbbd84754ca443c5 traceID=a347fa5861b3af65b3508699ef6f1797
panic: send on closed channel

goroutine 671 [running]:
github.com/containerd/containerd/remotes/docker.(*pushWriter).setPipe(...)
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:364
github.com/containerd/containerd/remotes/docker.dockerPusher.push.func1()
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:286 +0xcc
github.com/containerd/containerd/remotes/docker.(*request).do(0xc000d5b5f0, {0x295a620, 0xc000581410})
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/resolver.go:556 +0x162
github.com/containerd/containerd/remotes/docker.(*request).doWithRetries(0xc000d5b5f0, {0x295a620, 0xc000581410}, {0x0, 0x0, 0x0})
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/resolver.go:600 +0x47
github.com/containerd/containerd/remotes/docker.dockerPusher.push.func2()
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:292 +0x57
created by github.com/containerd/containerd/remotes/docker.dockerPusher.push in goroutine 787
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:291 +0x2150

@thompson-shaun modified the milestones: v0.13.3, v0.14.0, v0.future on May 30, 2024
@thompson-shaun (Collaborator)

Any movement or changes to this @AkihiroSuda?

@AkihiroSuda (Member)

> Any movement or changes to this @AkihiroSuda?

Not yet; a PR is welcome.
