GitHub Actions cache fails with cache key: blob not found #681

Closed
jojomatik opened this issue Jul 20, 2021 · 21 comments

@jojomatik

Copying from one stage to another failed today in a GitHub Actions workflow in one of my repositories, after a minor change to a workflow file (jojomatik/blockcluster@cdeb83a) that shouldn't affect Docker at all: https://github.com/jojomatik/blockcluster/runs/3116972956

Dockerfile:62
--------------------
  60 |     RUN cd backend && npm install --only=prod
  61 |     
  62 | >>> COPY --from=builder /usr/games/blockcluster/dist dist
  63 |     COPY --from=builder /usr/games/blockcluster/backend/dist backend/dist
  64 |     COPY --from=builder /usr/games/blockcluster/.env ./
--------------------
error: failed to solve: failed to compute cache key: blob not found
Error: buildx failed with: error: failed to solve: failed to compute cache key: blob not found

I'm not sure where exactly this error message comes from, and I wasn't able to pin it down with a Google search either, so I'm not sure whether this is a buildx issue or an issue with one of its dependencies.

This is the relevant section in the workflow file: https://github.com/jojomatik/blockcluster/blob/cdeb83a457f28692cc162e2453a8fe4a36be4ec8/.github/workflows/publish.yml#L92-L111

I've since tried to update it to use the non-rc versions (jojomatik/blockcluster@d4488ba), but it has the same issues. Yesterday those workflows worked like a charm.

If I opened this issue in the wrong repository, I'm sorry. I don't yet fully understand how all these Docker parts work together :)

I'd be grateful if you could find some time to look into this. Thank you! If you need additional information, feel free to ask. I couldn't find a PR template though.

@jojomatik
Author

It seems to me like the error is coming from buildkit, at least that is the only place where I can find this snippet:
https://github.com/moby/buildkit/blob/2a4577efabed1b4404e2884ef56873b8c0a42e95/cache/remotecache/gha/gha.go#L319

	ce, err := p.ci.cache.Load(ctx, key)
	if err != nil {
		return nil, err
	}
	if ce == nil {
		return nil, errors.Errorf("blob not found")
	}

Should I open an issue in that repository?

@tonistiigi
Member

@crazy-max If you're looking into this, the question is why https://github.com/moby/buildkit/pull/1498/files doesn't work properly in this case and fall back gracefully (assuming the blob is legit and there is no other corruption). I guess it might be because Load returns a lazy remote reference and the actual data is pulled later. In that case, we should verify the blob during load already, and ReaderAt would only invoke the download.

@crazy-max
Member

@tonistiigi

In that case, we should verify the blob during load already

Agree, I also think it might be a rough case where @jojomatik is using the actions/cache in other jobs in his workflow that could GC some blobs.

@jojomatik Can you enable debug in your workflow for the setup-buildx-action step?

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v1
  with:
    version: v0.6.0
    buildkitd-flags: --debug

jojomatik added a commit to jojomatik/blockcluster that referenced this issue Jul 21, 2021

Enable debug output to provide more information for cache restore issue (docker/buildx#681).
@jojomatik
Author

jojomatik commented Jul 21, 2021

@jojomatik Can you enable debug in your workflow for the setup-buildx-action step?

Yes, here you go: jojomatik/blockcluster@2a26e2a (workflow).

Edit: I don't see any difference. Not sure if it worked?

I also think it might be a rough case where @jojomatik is using the actions/cache in other jobs in his workflow that could GC some blobs.

I checked the docs to see whether I'm exceeding some sort of storage quota, and I believe I was well below it, but it's entirely possible that either

  • I cache more than I believe and therefore exceed some sort of quota, or
  • my jobs somehow interfere with each other (but I don't think so, as it worked pretty well in the past).

But I think in either way it shouldn't just fail but rather rebuild that stage, don't you think?

On another note: the last steps (14 - 16) of the builder stage shouldn't actually be read from the cache, as they are rebuilt, if I understand the logic correctly:

#34 [builder 14/16] RUN npm run build
#34 0.282 
#34 0.282 > blockcluster-frontend@0.8.0 build
#34 0.282 > vue-cli-service build
#34 0.282 
[...]
#34 42.52 
#34 42.52   Images and other types of assets omitted.
#34 42.52 
#34 42.52  DONE  Build complete. The dist directory is ready to be deployed.
#34 42.52  INFO  Check out deployment instructions at https://cli.vuejs.org/guide/deployment.html
#34 42.52       
#34 DONE 43.1s

Not sure if these might be connected but I remember that I had one failed workflow in the past. It failed as I had already merged the PR and deleted the branch before the workflows were complete. Maybe that one corrupted the cache somehow?

@crazy-max
Member

crazy-max commented Jul 21, 2021

@jojomatik

Edit: I don't see any difference. Not sure if it worked?

See https://github.com/docker/setup-buildx-action#buildkit-container-logs

But I think in either way it shouldn't just fail but rather rebuild that stage, don't you think?

Yes, errors should be ignored by the cache loader. That's an issue that needs to be fixed in BuildKit, as @tonistiigi said.

I cache more than I believe and therefore exceed some sort of quota, or

Can you use a fresh scope and let me know?

          cache-from: type=gha,scope=blockcluster
          cache-to: type=gha,scope=blockcluster

jojomatik added a commit to jojomatik/blockcluster that referenced this issue Jul 21, 2021

Use blockcluster scope for caching docker layers to check if it helps with the caching issue docker/buildx#681.
@jojomatik
Author

See https://github.com/docker/setup-buildx-action#buildkit-container-logs

Thanks :)

Can you use a fresh scope and let me know?

Yes: jojomatik/blockcluster@f180097 (workflow). That one worked fine, as no cache is available. Should I try to rerun it to see whether the cache is loaded on a second run?

In case it helps: I just had a look, the other caches for frontend and backend CI seem to be around 160MB (2 * 80MB as I cache node_modules and ~/.npm as a fallback in case the dependencies change). The docker cache was around 800MB before I switched to the new cache backend (https://github.com/jojomatik/blockcluster/runs/3093472329). Not sure where I can find the new size.

@crazy-max
Member

@jojomatik

Should I try to rerun it to see whether the cache is loaded on a second run?

Yes please.

@jojomatik
Author

Rerun complete: https://github.com/jojomatik/blockcluster/runs/3124430755

Not sure if it's representative though, as it didn't have to decompress the cache (not sure if this is the right term, I hope you know what I mean) to build further layers. It just realized that every layer was already cached and pushed it to the registry again.

I've gone ahead and added a small change (jojomatik/blockcluster@511a3f1) that should trigger a new workflow with a new vue rebuild (builder 14/16) as the commit sha changed.

Seems like this worked as well. Does this confirm that my cache was somehow corrupted?

@relu

relu commented Jul 28, 2021

I'm also seeing this issue on a multi-stage image build. As you can see below, it isn't limited to cross-stage copying:

#16 exporting to oci image format
#16 exporting layers done
#16 exporting manifest sha256:45bdcb7e1f439a4b4b7d1514516b415b8a463f19abf8bd61f96a7532c6d4dbf7 done
#16 exporting config sha256:31819a14401a1cb90b358b07accd8505ce36b196b06a0b0503a58d8b0344f40f done
#16 ERROR: blob not found

#15 [base 9/9] COPY ./bin ./bin
#15 sha256:b15f66665ffaddb1d8fc6128396fd3947d9b89ef182ff28fffa6957986706015 0B / 4.47kB
#15 CACHED
------
 > exporting to oci image format:
------
error: failed to solve: blob not found
Error: buildx failed with: error: failed to solve: blob not found

Also probably worth mentioning is that these jobs run on self-hosted runners.

@fforootd

We are seeing this as well on shared runners.

#25 [prod-angular-export 1/1] COPY --from=prod-angular-build /console/dist/console .
#25 CACHED

#26 exporting to client
#26 ERROR: blob not found
------
 > exporting to client:
------
error: failed to solve: blob not found
Error: buildx failed with: error: failed to solve: blob not found

The corresponding action log is here https://github.com/caos/zitadel/runs/3175870670?check_suite_focus=true#step:4:195

We see this across different branches and jobs.

@crazy-max
Member

crazy-max commented Jul 29, 2021

Thanks for your feedback everyone.

Seems like this worked as well. Does this confirm that my cache was somehow corrupted?

@jojomatik Blobs are somehow missing, but it should not actually fail in this case. We are going to investigate this issue.

lopopolo added a commit to artichoke/docker-artichoke-nightly that referenced this issue Aug 9, 2021

There are still some bugs in the `gha` cache backend that are causing
builds to fail:

- docker/buildx#681
- docker/build-push-action#422

Removing caching until upstream resolves these issues.
@JanJakes

JanJakes commented Aug 9, 2021

Ran into what is probably the same issue, with type=gha (Buildx action with v0.6.0) and using scope. For some time the builds are OK, and then suddenly they start to fail with:

error: failed to solve: failed to compute cache key: blob not found
Error: buildx failed with: error: failed to solve: failed to compute cache key: blob not found

It fails on COPY . . in the first stage of a multi-stage build. There is almost nothing before that:

FROM node:14-buster-slim as packages
WORKDIR /build
COPY . .

And we use:

context: .
file: api/Dockerfile

@Niek

Niek commented Aug 16, 2021

Same issue here: out of nowhere, the builds started failing. Disabling the cache is our current workaround.

PiDelport added a commit to ntls-io/nautilus-wallet that referenced this issue Aug 31, 2021
There seems to be an issue currently where concurrent builds cause the
Docker GHA cache exporter to trip over GitHub throttling limits.

Background reading:

- buildx failed with: error: failed to solve: blob not found #422
docker/build-push-action#422

- Copy from previous stage fails #681
docker/buildx#681

- GHA export cache fails with mode=max #2276
moby/buildkit#2276

Try to work around this by limiting the overall concurrency to 1 for
this workflow.
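
One way the workaround described in this commit message might be expressed is with the workflow-level `concurrency` key of GitHub Actions. This is only a sketch under assumptions: the group name `docker-build` is a hypothetical placeholder, and the referenced commit may have limited concurrency in a different way.

```yaml
# Hypothetical sketch: allow only one run of this workflow at a time, so that
# concurrent builds do not export to the gha cache simultaneously.
concurrency:
  group: docker-build        # placeholder name; any stable string works
  cancel-in-progress: false  # let the in-progress run finish instead of cancelling it
```

Note that this serializes whole workflow runs, so it trades build throughput for fewer simultaneous cache exports.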
@patrikholcak

Same issue here (private project). We have been using the gha cache for about 10 days and started seeing the same issue on all builds across branches, always failing on the step that has changed from build to build. The image with all layers is around 200MB (.NET Core app) with a couple of builds per day. We also cache node_modules for frontend — in a different workflow — same way as @jojomatik does.

We have reverted to using manual buildx cache and it works but isn’t really much of an improvement compared to not caching at all.

#30 [frontend  9/10] COPY js ./js
#30 ERROR: blob not found
------
 > [frontend  9/10] COPY js ./js:
------
Dockerfile:21
--------------------
  19 |     COPY css ./css
  20 |     COPY less ./less
  21 | >>> COPY js ./js
  22 |     RUN yarn build
  23 |     
--------------------
error: failed to solve: failed to compute cache key: blob not found
Error: buildx failed with: error: failed to solve: failed to compute cache key: blob not found

runs-on: ubuntu-20.04
steps:
  - uses: actions/checkout@v2
  - uses: docker/setup-buildx-action@v1
  - uses: docker/login-action@v1
    with:
      registry: ghcr.io
  - uses: docker/build-push-action@v2
    with:
      context: .
      push: true
      tags: ghcr.io/…
      cache-from: type=gha
      cache-to: type=gha,mode=max
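
For readers unfamiliar with the "manual buildx cache" mentioned in this comment: it presumably refers to the common pattern of pairing actions/cache with BuildKit's `type=local` cache exporter. The sketch below illustrates that pattern under this assumption; the `/tmp/.buildx-cache` paths and the cache key are illustrative placeholders, not taken from the commenter's project.

```yaml
# Hypothetical sketch of a local-directory cache persisted with actions/cache.
steps:
  - uses: actions/checkout@v2
  - uses: docker/setup-buildx-action@v1
  - uses: actions/cache@v2
    with:
      path: /tmp/.buildx-cache                  # placeholder directory
      key: ${{ runner.os }}-buildx-${{ github.sha }}
      restore-keys: |
        ${{ runner.os }}-buildx-
  - uses: docker/build-push-action@v2
    with:
      context: .
      cache-from: type=local,src=/tmp/.buildx-cache
      cache-to: type=local,dest=/tmp/.buildx-cache-new,mode=max
  # Swap in the freshly exported cache so the directory does not grow
  # without bound across runs.
  - run: |
      rm -rf /tmp/.buildx-cache
      mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```

Because the whole directory is saved and restored as a single archive on every run, this approach can be noticeably slower than the gha backend, which may be part of why it felt like little improvement.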

@crazy-max changed the title from "Copy from previous stage fails" to "GitHub Actions cache fails with cache key: blob not found" on Sep 9, 2021
chrispyles added a commit to chrispyles/otter-grader that referenced this issue Sep 20, 2021
agates added a commit to Podcastindex-org/podping-hivewriter that referenced this issue Sep 20, 2021

Caching is currently causing the build to fail
See docker/buildx#681
@crazy-max
Member

Should be fixed since BuildKit 0.9.1.

@fforootd

fforootd commented Oct 5, 2021

Thanks for the info and the good work! 🎉

@crazy-max
Member

crazy-max commented Oct 5, 2021

If you encounter this kind of issue with the GHA cache again, please open an issue in the BuildKit repo, as this is mostly a backend issue. Thanks.

@Niek

Niek commented Oct 5, 2021

Should be fixed since BuildKit 0.9.1.

Do we need to specify this in the docker/setup-buildx-action step until a new version is released?

with:
  driver-opts: image=moby/buildkit:0.9.1

@crazy-max
Member

@Niek

Do we need to specify this in the docker/setup-buildx-action step until a new version is released?

Not needed; the default tag has been updated.

decaby7e added a commit to SpiffyCloud/when2meet.me that referenced this issue Oct 7, 2021

@KevinMind

Ran into this today with BuildKit v0.13.0:

https://github.com/mozilla/test-github-features/actions/runs/8294601823/job/22699933199?pr=20

@tonistiigi
Member

@KevinMind and others reacting: This is a closed issue; please don't use it to report a regression in a new release. If it is the same as moby/buildkit#4765, you can post your full issue details there. There is a candidate fix in moby/buildkit#4771 for testing.
