Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--experimental_remote_cache_eviction_retries doesn't evict all metadata causing build failure #18696

Closed
BalestraPatrick opened this issue Jun 15, 2023 · 47 comments
Assignees
Labels
more data needed P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@BalestraPatrick
Copy link
Member

Description of the bug:

Hello!

We have used --experimental_remote_cache_eviction_retries=1 with --remote_download_toplevel during the last few weeks, and we noticed from our data that in some cases, when Bazel exits with exit code 39, the following invocation will fail with the exact same error. This is not always the case (we have seen it correctly recover in some situations), but for one specific CppArchive action, we see it failing in about 1% of our builds.

The logs look like the following:

# first invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248
...
Found remote cache eviction error, retrying the build...
...
# second invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248

The only way to recover from this failure seems to be to run bazel clean or disable BwtB (--remote_download_toplevel).

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

In the gRPC log for the second invocation, we see the following:

---------------------------------------------------------
metadata {
  tool_details {
    tool_name: "bazel"
    tool_version: "6.1.0-7ccf98d" <- don't mind this, we build Bazel from source from our own fork and we currently pass the wrong version even though it's based off the 6.2.0 branch.
  }
  action_id: "7d45850557079c80512a065d2250586c4ac53cd91a468cc137e1ca59b3e1fbaa"
  tool_invocation_id: "007220b0-22de-401e-a407-5d6dd56225bf"
  correlated_invocations_id: "cdaeafdf-8dde-44e0-94a6-6dbc3c47c3a6"
  action_mnemonic: "CppArchive"
  target_id: "//path:lib"
  configuration_id: "702272d6c0c9a184ce867b340aaf98a381dfa7e5f316a7a64e42777e966f1345"
}
status {
  code: 1
  message: "com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248"
}
method_name: "google.bytestream.ByteStream/Write"
details {
  write {
  }
}
start_time {
  seconds: 1686831469
  nanos: 922000000
}
end_time {
  seconds: 1686831469
  nanos: 943000000
}

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

6.2.0 @ 286306e

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@brentleyjones
Copy link
Contributor

@coeuvre

@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 16, 2023
@coeuvre
Copy link
Member

coeuvre commented Jun 16, 2023

That's interesting. Maybe I am wrong, but I don't think the code for cleaning Bazel's internal state relies on the specific actions, i.e. if it works for other actions, it should work for CppArchive.

For the missing artifact, how long between it was generated and requested to be downloaded by Bazel? And how long one invocation of this build usually be?

Is that possible to collect logs from server side of the access patterns for this particular blob? e.g. create time, refresh time, delete time, and etc.

@brentleyjones
Copy link
Contributor

For the missing artifact, how long between it was generated and requested to be downloaded by Bazel? And how long one invocation of this build usually be?

I don't think it matters? The first build fails with missing cache item, and it rebuilds the whole build, and the input shouldn't be assumed to exist, so the action that generates the missing item should be re-run. That's not happening. Instead the dependent action is simply run again (which is guaranteed to fail).

@coeuvre
Copy link
Member

coeuvre commented Jun 16, 2023

The generating action is expected to be re-run. Do you have execution log for the generating action from the second invoction? Otherwise, it might be the case that the file was evicted again before Bazel requested to download it.

@brentleyjones
Copy link
Contributor

brentleyjones commented Jun 16, 2023

Patrick should be able to provide that.

The build log shows only 2 actions though (status.txt and the dependent action), in the gRPC log for the second run there isn't any upload (or any calls really) for that blob/action, and it was part of the findMissingBlobs call, so I assume that means Bazel knows that it wasn't generated locally.

@coeuvre
Copy link
Member

coeuvre commented Jun 16, 2023

The build log shows only 2 actions though (status.txt and the dependent action)

This looks like a Bazel bug to me. Probably the state was not cleaned due to race conditions.

@coeuvre coeuvre self-assigned this Jun 16, 2023
@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed untriaged labels Jun 16, 2023
@BalestraPatrick
Copy link
Member Author

@coeuvre I sent you a gRPC log that I had from the second invocation that caused the build to fail. I hope it's useful. I can add some logging for collecting an execution log, and hope to catch it again in the next few days.

copybara-service bot pushed a commit that referenced this issue Jun 20, 2023
5 was picked randomly for --experimental_remote_cache_eviction_retries in the integration test. While it is working perfectly to test the rewinding, it cannot reports transient errors (e.g. #18696) because a next retry will probably fix that. This CL changes it to 1.

Also add a assert to check invocation ids from second attempt is different to catch #18694.

PiperOrigin-RevId: 541855893
Change-Id: I5f07cc1ed91a328454ed4949a04ff1acf6fa98b7
@brentleyjones
Copy link
Contributor

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

@iancha1992
Copy link
Member

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

cc: @coeuvre for visibility

@coeuvre
Copy link
Member

coeuvre commented Feb 28, 2024

IIRC, the log @BalestraPatrick shared with me didn't reveal the root cause.

How frequent is the issue now in Bazel 7? Can you capture the logs again given that the code has changed a lot in Bazel 7?

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) more data needed and removed P1 I'll work on this now. (Assignee required) labels Feb 28, 2024
@acecilia
Copy link

acecilia commented Apr 26, 2024

We are also experiencing this using Bazel 7, but is pretty hard to reproduce, I did not manage to reproduce it yet in a way that I can troubleshoot it

@oliviernotteghem
Copy link

We're hitting the same issue too, and often enough that BwoB is not usable for our (local) development environment. We're using Bazel 7. Can this be prioritize?

@coeuvre
Copy link
Member

coeuvre commented May 13, 2024

It's worth trying whether this is fixed by the upcoming release 7.2 which includes eda0fe4.

@scaiper
Copy link

scaiper commented Jun 3, 2024

7.2.0rc2 does not fix the issue for us.

@dragos
Copy link

dragos commented Jul 23, 2024

@scaiper Note that REv2 is quite explicit about this:

https://github.com/bazelbuild/remote-apis/blob/7f51b3676b5d19df726783d9a861e45d7389b5ae/build/bazel/remote/execution/v2/remote_execution.proto#L160-L165

Enabling complenessChecking is not a workaround. It's a necessity

I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.

@criemen
Copy link
Contributor

criemen commented Jul 23, 2024

I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.

This doesn't really vibe with https://bazel.build/remote/caching, which
a) advertises the possiblity to bring any HTTP cache
b) explicitly tells you for the google cloud storage backend that you can enable object lifecycles to automatically delete old objects.

Even if b were not on the web page, it's pretty much implict in option a - if you bring a HTTP cache, you need to clean it up, you can't have unbounded growth of your cache. So what do you do? Expire the objects that either have lived the longest, or some sort of LRU if your cache is fancier. Bazel ought to be able to handle both, or otherwise it needs to be explicitly stated that these strategies are not supported, and you need to defer to special bazel cache software instead.

@criemen
Copy link
Contributor

criemen commented Sep 5, 2024

@coeuvre I'm not sure this is 100% necessary - if using a HTTP cache like buchgr/bazel-remote, that does understand CAS and AC relationships, and it can filter out AC results that reference expired CAS hits. Therefore it'd be great to be able to force-enable BwoB in those cases (due to havingto connect through nginx (outside of my control), I can't connect via grpc to my bazel cache service instance, but I'd like to continue using Bwob, as nginx doesn't proxy http2 🙄 ).

@coeuvre coeuvre reopened this Sep 5, 2024
@coeuvre
Copy link
Member

coeuvre commented Sep 5, 2024

I will revert the change and explore other possible solutions.

@tgeng
Copy link
Contributor

tgeng commented Sep 6, 2024

Will the revert (b899783) be cherry-picked to 7.4.0 release?

@coeuvre
Copy link
Member

coeuvre commented Sep 6, 2024

The original commit is not cherry-picked to 7.4.0 so cherry-pick for the revert is not necessary.

@coeuvre
Copy link
Member

coeuvre commented Sep 9, 2024

@bazel-io fork 7.4.0

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Sep 9, 2024
Previously, it's possible for an HTTP cache to delete CAS entries referenced by AC without deleting the AC itself because HTTP cache doesn't understand the relationship between AC and CAS. This could result in permanent build errors because Bazel always trust the AC from remote cache assuming all referenced CAS entries exist.

Now, we record the digest of lost blobs before build rewinding, so that during the next build, we can ignore the stale AC and continue with execution.

Fixes bazelbuild#18696.

RELNOTES: Added support for using a remote cache that evicts blobs and doesn't have AC integrity check (e.g. HTTP cache).
PiperOrigin-RevId: 672536163
Change-Id: Ic1271431d28333f6d86e5963542d15a133075157
github-merge-queue bot pushed a commit that referenced this issue Sep 10, 2024
Previously, it's possible for an HTTP cache to delete CAS entries
referenced by AC without deleting the AC itself because HTTP cache
doesn't understand the relationship between AC and CAS. This could
result in permanent build errors because Bazel always trust the AC from
remote cache assuming all referenced CAS entries exist.

Now, we record the digest of lost blobs before build rewinding, so that
during the next build, we can ignore the stale AC and continue with
execution.

Fixes #18696.

RELNOTES: Added support for using a remote cache that evicts blobs and
doesn't have AC integrity check (e.g. HTTP cache).
PiperOrigin-RevId: 672536163
Change-Id: Ic1271431d28333f6d86e5963542d15a133075157

Commit
5d81579

Co-authored-by: Googler <chiwang@google.com>
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

@acecilia
Copy link

I am testing 7.4.0 RC2. Will update here if I observe build failures. If I dont say anything consider it as a validation that the fix works as expected 😄

@acecilia
Copy link

acecilia commented Oct 30, 2024

👋 Using 7.4.0rc2 I continue observing errors like:

ERROR: /Users/.../MyTarget/BUILD.bazel:3:16: Bundling, processing and signing MyTarget.__internal__.__test_bundle failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356/3 for bazel-out/ios_sim_arm64-fastbuild-ios-sim_arm64-min13.0-applebin_ios-ST-37c686bf6546/bin/.../OtherTarget.bundle/response.json

My configuration is:

  • Without customization of --experimental_remote_cache_eviction_retries, so it defaults to 5
  • Using --remote_download_toplevel
  • Use a google cloud bucket as remote cache

The thing is: I only see the error once in the logs, but I would expect to see it 5 times due to the default value of --experimental_remote_cache_eviction_retries 🤔

@coeuvre
Copy link
Member

coeuvre commented Oct 30, 2024

Since you mentioned the error only appeared once in the log, does it mean the build completed successfully after retries? If not, what's the error message?

@acecilia
Copy link

Nope, the build failed after the Failed to fetch blobs because they do not exist remotely error. So I am wondering if the default value 5 of --experimental_remote_cache_eviction_retries is somehow not real 🤔

@coeuvre
Copy link
Member

coeuvre commented Oct 30, 2024

Argh, I think --experimental_remote_cache_eviction_retries only defaults to 5 in 8.x. Can you try with explicitly setting the flag?

@acecilia
Copy link

Aha, You are right, that explains it. Thanks! I will set it and continue monitoring 🤞

@acecilia
Copy link

👋 Did not see the error again since last comment. Thanks! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
more data needed P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests