`--experimental_remote_cache_eviction_retries` doesn't evict all metadata causing build failure #18696

BalestraPatrick · 2023-06-15T15:22:55Z

Description of the bug:

Hello!

We have used --experimental_remote_cache_eviction_retries=1 with --remote_download_toplevel during the last few weeks, and we noticed from our data that in some cases, when Bazel exits with exit code 39, the following invocation will fail with the exact same error. This is not always the case (we have seen it correctly recover in some situations), but for one specific CppArchive action, we see it failing in about 1% of our builds.

The logs look like the following:

# first invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248
...
Found remote cache eviction error, retrying the build...
...
# second invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248

The only way to recover from this failure seems to be to run bazel clean or disable BwtB (--remote_download_toplevel).

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

In the gRPC log for the second invocation, we see the following:

---------------------------------------------------------
metadata {
  tool_details {
    tool_name: "bazel"
    tool_version: "6.1.0-7ccf98d" <- don't mind this, we build Bazel from source from our own fork and we currently pass the wrong version even though it's based off the 6.2.0 branch.
  }
  action_id: "7d45850557079c80512a065d2250586c4ac53cd91a468cc137e1ca59b3e1fbaa"
  tool_invocation_id: "007220b0-22de-401e-a407-5d6dd56225bf"
  correlated_invocations_id: "cdaeafdf-8dde-44e0-94a6-6dbc3c47c3a6"
  action_mnemonic: "CppArchive"
  target_id: "//path:lib"
  configuration_id: "702272d6c0c9a184ce867b340aaf98a381dfa7e5f316a7a64e42777e966f1345"
}
status {
  code: 1
  message: "com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248"
}
method_name: "google.bytestream.ByteStream/Write"
details {
  write {
  }
}
start_time {
  seconds: 1686831469
  nanos: 922000000
}
end_time {
  seconds: 1686831469
  nanos: 943000000
}

Which operating system are you running Bazel on?

macOS

What is the output of `bazel info release`?

6.2.0 @ 286306e

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

The text was updated successfully, but these errors were encountered:

brentleyjones · 2023-06-15T15:25:43Z

@coeuvre

coeuvre · 2023-06-16T12:56:17Z

That's interesting. Maybe I am wrong, but I don't think the code for cleaning Bazel's internal state relies on the specific actions, i.e. if it works for other actions, it should work for CppArchive.

For the missing artifact, how long between it was generated and requested to be downloaded by Bazel? And how long one invocation of this build usually be?

Is that possible to collect logs from server side of the access patterns for this particular blob? e.g. create time, refresh time, delete time, and etc.

brentleyjones · 2023-06-16T13:00:45Z

For the missing artifact, how long between it was generated and requested to be downloaded by Bazel? And how long one invocation of this build usually be?

I don't think it matters? The first build fails with missing cache item, and it rebuilds the whole build, and the input shouldn't be assumed to exist, so the action that generates the missing item should be re-run. That's not happening. Instead the dependent action is simply run again (which is guaranteed to fail).

coeuvre · 2023-06-16T13:04:52Z

The generating action is expected to be re-run. Do you have execution log for the generating action from the second invoction? Otherwise, it might be the case that the file was evicted again before Bazel requested to download it.

brentleyjones · 2023-06-16T13:08:07Z

Patrick should be able to provide that.

The build log shows only 2 actions though (status.txt and the dependent action), in the gRPC log for the second run there isn't any upload (or any calls really) for that blob/action, and it was part of the findMissingBlobs call, so I assume that means Bazel knows that it wasn't generated locally.

coeuvre · 2023-06-16T13:11:17Z

The build log shows only 2 actions though (status.txt and the dependent action)

This looks like a Bazel bug to me. Probably the state was not cleaned due to race conditions.

BalestraPatrick · 2023-06-16T13:34:33Z

@coeuvre I sent you a gRPC log that I had from the second invocation that caused the build to fail. I hope it's useful. I can add some logging for collecting an execution log, and hope to catch it again in the next few days.

5 was picked randomly for --experimental_remote_cache_eviction_retries in the integration test. While it is working perfectly to test the rewinding, it cannot reports transient errors (e.g. #18696) because a next retry will probably fix that. This CL changes it to 1. Also add a assert to check invocation ids from second attempt is different to catch #18694. PiperOrigin-RevId: 541855893 Change-Id: I5f07cc1ed91a328454ed4949a04ff1acf6fa98b7

brentleyjones · 2024-02-27T17:15:51Z

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

iancha1992 · 2024-02-27T19:09:31Z

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

cc: @coeuvre for visibility

coeuvre · 2024-02-28T09:31:19Z

IIRC, the log @BalestraPatrick shared with me didn't reveal the root cause.

How frequent is the issue now in Bazel 7? Can you capture the logs again given that the code has changed a lot in Bazel 7?

acecilia · 2024-04-26T17:58:53Z

We are also experiencing this using Bazel 7, but is pretty hard to reproduce, I did not manage to reproduce it yet in a way that I can troubleshoot it

oliviernotteghem · 2024-05-13T01:05:46Z

We're hitting the same issue too, and often enough that BwoB is not usable for our (local) development environment. We're using Bazel 7. Can this be prioritize?

coeuvre · 2024-05-13T13:21:26Z

It's worth trying whether this is fixed by the upcoming release 7.2 which includes eda0fe4.

scaiper · 2024-06-03T15:43:09Z

7.2.0rc2 does not fix the issue for us.

dragos · 2024-07-23T07:46:56Z

@scaiper Note that REv2 is quite explicit about this:

https://github.com/bazelbuild/remote-apis/blob/7f51b3676b5d19df726783d9a861e45d7389b5ae/build/bazel/remote/execution/v2/remote_execution.proto#L160-L165

Enabling complenessChecking is not a workaround. It's a necessity

I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.

criemen · 2024-07-23T10:59:19Z

I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.

This doesn't really vibe with https://bazel.build/remote/caching, which
a) advertises the possiblity to bring any HTTP cache
b) explicitly tells you for the google cloud storage backend that you can enable object lifecycles to automatically delete old objects.

Even if b were not on the web page, it's pretty much implict in option a - if you bring a HTTP cache, you need to clean it up, you can't have unbounded growth of your cache. So what do you do? Expire the objects that either have lived the longest, or some sort of LRU if your cache is fancier. Bazel ought to be able to handle both, or otherwise it needs to be explicitly stated that these strategies are not supported, and you need to defer to special bazel cache software instead.

criemen · 2024-09-05T10:24:00Z

@coeuvre I'm not sure this is 100% necessary - if using a HTTP cache like buchgr/bazel-remote, that does understand CAS and AC relationships, and it can filter out AC results that reference expired CAS hits. Therefore it'd be great to be able to force-enable BwoB in those cases (due to havingto connect through nginx (outside of my control), I can't connect via grpc to my bazel cache service instance, but I'd like to continue using Bwob, as nginx doesn't proxy http2 🙄 ).

coeuvre · 2024-09-05T11:33:30Z

I will revert the change and explore other possible solutions.

tgeng · 2024-09-06T01:27:20Z

Will the revert (b899783) be cherry-picked to 7.4.0 release?

coeuvre · 2024-09-06T08:07:18Z

The original commit is not cherry-picked to 7.4.0 so cherry-pick for the revert is not necessary.

coeuvre · 2024-09-09T13:10:37Z

@bazel-io fork 7.4.0

Previously, it's possible for an HTTP cache to delete CAS entries referenced by AC without deleting the AC itself because HTTP cache doesn't understand the relationship between AC and CAS. This could result in permanent build errors because Bazel always trust the AC from remote cache assuming all referenced CAS entries exist. Now, we record the digest of lost blobs before build rewinding, so that during the next build, we can ignore the stale AC and continue with execution. Fixes bazelbuild#18696. RELNOTES: Added support for using a remote cache that evicts blobs and doesn't have AC integrity check (e.g. HTTP cache). PiperOrigin-RevId: 672536163 Change-Id: Ic1271431d28333f6d86e5963542d15a133075157

Previously, it's possible for an HTTP cache to delete CAS entries referenced by AC without deleting the AC itself because HTTP cache doesn't understand the relationship between AC and CAS. This could result in permanent build errors because Bazel always trust the AC from remote cache assuming all referenced CAS entries exist. Now, we record the digest of lost blobs before build rewinding, so that during the next build, we can ignore the stale AC and continue with execution. Fixes #18696. RELNOTES: Added support for using a remote cache that evicts blobs and doesn't have AC integrity check (e.g. HTTP cache). PiperOrigin-RevId: 672536163 Change-Id: Ic1271431d28333f6d86e5963542d15a133075157 Commit 5d81579 Co-authored-by: Googler <chiwang@google.com>

iancha1992 · 2024-10-11T21:10:54Z

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

acecilia · 2024-10-16T16:21:04Z

I am testing 7.4.0 RC2. Will update here if I observe build failures. If I dont say anything consider it as a validation that the fix works as expected 😄

acecilia · 2024-10-30T11:03:59Z

👋 Using 7.4.0rc2 I continue observing errors like:

ERROR: /Users/.../MyTarget/BUILD.bazel:3:16: Bundling, processing and signing MyTarget.__internal__.__test_bundle failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356/3 for bazel-out/ios_sim_arm64-fastbuild-ios-sim_arm64-min13.0-applebin_ios-ST-37c686bf6546/bin/.../OtherTarget.bundle/response.json

My configuration is:

Without customization of --experimental_remote_cache_eviction_retries, so it defaults to 5
Using --remote_download_toplevel
Use a google cloud bucket as remote cache

The thing is: I only see the error once in the logs, but I would expect to see it 5 times due to the default value of --experimental_remote_cache_eviction_retries 🤔

coeuvre · 2024-10-30T15:10:55Z

Since you mentioned the error only appeared once in the log, does it mean the build completed successfully after retries? If not, what's the error message?

acecilia · 2024-10-30T16:55:29Z

Nope, the build failed after the Failed to fetch blobs because they do not exist remotely error. So I am wondering if the default value 5 of --experimental_remote_cache_eviction_retries is somehow not real 🤔

coeuvre · 2024-10-30T17:17:45Z

Argh, I think --experimental_remote_cache_eviction_retries only defaults to 5 in 8.x. Can you try with explicitly setting the flag?

acecilia · 2024-10-30T17:38:03Z

Aha, You are right, that explains it. Thanks! I will set it and continue monitoring 🤞

acecilia · 2024-11-30T22:06:44Z

👋 Did not see the error again since last comment. Thanks! 🙏

BalestraPatrick added type: bug untriaged labels Jun 15, 2023

BalestraPatrick assigned sgowroji and Pavank1992 Jun 15, 2023

iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 16, 2023

iancha1992 unassigned sgowroji and Pavank1992 Jun 16, 2023

coeuvre self-assigned this Jun 16, 2023

coeuvre added P1 I'll work on this now. (Assignee required) and removed untriaged labels Jun 16, 2023

aherrmann mentioned this issue Feb 4, 2024

feat: implicit bazel_builtin package dependency aherrmann/rules_zig#211

Merged

coeuvre added P2 We'll consider working on this in future. (Assignee optional) more data needed and removed P1 I'll work on this now. (Assignee required) labels Feb 28, 2024

acecilia mentioned this issue May 2, 2024

Constant "remote cache eviction error" message #22220

Closed

copybara-service bot closed this as completed in 3e464fc Sep 5, 2024

coeuvre reopened this Sep 5, 2024

copybara-service bot closed this as completed in b899783 Sep 5, 2024

tjgq reopened this Sep 5, 2024

tjgq mentioned this issue Sep 5, 2024

Updated docs to use mdbook bazelbuild/rules_rust#2834

Merged

bazel-io mentioned this issue Sep 9, 2024

[7.4.0] --experimental_remote_cache_eviction_retries doesn't evict all metadata causing build failure #23552

Closed

copybara-service bot closed this as completed in 5d81579 Sep 9, 2024

iancha1992 mentioned this issue Sep 9, 2024

[7.4.0] Make HTTP cache works with cache eviction. #23561

Merged

tjgq mentioned this issue Sep 27, 2024

Bazel 7 unable to finalize action due to missing digest for .d files when --experimental_inmemory_dotd_files is set. #22387

Closed

coeuvre mentioned this issue Oct 1, 2024

unable to finalize action: Missing digest: <hash>/<len> for ...jdeps #22854

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`--experimental_remote_cache_eviction_retries` doesn't evict all metadata causing build failure #18696

`--experimental_remote_cache_eviction_retries` doesn't evict all metadata causing build failure #18696

BalestraPatrick commented Jun 15, 2023

brentleyjones commented Jun 15, 2023

coeuvre commented Jun 16, 2023

brentleyjones commented Jun 16, 2023

coeuvre commented Jun 16, 2023 •

edited

Loading

brentleyjones commented Jun 16, 2023 •

edited

Loading

coeuvre commented Jun 16, 2023

BalestraPatrick commented Jun 16, 2023

brentleyjones commented Feb 27, 2024

iancha1992 commented Feb 27, 2024

coeuvre commented Feb 28, 2024

acecilia commented Apr 26, 2024 •

edited

Loading

oliviernotteghem commented May 13, 2024

coeuvre commented May 13, 2024

scaiper commented Jun 3, 2024

dragos commented Jul 23, 2024

criemen commented Jul 23, 2024

criemen commented Sep 5, 2024 •

edited

Loading

coeuvre commented Sep 5, 2024

tgeng commented Sep 6, 2024 •

edited

Loading

coeuvre commented Sep 6, 2024

coeuvre commented Sep 9, 2024

iancha1992 commented Oct 11, 2024

acecilia commented Oct 16, 2024

acecilia commented Oct 30, 2024 •

edited

Loading

coeuvre commented Oct 30, 2024 •

edited

Loading

acecilia commented Oct 30, 2024

coeuvre commented Oct 30, 2024

acecilia commented Oct 30, 2024

acecilia commented Nov 30, 2024

--experimental_remote_cache_eviction_retries doesn't evict all metadata causing build failure #18696

--experimental_remote_cache_eviction_retries doesn't evict all metadata causing build failure #18696

Comments

BalestraPatrick commented Jun 15, 2023

Description of the bug:

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Which operating system are you running Bazel on?

What is the output of bazel info release?

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

brentleyjones commented Jun 15, 2023

coeuvre commented Jun 16, 2023

brentleyjones commented Jun 16, 2023

coeuvre commented Jun 16, 2023 • edited Loading

brentleyjones commented Jun 16, 2023 • edited Loading

coeuvre commented Jun 16, 2023

BalestraPatrick commented Jun 16, 2023

brentleyjones commented Feb 27, 2024

iancha1992 commented Feb 27, 2024

coeuvre commented Feb 28, 2024

acecilia commented Apr 26, 2024 • edited Loading

oliviernotteghem commented May 13, 2024

coeuvre commented May 13, 2024

scaiper commented Jun 3, 2024

dragos commented Jul 23, 2024

criemen commented Jul 23, 2024

criemen commented Sep 5, 2024 • edited Loading

coeuvre commented Sep 5, 2024

tgeng commented Sep 6, 2024 • edited Loading

coeuvre commented Sep 6, 2024

coeuvre commented Sep 9, 2024

iancha1992 commented Oct 11, 2024

acecilia commented Oct 16, 2024

acecilia commented Oct 30, 2024 • edited Loading

coeuvre commented Oct 30, 2024 • edited Loading

acecilia commented Oct 30, 2024

coeuvre commented Oct 30, 2024

acecilia commented Oct 30, 2024

acecilia commented Nov 30, 2024

`--experimental_remote_cache_eviction_retries` doesn't evict all metadata causing build failure #18696

`--experimental_remote_cache_eviction_retries` doesn't evict all metadata causing build failure #18696

What is the output of `bazel info release`?

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

coeuvre commented Jun 16, 2023 •

edited

Loading

brentleyjones commented Jun 16, 2023 •

edited

Loading

acecilia commented Apr 26, 2024 •

edited

Loading

criemen commented Sep 5, 2024 •

edited

Loading

tgeng commented Sep 6, 2024 •

edited

Loading

acecilia commented Oct 30, 2024 •

edited

Loading

coeuvre commented Oct 30, 2024 •

edited

Loading