Spurious breakage in Gerrit CI after upgrading from 7.0.0rc2 to 7.0.0rc3 #20161
Comments
Could you test with |
Unfortunately, with this option the error is still present. Also, after I downgraded to 7.0.0rc2 (from 7.0.0rc3), it's still failing.
I also added the "-s" option and produced this verbose output: https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-chrome-latest/40258/console
@tjgq Do you have an idea? |
@bazel-io flag |
@davido Do I understand it correctly that you're building with a disk cache, but not with a remote cache? Is this build clean or incremental? Do you have any sort of process that removes entries from the disk cache between builds? |
@bazel-io fork 7.0.0 |
From the CI log, it seems like you are using a remote cache and these errors were caused by remote cache eviction. Can you check whether adding the flag |
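The flag name is cut off above, so the following is only an illustrative guess rather than the actual suggestion: one flag that does exist in Bazel 7 for this scenario is --experimental_remote_cache_eviction_retries, which makes Bazel retry the invocation when it hits remote cache eviction errors:

$ # Illustrative guess only; the flag actually suggested above was truncated.
$ bazelisk build --experimental_remote_cache_eviction_retries=5 $BAZEL_OPTS plugins:core release api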
@coeuvre how can this happen just with the local disk cache? Race between multiple workers? |
I think they are using remote cache. The flag was passed with env:
Also, |
First of all, we are using a combination of RBE and local builds; some things we can only test locally. The failing part is built locally on GCP machines. We have both options, disk cache and remote cache, see e.g.:

BAZEL_OPTS=--remote_cache=https://gerrit-ci.gerritforge.com/cache

However, we have this hidden logic on the CI side to take the remote cache out of the picture:

if git show --diff-filter=AM --name-only --pretty="" HEAD | grep -q .bazelversion
then
  export BAZEL_OPTS=""
fi

This is the part of the CI that was failing:
@lucamilanesio Are you aware of any cache evictions on the remote cache side recently? |
So, to verify that the remote cache contributes to the problem, I upgraded (again) the Bazel version from 7.0.0rc2 to 7.0.0rc3 and uploaded a new patch set (22). As explained in my previous comment, this would skip remote cache usage, and the verification was successful: [1]. I'm going to remove the changes in

[1] https://gerrit-review.googlesource.com/c/gerrit/+/387837/22
@coeuvre, adding |
@davido Can you confirm whether entries can spuriously disappear from your disk and/or remote cache in between builds? If they can, then you must use |
@lucamilanesio Can you help answer @tjgq's question?
Since it's still unclear if this is a Bazel bug, I'll remove this bug as a release blocker for 7.0. Closing #20175. |
@meteorcloudy Agreed. Let's close this then as not an issue. |
They cannot disappear from the local disk; however, once a day during the remote cache cleanups, they can be removed remotely. The step that is failing, though, did not use any remote cache: how is it possible that Bazel would assume that the cache is remote if there isn't a remote cache configured? It looks like the local cache "remembers" that it was fed by a remote cache, because the previous step actually used a remote cache for the initial build.
Well, but that isn't the case, as mentioned above. If I add the remote cache URL in the |
Reopening the issue, as we are seeing this on Gerrit CI again and a downstream issue with priority 0 was filed: 1.

Excerpt from the downstream issue: The build steps that are executed for the validation are:

#0
export BAZEL_OPTS=--remote_cache=https://gerrit-ci.gerritforge.com/cache
#1
bazelisk build $BAZEL_OPTS plugins:core release api
#2
tools/maven/api.sh install
#3
tools/eclipse/project.py --bazel bazelisk

Only the first build command above uses the remote cache; the subsequent commands don't use the remote cache and started to consistently fail on Gerrit CI after the bump of the Bazel version from 7.0.0-rc2 to 7.0.0-rc3. The second command, bazelisk build //tools/maven:gen_api_install, is the one now failing with this error:
@coeuvre @tjgq @meteorcloudy @fmeum In fact, passing:

Also note that if we pass the remote cache option to all three build commands above, they all succeed. So, in both cases (with and without the remote cache) we are using the repository cache and the disk cache, as part of the
^^^ Can it be somehow related? |
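For reference, a compact restatement of the two cases described above; the repository and disk cache flags are copied verbatim from the option dump quoted later in this thread, while the way the remote cache option would be threaded through the wrapper scripts is not shown here:

# Always in effect via the project .bazelrc (see the option dump below):
#   --repository_cache=~/.gerritcodereview/bazel-cache/repository
#   --disk_cache=~/.gerritcodereview/bazel-cache/cas
# Failing sequence: only the first command sees the remote cache.
$ bazelisk build $BAZEL_OPTS plugins:core release api   # remote cache + disk cache
$ tools/maven/api.sh install                            # disk cache only -> CacheNotFoundException
# Reported workaround: pass the remote cache option to all three commands and they all succeed.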
I can reproduce the issue locally now. As assumed, the problem is related to the disk cache. Here are the steps:
# Start a local remote cache server (bazel-remote) in Docker:
$ docker pull buchgr/bazel-remote-cache
$ docker run -u 1000:1000 -v /path/to/cache/dir:/data \
    -p 9090:8080 -p 9092:9092 buchgr/bazel-remote-cache \
    --max_size 5
# Build against the remote cache (the disk cache from .bazelrc is also in effect):
$ bazelisk build --remote_cache=http://server:9090 plugins:core release api
# Wipe the local disk cache (CAS), then run the next build step without a remote cache; it fails as shown below:
$ rm -rf ~/.gerritcodereview/bazel-cache/cas/
davido@localhost:~/projects/gerrit (master %>)$ tools/eclipse/project.py --bazel bazelisk
INFO: Invocation ID: 6084a97c-1b8d-4850-bcb1-f37c2f84fa37
INFO: Options provided by the client:
Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'info' from /home/davido/projects/gerrit/.bazelrc:
Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'info' from /home/davido/projects/gerrit/.bazelrc:
Inherited 'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Invocation ID: 3e774f39-267e-4841-8a37-b1e2890edb39
INFO: Options provided by the client:
Inherited 'common' options: --isatty=1 --terminal_columns=147
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Analyzed target //tools/eclipse:main_classpath_collect (10 packages loaded, 182 targets configured).
INFO: Found 1 target...
Target //tools/eclipse:main_classpath_collect up-to-date:
bazel-bin/tools/eclipse/main_classpath_collect.runtime_classpath
INFO: Elapsed time: 1.093s, Critical Path: 0.81s
INFO: 2 processes: 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Invocation ID: 578b1a90-ad9f-478b-98f4-20818be06888
INFO: Options provided by the client:
Inherited 'common' options: --isatty=1 --terminal_columns=147
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Analyzed target //tools/eclipse:autovalue_classpath_collect (0 packages loaded, 7 targets configured).
INFO: Found 1 target...
Target //tools/eclipse:autovalue_classpath_collect up-to-date:
bazel-bin/tools/eclipse/autovalue_classpath_collect.runtime_classpath
INFO: Elapsed time: 1.111s, Critical Path: 0.69s
INFO: 2 processes: 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Invocation ID: 0c23b3fe-303d-4076-97c6-488fbf009f94
INFO: Options provided by the client:
Inherited 'common' options: --isatty=1 --terminal_columns=147
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Analyzed target //tools/eclipse:classpath (0 packages loaded, 1 target configured).
ERROR: /home/davido/projects/gerrit/proto/testing/BUILD:4:14: Generating proto_library //proto/testing:test_proto failed: Failed to fetch blobs because they do not exist remotely.: 3 errors during bulk transfer:
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 74c97c32ccbc58b7d77ca61e6ec0d576d9f47173b3360c4f31e73a265162cd1f/4388096 for bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/com_google_protobuf/protoc
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 74c97c32ccbc58b7d77ca61e6ec0d576d9f47173b3360c4f31e73a265162cd1f/4388096 for bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/com_google_protobuf/protoc
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 74c97c32ccbc58b7d77ca61e6ec0d576d9f47173b3360c4f31e73a265162cd1f/4388096 for bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/com_google_protobuf/protoc
Target //tools/eclipse:classpath failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.845s, Critical Path: 0.28s
INFO: 9 processes: 4 internal, 2 linux-sandbox, 3 worker.
ERROR: Build did NOT complete successfully |
Good catch @davido, I truly believe that Bazel keeps some local reference indicating that the disk cache was populated from a remote source. When you no longer specify the remote source in the subsequent commands, Bazel blows up with the error you've shown, which is misleading because it isn't really a network transfer problem at all. I wrongly assumed that we had issues with our remote cache storage, but that wasn't the case.
Thanks for the repro! I am looking into the issue now. |
I understand the issue now. Since 7.0.0, Bazel uses

Looking at the error builds in the CI, the scenario might be:
For the repro, wiping out the disk cache could also trigger the error for the same reason: Bazel didn't download outputs during the last build, so when it needs an output now but fails to "download" it from the disk cache, it reports

Internally, Bazel indeed keeps some references to the disk or remote cache, because when building with

From the CI setup, it seems that you want to populate the disk cache using the remote cache during the first build. If so, I would suggest setting
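The suggested setting is cut off above; as a hedged sketch only (the flag choice here is an assumption, not necessarily what was suggested), one way to make the first build actually download outputs, so that the disk cache holds them for the later builds that run without a remote cache, is --remote_download_all:

$ # Assumption: download all outputs during the remote-cache build so they land in the disk cache.
$ bazelisk build $BAZEL_OPTS --remote_download_all plugins:core release api
$ # Subsequent steps without a remote cache can then be served from the disk cache.
$ tools/eclipse/project.py --bazel bazelisk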
This is more like a documentation issue, not a real bug in Bazel. Downgrading the priority. |
Should this be considered a breaking change in Bazel 7 compared to 6? I guess the default behaviour has changed in a non-backward-compatible way.

Thanks for the suggestions, I am adding the

That doesn't impact our build time, because we always start the build with a pre-warmed Docker image that has an initial build completed. I have actually noticed that the built image was very small compared to the previous releases, which means that a lot of data was no longer stored locally.

I agree to downgrading to a P2.
Yes, it's a breaking change. It is highlighted in the release notes: https://blog.bazel.build/2023/12/11/bazel-7-release.html#build-without-the-bytes-bwob, we probably should've made it more clear that it's a breaking change. |
Bazel expects the cache to be remote for all executions of the initial build of Gerrit, because of [1]. Failing to set the remote cache server URL would result in a file transfer failure and therefore the failure of the whole build.

Adding the "build $BAZEL_OPTS" in the user.bazelrc resolves the problem.

Also add the explicit fetch of the PolyGerrit NPM repositories, which aren't automatically fetched when the workspace is refreshed from the remote Git repository. This is mandatory to prevent the build from failing when using the local disk cache.

[1] bazelbuild/bazel#20161

Bug: Issue 316936462
Change-Id: I4cd6b4167c4025fe05d852d16f7cea5042046787
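A minimal sketch of the fix described in that commit message, assuming (as the message implies) that Gerrit's .bazelrc picks up a user.bazelrc file and that $BAZEL_OPTS holds the remote cache flag from the earlier steps:

$ # Write the remote cache option where the wrapper scripts' bazel invocations will see it.
$ echo "build $BAZEL_OPTS" > user.bazelrc
$ # user.bazelrc now contains e.g.: build --remote_cache=https://gerrit-ci.gerritforge.com/cache
$ tools/maven/api.sh install
$ tools/eclipse/project.py --bazel bazelisk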
I'm trying to upgrade to Bazel 7.0 in our iOS project. Everything works fine with Bazel 6.3.2, but when I upgraded to 7.0 I hit the same issue. As mentioned above, it seems that this problem occurs when both a disk cache and a remote cache are used, but I'm pretty sure I'm not using a disk cache or RBE. Here is the output:
Additional notes: I'm using a
passing |
@coeuvre Just ran into this with |
Just a +1 that I'm seeing a similar issue:
Our setup is a bit different though, as we're testing with 7.1.1 and:
How can we have issues downloading here, since BwoB is disabled?
Description of the bug:
Gerrit Code Review is in the process of upgrading to Bazel 7.0.0.
All was fine after the upgrade to 7.0.0rc2, see: [1].
However, after upgrading to 7.0.0rc3 we started to see this breakage on our CI:
https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-chrome-latest/40214/console
If I downgrade to 7.0.0.rc2, then the build is successful again: [1]
[1] https://gerrit-review.googlesource.com/c/gerrit/+/391534
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I cannot currently reproduce the problem locally ;-(
This command is invoked on the CI:
That creates a shell script and invokes it to publish the Plugin API artifacts in the local Maven repository.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
7.0.0rc3
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
All is fine on Bazel 7.0.0rc2. I am unable to reproduce the problem locally and thus cannot bisect.
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response