[7.0.0] Flaky builds with remote cache #20559
Comments
It looks like the connection was closed when Bazel was downloading the action outputs. Some clarifications:
- The build with
- No.
- Google Cloud Storage.
- and the
Not sure if it's related, but the Gerrit Code Review project is also affected by some changed behaviour in the remote-cache code area after upgrading to Bazel 7.0.0; see this issue.
(FWIW, I don't think #20161 is related; we've narrowed that one down and it has a very different nature.)

The exception occurs fairly deep in our networking dependencies, so there isn't a whole lot I can suggest to augment the logging. (I suppose we could try to plumb through some information to narrow down which particular blobs failed to download, but this looks like a transport-level issue rather than an application-level one, so I'm unsure how much that would help.)

Since this issue isn't present in 6.x, I wonder if it was introduced by the netty upgrade in ff1dc3b. Would you be able to give a Bazel built at the parent commit (2e34965) a try? Bazelisk can help with this. But do note that the commit is 7 months old at this point, so it might be incompatible with your project / take some work to make compatible.
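The Bazelisk suggestion above can be sketched as follows. This assumes Bazelisk's USE_BAZEL_VERSION environment variable accepts a Bazel commit hash built by Bazel CI; it may require the full 40-character hash rather than the short 2e34965 quoted in the comment.

```shell
# Sketch: pin Bazelisk to a Bazel built at a specific commit.
# 2e34965 is the parent of the netty upgrade (ff1dc3b) mentioned above;
# Bazelisk may require the full commit hash here.
export USE_BAZEL_VERSION=2e34965
# then run the build with that pinned Bazel, e.g.: bazelisk build //...
echo "$USE_BAZEL_VERSION"
```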
I tried this commit. It did not work out of the box, and I am not going to try very hard to make it work. We are going to revert to 6.4.0 in more of our builds, because the flakes are too frequent with 7.0.0.
I also saw what I think is the same issue with Bazel 6.2.1 when we tried to use
Assuming the "connection reset" error is transient and not permanent, this change to the retry logic should fix this issue, and will be in the 7.2.1 release.
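The fix above hinges on classifying "connection reset" as a transient, retryable error. A minimal sketch of that pattern (illustrative only; the class and method names here are hypothetical, not Bazel's actual RemoteRetrier API) looks like:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch of retry-on-transient-error logic; names are illustrative,
// not Bazel's actual remote-cache retrier.
final class TransientRetry {
    // Treat connection resets and timeouts as transient (retryable).
    static boolean isRetryable(Throwable t) {
        String msg = t.getMessage();
        return t instanceof IOException
                && msg != null
                && (msg.contains("Connection reset") || msg.contains("timed out"));
    }

    // Retry the call up to maxAttempts times, backing off between attempts;
    // non-retryable errors (and the final failed attempt) propagate.
    static <T> T callWithRetries(Callable<T> call, int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!isRetryable(e) || attempt == maxAttempts) {
                    throw e;
                }
                Thread.sleep(50L * attempt); // simple linear backoff
            }
        }
        throw new IllegalStateException("maxAttempts must be >= 1");
    }
}
```

The key behavioral change discussed in this thread is only the classification step: previously a reset connection fell into the non-retryable branch and failed the build outright.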
7.2.1 has been released and should contain a fix for this bug. Please comment on this issue if the problem persists. |
I just got this in a build with Bazel 7.2.1:
https://github.com/googleapis/google-cloud-cpp/runs/26956528060
Address #20559 (comment). PiperOrigin-RevId: 649360592 Change-Id: I50ac2ed3b54d6cffb5f96d3f8315279786c4f30f
6e2e3a1 should address this. |
That seems to make the timeout retryable and should fix the specific issue. Thanks. However, I wonder why the build stops on any caching error. Shouldn't that be treated as a cache miss, and the build continue? Is there a separate bug to fix that problem?
What you described applies to the case where Bazel downloads the outputs of a remotely cached action. In this specific case, however, the download happened while Bazel was fetching remote inputs for a local action. If that download times out, Bazel has no choice but to stop the build.
Something is going over my head. The error I had in mind was:
The thing Bazel is trying to download is an artifact in the remote cache. Granted, it is the input for some action X, but it seems to me that it was created as the result of some action Y. Is there a reason Bazel cannot fall back to executing the action that created the (cached) artifact?
Description of the bug:
Since we moved our builds to Bazel 7.0.0, our builds are randomly failing with "I/O exception during sandboxed execution" messages. We started to collect the failures on googleapis/google-cloud-cpp#13315. A good example may be:
I speculate this may be the remote cache, but of course I could be wrong.
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No simple repro at the moment, sorry.
If there is a way to enable more detailed logging I could do that on our builds and report any results on the next failure.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?

Starting local Bazel server and connecting to it...
Build label: 7.0.0
Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer
Build time: Mon Dec 11 16:51:49 2023 (1702313509)
Build timestamp: 1702313509
Build timestamp as int: 1702313509
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?

No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response