bazel dynamic execution failures with (at least) buildfarm #12364
Comments
@jmmv for visibility
I can reproduce with EngFlow RE.
The problem is the remote cache. The RemoteModule implicitly enables the remote cache when
I don't see a way to disable the remote cache with the existing flags. :-( The longer explanation is that enabling dynamic execution like this attempts to run remote execution and local execution with the remote cache in parallel, and that doesn't make sense.
I can change AbstractSpawnStrategy:117 like this to avoid the error:
The question is whether that's correct.
Fixes bazelbuild#12364. Change-Id: Id664635041392f19710e7bc064b8c0b03111d5bc
It is correct, but more by accident than by design. I was wondering about the combination of dynamic execution with local execution plus disk cache and remote execution. However, this is an unsupported combination at this time and results in an error in RemoteModule (which also handles the disk cache).
I have a potential fix here: ulfjack@65d308d Btw. I diagnosed this by changing
When dynamic execution is active, disable the spawn cache.

The RemoteModule always provides a remote spawn cache when remote execution is active. However, when dynamic execution is active, we don't want to use remote execution *and* local execution with remote caching. In fact, trying to use both leads to an assertion error in the new dynamic execution implementation.

The downside of this change is that it no longer allows us to use the remote cache for actions that disallow remote execution. It also doesn't allow using the disk cache for the local branch (although that is also currently not allowed in the RemoteModule).

Fixes bazelbuild#12364.
Change-Id: I5aa192143604b01a65d68ac73f1d363957503396
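To make the trade-off in this commit message concrete, here is a toy model in Python. It is not Bazel's code: `BuildConfig` and `spawn_cache_enabled` are hypothetical names, and the model only considers the spawn cache that remote execution enables implicitly, ignoring the disk cache and any explicitly configured remote cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical stand-in for the relevant build options."""
    remote_execution: bool
    dynamic_execution: bool


def spawn_cache_enabled(cfg: BuildConfig) -> bool:
    """Toy model: is the implicit remote spawn cache offered to local spawns?"""
    if not cfg.remote_execution:
        # Simplifying assumption: without remote execution there is no
        # implicitly enabled remote spawn cache to reason about.
        return False
    # Proposed behavior: when dynamic execution is active, do not combine
    # remote execution with local execution plus remote caching.
    return not cfg.dynamic_execution
```

Under this model, a plain remote-execution build keeps the spawn cache, while a dynamic-execution build loses it. That loss is the downside the message mentions for actions that disallow remote execution.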
An alternative fix is to disable the spawn cache in the DynamicExecutionModule:
Neither of these fixes is ideal. The entire class hierarchy isn't ideal.
Yes, remote execution implicitly enables the remote cache. IMHO, dynamic execution should disable the remote cache for local execution in this case.
Lars has prepared a fix in https://bazel-review.googlesource.com/c/bazel/+/145250
Thanks, @larsrc-google!
Fixes bazelbuild#12364 (caused by bazelbuild@25e58ff) based on Ulf's approach, but avoiding secretly disabling some potential caches that should not be disabled.
RELNOTES: n/a
PiperOrigin-RevId: 341588598
Description of the problem / feature request:
We have had some luck using Bazel's dynamic execution (e.g. --experimental_spawn_scheduler) with buildfarm to compile C++ code, with Bazel 3.1.0 and earlier. With later Bazel versions, we have started to see builds fail when using it. The failures are more difficult to reproduce with the "old" dynamic execution mechanism (--internal_spawn_scheduler --spawn_strategy=dynamic), but they seem to be easy to reproduce with the "new" dynamic execution mechanism (--internal_spawn_scheduler --spawn_strategy=dynamic --legacy_spawn_scheduler=false).
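For reference, the two flag sets quoted above can be assembled into command lines as follows. This is only a sketch: the helper name `bazel_build_command` and the `//...` target are placeholders, the remote execution options that a real run would also need are left out, and the `BAZEL` override mirrors the convention described in the README.

```python
import os

# Flag sets quoted from the report above.
OLD_DYNAMIC_FLAGS = ["--internal_spawn_scheduler", "--spawn_strategy=dynamic"]
NEW_DYNAMIC_FLAGS = OLD_DYNAMIC_FLAGS + ["--legacy_spawn_scheduler=false"]


def bazel_build_command(flags, target="//...", environ=None):
    """Return an argv list for `bazel build` with the given flags.

    The BAZEL environment variable overrides the binary name, as in the
    README. The default target is a placeholder, not the one the attached
    scripts use.
    """
    env = os.environ if environ is None else environ
    bazel = env.get("BAZEL", "bazel")
    return [bazel, "build", *flags, target]
```

Passing `NEW_DYNAMIC_FLAGS` selects the "new" mechanism; the only difference from the "old" one is the trailing `--legacy_spawn_scheduler=false`.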
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Please see the attached example.tar.gz.
The enclosed README file has instructions, which I will repeat here:

Here are directions for reproducing issues with the new version of the
experimental spawn scheduler in conjunction with (at least) buildfarm.
For best results, you will likely need two systems: One to run the
buildfarm processes, and one to run the bazel client.
I happened to use the following: a 36-cpu desktop system with 128 GB RAM
to run the buildfarm processes, and an 8-cpu AWS instance with 64 GB RAM
to run the bazel client. Both were running similar configurations of
Ubuntu 18.04 (similar compilers, etc).
The scripts herein default to running bazel as simply "bazel". You can
set the environment variable BAZEL to override this.
On the system where you will be running buildfarm, clone the buildfarm
repo and run the enclosed "bf-mem-example" script. Note that this
script will run bazel, so you might need to set BAZEL to help it to find
the bazel that you want it to use.
On the system where you will be running the bazel client, you will need
all the rest of the files enclosed here. As above, you might need to set
BAZEL to help the scripts find the bazel that you want them to use. In
addition, you will need to set BUILDFARM to the DNS name or IP address
of the system where you are running buildfarm.
First, build the tree with remote execution:
This should complete without any problems.
The try.{0,1,2,3} scripts will do the same build, but with various spawn
scheduler configurations. To use the simplest configuration with the
new version of the spawn scheduler, use try.1:
It does not always fail on the first try, but it nearly always fails
without too many repetitions.
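Because the failure is intermittent, it can help to rerun the script in a loop until it fails. The following is a minimal sketch, assuming the try.1 script exits with a non-zero status when the build fails (the README does not say so explicitly); `run_until_failure` is a hypothetical helper, not part of the attached example.

```python
import subprocess


def run_until_failure(cmd, max_attempts=10):
    """Run cmd repeatedly until it fails.

    Returns the 1-based number of the first failing attempt, or None if
    every attempt up to max_attempts succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

For example, `run_until_failure(["./try.1"], max_attempts=20)` would stop at the first failing build and report which attempt it was.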
What operating system are you running Bazel on?
Ubuntu 18.04
What's the output of bazel info release?
release 3.7.0
Have you found anything relevant by searching the web?
The following bazel-discuss thread: https://groups.google.com/g/bazel-discuss/c/xEWci2lcTzw/m/hZJJ1LPiBgAJ
Any other information, logs, or outputs that you want to share?
Here is an example failure produced with bazel-3.7.0 using the try.1 script in the attached example: