bazel dynamic execution failures with (at least) buildfarm #12364

Closed
dws opened this issue Oct 27, 2020 · 11 comments
Assignees
Labels
P1 (I'll work on this now; assignee required), team-Performance (issues for Performance teams), type: bug

Comments

@dws
Contributor

dws commented Oct 27, 2020

Description of the problem / feature request:

We have had some luck using Bazel's dynamic execution (e.g. --experimental_spawn_scheduler) with buildfarm to compile C++ code, with Bazel 3.1.0 and earlier. With later Bazel versions, builds have started to fail when dynamic execution is enabled. The failures are harder to reproduce with the "old" dynamic execution mechanism (--internal_spawn_scheduler --spawn_strategy=dynamic), but they are easy to reproduce with the "new" one (--internal_spawn_scheduler --spawn_strategy=dynamic --legacy_spawn_scheduler=false).

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Please see the attached example.tar.gz.

example.tar.gz

The enclosed README file has instructions, which I will repeat here:

Here are directions for reproducing issues with the new version of the
experimental spawn scheduler in conjunction with (at least) buildfarm.

For best results, you will likely need two systems: One to run the
buildfarm processes, and one to run the bazel client.

I happened to use the following: a 36-cpu desktop system with 128 GB RAM
to run the buildfarm processes, and an 8-cpu AWS instance with 64 GB RAM
to run the bazel client. Both were running similar configurations of
Ubuntu 18.04 (similar compilers, etc).

The scripts herein default to running bazel as simply "bazel". You can
set the environment variable BAZEL to override this.

On the system where you will be running buildfarm, clone the buildfarm
repo and run the enclosed "bf-mem-example" script. Note that this
script will run bazel, so you might need to set BAZEL to help it to find
the bazel that you want it to use.

    git clone git@github.com:bazelbuild/bazel-buildfarm.git bazel-buildfarm0
    cd bazel-buildfarm0
    # run the bf-mem-example script enclosed herein

On the system where you will be running the bazel client, you will need
all the rest of the files enclosed here. As above, you might need to set
BAZEL to help the scripts find the bazel that you want them to use. In
addition, you will need to set BUILDFARM to the DNS name or IP address
of the system where you are running buildfarm.

First, build the tree with remote execution:

    ./try.baseline

This should complete without any problems.

The try.{0,1,2,3} scripts will do the same build, but with various spawn
scheduler configurations. To use the simplest configuration with the
new version of the spawn scheduler, use try.1:

    ./try.1

It does not always fail on the first try, but it nearly always fails
without too many repetitions.

What operating system are you running Bazel on?

Ubuntu 18.04

What's the output of bazel info release?

release 3.7.0

Have you found anything relevant by searching the web?

The following bazel-discuss thread: https://groups.google.com/g/bazel-discuss/c/xEWci2lcTzw/m/hZJJ1LPiBgAJ

Any other information, logs, or outputs that you want to share?

Here is an example failure produced with bazel-3.7.0 using the try.1 script in the attached example:

+ . try.rc
++ bazelrc=bazel.rc
++ cat bazel.rc
build --curses=no
build --color=no

# Dynamic execution
#
# --config=dynamic-execution0   --experimental_spawn_scheduler with --local_cpu_resources=HOST_CPUS*0.75 --local_ram_resources=HOST_RAM*0.75
# --config=dynamic-execution1   Use the new dynamic scheduler
# --config=dynamic-execution2   Use the new dynamic scheduler and --experimental_local_lockfree_output
# --config=dynamic-execution3   Use the new dynamic scheduler and --experimental_local_lockfree_output and --experimental_local_execution_delay=1000
#
# Note that units of --experimental_local_execution_delay are milliseconds.

build:dynamic-execution0 --local_cpu_resources=HOST_CPUS*0.75
build:dynamic-execution0 --local_ram_resources=HOST_RAM*0.75
build:dynamic-execution0 --internal_spawn_scheduler
build:dynamic-execution0 --spawn_strategy=dynamic

build:dynamic-execution1 --config=dynamic-execution0
build:dynamic-execution1 --legacy_spawn_scheduler=false

build:dynamic-execution2 --config=dynamic-execution1
build:dynamic-execution2 --experimental_local_lockfree_output

build:dynamic-execution3 --config=dynamic-execution2
build:dynamic-execution3 --experimental_local_execution_delay=1000
++ startup='--nosystem_rc --nohome_rc --noworkspace_rc --bazelrc=bazel.rc'
++ target=//:hello_world
++ jobs=--jobs=128
++ : dws-7910.corp.uber.com
++ remote=--remote_executor=grpc://dws-7910.corp.uber.com:8980
++ : ./bazel-3.7.0
+ ./bazel-3.7.0 --nosystem_rc --nohome_rc --noworkspace_rc --bazelrc=bazel.rc clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
+ ./bazel-3.7.0 --nosystem_rc --nohome_rc --noworkspace_rc --bazelrc=bazel.rc build --verbose_failures --jobs=128 --remote_executor=grpc://dws-7910.corp.uber.com:8980 --config=dynamic-execution1 //:hello_world
INFO: Invocation ID: d8ecc266-5f7d-492b-97be-51fb9bfcea40
Loading: 
Loading: 0 packages loaded
Analyzing: target //:hello_world (1 packages loaded, 0 targets configured)
INFO: Analyzed target //:hello_world (15 packages loaded, 51 targets configured).
INFO: Found 1 target...
[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt
WARNING: Reading from Remote Cache:
java.io.FileNotFoundException: /home/dws/.cache/bazel/_bazel_dws/7c75671892e56d02ca221ec93312907e/execroot/__main__/bazel-out/k8-fastbuild/bin/_objs/hello_world/hello_world.pic.d.tmp (No such file or directory)
	at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:185)
	at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:175)
	at com.google.devtools.build.lib.vfs.Path.stat(Path.java:418)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:454)
	at com.google.devtools.build.lib.remote.RemoteCache.moveOutputsToFinalLocation(RemoteCache.java:415)
	at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:374)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:183)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:129)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Internal error thrown during build. Printing stack trace: java.lang.AssertionError: stopBranch called more than once by local
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.stopBranch(DynamicSpawnStrategy.java:135)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$300(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.lambda$callImpl$0(DynamicSpawnStrategy.java:314)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.lockOutputFiles(AbstractSpawnStrategy.java:258)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:127)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:88)
	at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:473)
	at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

INFO: Elapsed time: 0.543s, Critical Path: 0.01s
INFO: 3 processes: 3 internal.
FAILED: Build did NOT complete successfully
Internal error thrown during build. Printing stack trace: java.lang.AssertionError: stopBranch called more than once by local
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.stopBranch(DynamicSpawnStrategy.java:135)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$300(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.lambda$callImpl$0(DynamicSpawnStrategy.java:314)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.lockOutputFiles(AbstractSpawnStrategy.java:258)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:127)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:88)
	at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:473)
	at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
java.lang.AssertionError: stopBranch called more than once by local
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.stopBranch(DynamicSpawnStrategy.java:135)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$300(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.lambda$callImpl$0(DynamicSpawnStrategy.java:314)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.lockOutputFiles(AbstractSpawnStrategy.java:258)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:127)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:88)
	at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:473)
	at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runLocally(DynamicSpawnStrategy.java:429)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$200(DynamicSpawnStrategy.java:69)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$1.callImpl(DynamicSpawnStrategy.java:311)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:522)
	at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:459)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
FAILED: Build did NOT complete successfully
@dws changed the title from "bazel dynamic execution failures" to "bazel dynamic execution failures with (at least) buildfarm" Oct 27, 2020
@dws
Contributor Author

dws commented Oct 27, 2020

@jmmv for visibility

@meisterT added the P1 (I'll work on this now; assignee required) and team-Performance (issues for Performance teams) labels Oct 28, 2020
@ulfjack
Contributor

ulfjack commented Oct 28, 2020

I can reproduce with EngFlow RE.

@ulfjack
Contributor

ulfjack commented Oct 28, 2020

The problem is the remote cache. The RemoteModule implicitly enables the remote cache when --remote_executor is set, which doesn't make any sense for dynamic execution:

    if (enableRemoteExecution && Strings.isNullOrEmpty(remoteOptions.remoteCache)) {
      remoteOptions.remoteCache = remoteOptions.remoteExecutor;
    }

I don't see a way to disable the remote cache with the existing flags. :-(

The longer explanation is that enabling dynamic execution like this attempts to run remote execution and local execution w/ remote cache in parallel, and that doesn't make sense.
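The implicit defaulting described above can be illustrated with a small standalone sketch. This is not Bazel code: the `Options` class is a hypothetical stand-in for `RemoteOptions`, the way `enableRemoteExecution` is derived here is an assumption, and the endpoint is a made-up example. It only shows that setting the executor endpoint alone leaves the cache endpoint populated as well.

```java
public class ImplicitRemoteCacheSketch {
    // Hypothetical stand-in for Bazel's RemoteOptions; only the two fields
    // that appear in the quoted snippet are modeled.
    static final class Options {
        String remoteExecutor;
        String remoteCache;
    }

    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // Mirrors the quoted condition: when remote execution is enabled and no
    // cache endpoint was given, the executor endpoint is reused as the cache.
    // Assumption: "enabled" simply means the executor endpoint is non-empty.
    static Options applyDefaults(Options o) {
        boolean enableRemoteExecution = !isNullOrEmpty(o.remoteExecutor);
        if (enableRemoteExecution && isNullOrEmpty(o.remoteCache)) {
            o.remoteCache = o.remoteExecutor;
        }
        return o;
    }

    public static void main(String[] args) {
        Options o = new Options();
        o.remoteExecutor = "grpc://buildfarm.example:8980"; // made-up endpoint
        applyDefaults(o);
        // The cache endpoint is now set even though the user never asked for it.
        System.out.println("remoteCache = " + o.remoteCache);
    }
}
```

With the cache populated this way, the local branch of dynamic execution ends up consulting the remote cache while the remote branch is executing the same action.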

@ulfjack
Contributor

ulfjack commented Oct 28, 2020

I think this was broken in 25e58ff. @coeuvre

@ulfjack
Contributor

ulfjack commented Oct 28, 2020

I can change AbstractSpawnStrategy:117 like this to avoid the error:

    SpawnCache cache = stopConcurrentSpawns != null ? null : actionExecutionContext.getContext(SpawnCache.class);

The question is whether that's correct.
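To make the effect of that one-line change concrete, here is a minimal standalone sketch of the selection logic. The `SpawnCache` and `StopConcurrentSpawns` interfaces below are hypothetical placeholders, not Bazel's real types, and the sketch assumes that a non-null `stopConcurrentSpawns` callback is what marks a spawn as running under dynamic execution.

```java
public class SpawnCacheSelectionSketch {
    // Hypothetical placeholder for Bazel's SpawnCache.
    interface SpawnCache {
        String name();
    }

    // Hypothetical placeholder for the callback a dynamic-execution branch
    // passes in so it can cancel the competing branch.
    interface StopConcurrentSpawns {
        void stop();
    }

    // Same shape as the proposed line: skip the cache entirely whenever the
    // spawn is one branch of a dynamic execution, otherwise use the cache
    // that the execution context provides.
    static SpawnCache selectCache(StopConcurrentSpawns stopConcurrentSpawns, SpawnCache contextCache) {
        return stopConcurrentSpawns != null ? null : contextCache;
    }

    public static void main(String[] args) {
        SpawnCache remote = () -> "remote";

        // Plain (non-dynamic) execution: the context's cache is used.
        SpawnCache plain = selectCache(null, remote);
        System.out.println("plain execution cache: " + (plain == null ? "none" : plain.name()));

        // Dynamic execution branch: no cache lookup happens at all.
        SpawnCache dynamic = selectCache(() -> {}, remote);
        System.out.println("dynamic branch cache: " + (dynamic == null ? "none" : dynamic.name()));
    }
}
```

The trade-off is visible in the second case: the dynamic branch loses every cache the context would have supplied, not just the implicitly enabled remote one.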

ulfjack added a commit to ulfjack/bazel that referenced this issue Oct 28, 2020
Fixes bazelbuild#12364.

Change-Id: Id664635041392f19710e7bc064b8c0b03111d5bc
@ulfjack
Contributor

ulfjack commented Oct 28, 2020

It is correct, but more by accident than by design. I was wondering about the combination of dynamic execution with local w/ disk cache and remote execution. However, this is an unsupported combination at this time and results in an error in RemoteModule (which also handles the disk cache).

ulfjack added a commit to ulfjack/bazel that referenced this issue Oct 28, 2020
Fixes bazelbuild#12364.

Change-Id: Id664635041392f19710e7bc064b8c0b03111d5bc
@ulfjack
Contributor

ulfjack commented Oct 28, 2020

I have a potential fix here: ulfjack@65d308d

Btw. I diagnosed this by changing DynamicSpawnStrategy.stopBranch to print a stack trace in the successful case, which immediately implicated the RemoteCache.
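A similar diagnostic can be sketched outside Bazel. The class below is a hypothetical illustration, not `DynamicSpawnStrategy` itself, and it uses a slightly different technique from the one described above: instead of printing a stack trace on the first (successful) call, it records that trace and attaches it as the cause of the assertion raised by the second call, so both call sites show up in one report.

```java
import java.util.concurrent.atomic.AtomicReference;

public class StopBranchTraceSketch {
    // Holds a Throwable captured at the first call; its stack trace shows
    // who got there first.
    private final AtomicReference<Throwable> firstCall = new AtomicReference<>();

    // Hypothetical once-only guard. The first caller wins; any later caller
    // fails with an AssertionError whose cause is the first caller's trace.
    void stopBranch(String caller) {
        Throwable here = new Throwable("first stopBranch call by " + caller);
        if (!firstCall.compareAndSet(null, here)) {
            throw new AssertionError(
                "stopBranch called more than once by " + caller, firstCall.get());
        }
    }

    public static void main(String[] args) {
        StopBranchTraceSketch guard = new StopBranchTraceSketch();
        guard.stopBranch("local"); // e.g. reached through a cache lookup
        try {
            guard.stopBranch("local"); // e.g. reached again through the local runner
        } catch (AssertionError e) {
            // Printing the error also prints the "Caused by" trace of the first call.
            e.printStackTrace();
        }
    }
}
```

In the failure logged in this issue, the first trace passes through `RemoteSpawnCache.lookup` and the second through `AbstractSandboxSpawnRunner.runSpawn`, both on the local branch, which is the kind of pairing such a guard would surface.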

ulfjack added a commit to ulfjack/bazel that referenced this issue Oct 28, 2020
When dynamic execution is active, disable the spawn cache. The RemoteModule
always provides a remote spawn cache when remote execution is active. However,
when dynamic execution is active, we don't want to use remote execution *and*
local execution with remote caching.

In fact, trying to use both leads to an assertion error in the new dynamic
execution implementation.

The downside of this change is that it no longer allows us to use the remote
cache for actions that disallow remote execution. It also doesn't allow using
the disk cache for the local branch (although that is also currently not
allowed in the RemoteModule).

Fixes bazelbuild#12364.

Change-Id: I5aa192143604b01a65d68ac73f1d363957503396
@ulfjack
Contributor

ulfjack commented Oct 28, 2020

An alternative fix is to disable the spawn cache in the DynamicExecutionModule:
ulfjack@aaa8f52

Neither of these fixes is ideal. The entire class hierarchy isn't ideal.

@coeuvre
Member

coeuvre commented Nov 9, 2020

I think this was broken in 25e58ff.

Yes, remote execution implicitly enables the remote cache. IMHO, dynamic execution should disable the remote cache for local execution in this case.

@meisterT
Member

Lars has prepared a fix in https://bazel-review.googlesource.com/c/bazel/+/145250

@ulfjack
Contributor

ulfjack commented Nov 11, 2020

Thanks, @larsrc-google!

@philwo philwo mentioned this issue Nov 19, 2020
10 tasks
katre pushed a commit that referenced this issue Nov 19, 2020
Fixes #12364 (caused by 25e58ff) based on Ulf's approach, but avoiding secretly disabling some potential caches that should not be disabled.

RELNOTES: n/a
PiperOrigin-RevId: 341588598
ulfjack pushed a commit to EngFlow/bazel that referenced this issue Mar 5, 2021
Fixes bazelbuild#12364 (caused by bazelbuild@25e58ff) based on Ulf's approach, but avoiding secretly disabling some potential caches that should not be disabled.

RELNOTES: n/a
PiperOrigin-RevId: 341588598
6 participants