Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BES with --bes_upload_mode=fully_async uses cancelled Context when uploading to CAS and errors #11392

Closed
illicitonion opened this issue May 14, 2020 · 5 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@illicitonion
Copy link
Contributor

Description of the problem / feature request:

Using both --bes_upload_mode=fully_async and --remote_executor reliably causes a failed RPC (for all but the first invocation), which gets logged as:

WARNING: The background upload of the Build Event Protocol for the previous invocation failed with the following exception: 'com.google.devtools.build.lib.util.AbruptExitException: Unable to write all BEP events to file due to 'java.io.IOException: findMissingBlobs(1) for bes-upload''. Ignoring the failure and starting a new invocation...

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

$ touch WORKSPACE
$ cat >BUILD <<EOF
genrule(
    name = "gr",
    cmd = "",
    outs = ["out"],
)
EOF
$ bazel --nohome_rc --nosystem_rc aquery //:gr --build_event_json_file=/tmp/events.json --bes_upload_mode=fully_async --remote_executor=grpc://localhost:8080 --remote_instance_name=remote-execution
$ bazel --nohome_rc --nosystem_rc aquery //:gr --build_event_json_file=/tmp/events.json --bes_upload_mode=fully_async --remote_executor=grpc://localhost:8080 --remote_instance_name=remote-execution

What operating system are you running Bazel on?

OSX

What's the output of bazel info release?

Tried with both release 3.1.0 and

$ USE_BAZEL_VERSION=last_green bazelisk info release                                                                                                                                                        
2020/05/14 14:11:46 Using unreleased version at commit a6e1d2e43abf2d0d330991949fad882c197a639a
INFO: Invocation ID: 82e739c6-49d8-46f4-94fe-f3fdffd2f8f6
development version

Any other information, logs, or outputs that you want to share?

This appears to be because something is cancelling the gRPC Context used here for a FindMissingBlobs for the BES upload.

Note that this bug also manifests if you're writing to a remote BES, not just a local JSON file - the local JSON file just seemed like a more minimal case to investigate.

This RPC adheres to the --remote_retries flag, so if that value is set greater than a small number, this upload will be retried in a loop until a 5 second timeout is hit and this will be logged:

WARNING: The background upload of the Build Event Protocol for the previous invocation failed to complete in 5.004 seconds. Cancelling and starting a new invocation...

The --fully_async flag is presumably intended to speed things up, so blocking the following command for a full 5 seconds while it tries to finish up the previous build is counter-productive.

It looks like something needs to change around either how Context lifetimes are managed (to not cancel them until background requests are complete), or about which Context these background requests happen in, but I'm not sure which approach would be better (or where the code which should manage these lifetimes should live). I'd appreciate pointers (or fixes!) :)

@jin jin added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc untriaged labels May 15, 2020
bazel-io pushed a commit that referenced this issue May 18, 2020
This was helpful in debugging #11392

Closes #11396.

PiperOrigin-RevId: 312077991
@janakdr janakdr added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-Core Skyframe, bazel query, BEP, options parsing, bazelrc labels Sep 21, 2020
@janakdr
Copy link
Contributor

janakdr commented Sep 21, 2020

I'm guessing this is actually a bug with the remote worker, since it requires the remote_executor option to trigger it. Passing over to remote-exec team.

@ckhamly
Copy link

ckhamly commented Nov 16, 2020

Did you guys figure it out the issue? I am having the same problem when I enable --bes_backend and --remote_cache options together. My BES server gets the following

Received gRPC status code: Canceled message: context canceled

while bazel gets this

WARNING: The background upload of the Build Event Protocol for the previous invocation failed with the following exception: 'com.google.devtools.build.lib.util.AbruptExitException: The Build Event Protocol local file upload failed: findMissingBlobs(1) for bes-upload: CANCELLED: io.grpc.Context was cancelled without error'. Ignoring the failure and starting a new invocation...

@coeuvre coeuvre self-assigned this Nov 17, 2020
@coeuvre coeuvre added type: bug P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Nov 17, 2020
@amkartashov
Copy link

I guess the problem is that remote cache client and build events uploader are sharing the same context somehow. See https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java#L498

Trace I see in java.log if using remote_cache together with --bes_backend and --bes_upload_mode=fully_async OR --bes_upload_mode=nowait_for_upload_to_complete:

210315 15:54:53.243:WT 34 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.waitForUploads] Failed to upload files referenced by build event: java.io.IOException: findMissingBlobs(1) for bes-upload: CANCELLED: io.grpc.Context was cancelled without error
java.util.concurrent.ExecutionException: java.io.IOException: findMissingBlobs(1) for bes-upload: CANCELLED: io.grpc.Context was cancelled without error
        at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545)
        at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88)
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.waitForUploads(BuildEventServiceUploader.java:671)
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents(BuildEventServiceUploader.java:446)
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.run(BuildEventServiceUploader.java:316)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: findMissingBlobs(1) for bes-upload: CANCELLED: io.grpc.Context was cancelled without error
        at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$findMissingDigests$1(GrpcCacheClient.java:213)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:192)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:179)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:124)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:760)
        at com.google.common.util.concurrent.AggregateFuture.handleException(AggregateFuture.java:197)
        at com.google.common.util.concurrent.AggregateFuture.collectValueFromNonCancelledFuture(AggregateFuture.java:251)
        at com.google.common.util.concurrent.AggregateFuture.access$100(AggregateFuture.java:42)
        at com.google.common.util.concurrent.AggregateFuture$1.run(AggregateFuture.java:139)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:800)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:203)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:179)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:800)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:203)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:179)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:760)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413)
        at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at io.grpc.internal.ClientCallImpl$1CloseInContext.runInContext(ClientCallImpl.java:403)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        ... 1 more
Caused by: io.grpc.StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error
        at io.grpc.Status.asRuntimeException(Status.java:533)
        ... 9 more
210315 15:54:53.244:I 34 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Publish interrupt. Showing up to 30 items from queues: ack_queue_size: 2, ack_queue: [SendRegularBuildEventCommand{seq_num=21} - [com.google.devtools.build.lib.buildeventstream.ProgressEvent@3f0509cc], SendRegularBuildEventCommand{seq_num=22} - [com.google.devtools.build.lib.runtime.NamedArtifactGroup@4179927c]], event_queue_size: 7, event_queue: [SendRegularBuildEventCommand{seq_num=23} - [com.google.devtools.build.lib.analysis.TargetCompleteEvent@5f869cf8], SendRegularBuildEventCommand{seq_num=24} - [com.google.devtools.build.lib.buildtool.buildevent.BuildCompleteEvent@13f6f5c3], SendRegularBuildEventCommand{seq_num=25} - [com.google.devtools.build.lib.metrics.BuildMetricsEvent@4e1ecfb4], SendRegularBuildEventCommand{seq_num=26} - [com.google.devtools.build.lib.buildeventstream.ProgressEvent@1e916fb3], SendRegularBuildEventCommand{seq_num=27} - [com.google.devtools.build.lib.buildeventstream.LastBuildEvent@7688363d], SendLastBuildEventCommand{seq_num=28}, AckReceivedCommand{seq_num=21}]
210315 15:54:53.245:I 34 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] About to cancel all local file uploads
210315 15:54:53.245:I 34 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Spent 0 milliseconds doing local file upload cancelation
210315 15:54:53.597:X 34 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.logAndSetException] The Build Event Protocol local file upload failed:

Note: trace appears in log file as soon as I invoke any next bazel command (build or run or shutdown).

@brentleyjones
Copy link
Contributor

brentleyjones commented Sep 13, 2021

The original repo case is fixed in 4.1.0+. I only get the following warning on 4.0.0 and 3.x:

WARNING: The background upload of the Build Event Protocol for the previous invocation failed with the following exception: 'com.google.devtools.build.lib.util.AbruptExitException: Unable to write all BEP events to file due to 'java.io.IOException: findMissingBlobs(1) for bes-upload: CANCELLED: io.grpc.Context was cancelled without error''. Ignoring the failure and starting a new invocation...

@coeuvre
Copy link
Member

coeuvre commented Sep 14, 2021

Closing since I think this is fixed by recent improvements.

@coeuvre coeuvre closed this as completed Sep 14, 2021
luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    This was helpful in debugging bazelbuild/bazel#11392

    Closes #11396.

    PiperOrigin-RevId: 312077991
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

8 participants