
Reconnection may not have retry capacity #18764

Closed
werkt opened this issue Jun 24, 2023 · 5 comments
Labels: help wanted (Someone outside the Bazel team could own this), P2 (We'll consider working on this in future; assignee optional), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

Comments

werkt (Contributor) commented Jun 24, 2023

Description of the bug:

A build failure occurs when GrpcCacheClient cannot reach the remote endpoint within the SSL handshake timeout. The local exception is wrapped in a gRPC UNAVAILABLE StatusRuntimeException with a silent inner request error handler, bounded by the downloadBlobs future aggregation.

(01:48:39) ERROR: /<redacted>/BUILD:41:9: Compiling REDACTED <redacted> failed: (Exit 34): io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
com.google.devtools.build.lib.remote.common.BulkTransferException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
  at com.google.devtools.build.lib.remote.util.Utils.waitForBulkTransfer(Utils.java:594)
  at com.google.devtools.build.lib.remote.RemoteExecutionService.downloadOutputs(RemoteExecutionService.java:1120)
  at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:413)
  at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:218)
  at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:301)
  at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:152)
  at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:112)
  at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
  at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:64)
  at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:352)
  at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:957)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1124)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
  at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:160)
  at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:93)
  at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:516)
  at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:827)
  at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:323)
  at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:161)
  at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:571)
  at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.base/java.lang.Thread.run(Unknown Source)
  Suppressed: java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$downloadBlob$18(GrpcCacheClient.java:346)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:203)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:104)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:851)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:214)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:190)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:142)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:257)
    at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
    at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
    at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
    at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1124)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
    at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:422)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    ... 3 more
  Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
    at io.grpc.Status.asRuntimeException(Status.java:535)
    ... 14 more
  Caused by: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
    at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2113)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more

Evidence for the lack of retry is circumstantial: unavailability of the server for more than 10s * 5 would have been unlikely, and if it was caused by a severed connection in time-wait, I would have expected a reconnection attempt after the first failure. I observed this stack trace with a silently accepting remote endpoint starting from the Capabilities check (which also failed to retry).
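
For context on why a retry was expected here: a gRPC retrier typically decides retriability by looking for a StatusRuntimeException in the failure's cause chain and checking its status code, with UNAVAILABLE being the canonical retriable case. A minimal sketch of that kind of check (illustrative only, not Bazel's actual Retrier):

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class RetriableSketch {
  /** Illustrative retriability check: retry only when a retriable gRPC status is visible. */
  static boolean isRetriable(Throwable failure) {
    for (Throwable cause = failure; cause != null; cause = cause.getCause()) {
      if (cause instanceof StatusRuntimeException) {
        Status.Code code = ((StatusRuntimeException) cause).getStatus().getCode();
        return code == Status.Code.UNAVAILABLE || code == Status.Code.DEADLINE_EXCEEDED;
      }
    }
    // No gRPC status anywhere in the chain: conservatively treated as not retriable.
    return false;
  }
}
```

Against a chain like the one above (UNAVAILABLE caused by the handshake timeout), a check of this shape would say retry, so the observed single failure suggests the status is being hidden or discarded before the retrier sees it.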

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Very little is easy here: a server would have to lose its connection after startup and be replaced with a silently accepting one in the window before a download began.

Minimum spec: grpcs (not plaintext grpc), a remote_cache (e.g. --remote_cache=grpcs://<host>, to exercise downloadBlob), and hitting the remote cache for an action and downloading its results.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

6.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

werkt (Contributor, Author) commented Jun 24, 2023

Confirmed. The SslHandshakeTimeoutException is continuously regenerated for a given connection, without the elapsed time that would indicate it was actually timing out again on each retry. Additionally, the unwrapping that happens in ReferenceCountedChannel discards the retriable UNAVAILABLE SRE, leaving a single chance for a server to respond within 10s to a build client on RemoteModule startup under TLS.
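
To illustrate the unwrapping problem described in the comment above, here is a minimal, self-contained sketch (hypothetical code, not the actual ReferenceCountedChannel logic): once the StatusRuntimeException is stripped down to its IOException cause, a status lookup over what remains no longer reports UNAVAILABLE.

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.io.IOException;

final class UnwrapSketch {
  public static void main(String[] args) {
    // What the channel actually observed: UNAVAILABLE caused by the SSL handshake timeout.
    IOException handshakeTimeout = new IOException("handshake timed out after 10000ms");
    StatusRuntimeException sre =
        Status.UNAVAILABLE.withCause(handshakeTimeout).asRuntimeException();

    // Unwrapping to the IOException cause discards the gRPC status...
    Throwable unwrapped = sre.getCause();

    // ...so a status lookup over the remaining chain finds nothing retriable.
    System.out.println(Status.fromThrowable(sre).getCode());       // UNAVAILABLE: typically retried
    System.out.println(Status.fromThrowable(unwrapped).getCode()); // UNKNOWN: typically treated as fatal
  }
}
```

That matches the observed behavior: the handshake failure is treated as fatal on the first occurrence instead of being retried.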

werkt (Contributor, Author) commented Jun 24, 2023

@coeuvre ^

Pavank1992 added the team-Remote-Exec label Jun 26, 2023
zhengwei143 added the P2 label and removed the untriaged label Jun 27, 2023
coeuvre (Member) commented Jun 27, 2023

Since you can reliably reproduce this error and it sounds easy to fix, would you mind sending us a PR?

coeuvre added the help wanted label Jun 27, 2023
werkt (Contributor, Author) commented Jun 27, 2023

Since you can reliably reproduce this error and it sounds easy to fix, would you mind sending us a PR?

Happy to. The solution will come in two parts: the unwrap fix, then whatever is causing the same error from the socket. I've only got the first so far.

copybara-service bot pushed a commit that referenced this issue Jul 6, 2023
Unwrapping all StatusRuntimeExceptions in ReferenceCountedChannel when caused by IOException will discard critical tracing and retriability. The Retrier evaluations may not see an SRE in the causal chain, and will presume it is invariably an unretriable exception. In general, IOExceptions as SRE wrappers are unsuitable containers and are routinely misused either for identification (grpc aware status) or capture (handleInitError).

Partially addresses #18764 (retries will occur with SSL handshake timeout, but the actual connection will not be retried)

Closes #18836.

PiperOrigin-RevId: 546037698
Change-Id: I7f6efcb857c557aa97ad3df085fc032c8538eb9a
werkt added a commit to werkt/bazel that referenced this issue Jul 17, 2023
werkt added a commit to werkt/bazel that referenced this issue Jul 17, 2023
werkt (Contributor, Author) commented Jul 19, 2023

The associated #18836 fixes this enough to get the retry in place. The remaining issue is that grpc-java appears to be doing interesting things after it receives the first handshake timeout: the same exception is thrown repeatedly while no attempt to reconnect occurs, which means the retries aren't given a fair shake after the first timeout. It will require more investigation, but it is no longer a bug in Bazel. Closing.

werkt closed this as completed Jul 19, 2023
werkt added a commit to werkt/bazel that referenced this issue Jul 24, 2023
keertk pushed a commit that referenced this issue Jul 25, 2023
* Include stack trace in all gRPC errors when --verbose_failures is set.

Also refactor a couple places where the stack trace was included in an ad-hoc
manner, and force Utils.grpcAwareErrorMessage callers to be explicit to avoid
future instances.

Closes #16086.

PiperOrigin-RevId: 502854490
Change-Id: Id2d6a1728630fffea9399b406378c7f173b247bd

* Avoid discarding SRE state for IO cause

Unwrapping all StatusRuntimeExceptions in ReferenceCountedChannel when caused by IOException will discard critical tracing and retriability. The Retrier evaluations may not see an SRE in the causal chain, and will presume it is invariably an unretriable exception. In general, IOExceptions as SRE wrappers are unsuitable containers and are routinely misused either for identification (grpc aware status) or capture (handleInitError).

Partially addresses #18764 (retries will occur with SSL handshake timeout, but the actual connection will not be retried)

Closes #18836.

PiperOrigin-RevId: 546037698
Change-Id: I7f6efcb857c557aa97ad3df085fc032c8538eb9a

* Operation stream termination is not an error

According to the GrpcRemoteExecutor, termination after a !done operation response is not an error. Remove the error from the ExperimentalRemoteGrpcExecutor and reinforce both with tests.

Update the FakeExecutionService to generate compatible error responses that appear in the ExecuteResponse, rather than in the operation error field, per the REAPI spec. Make the required adjustments to the ExperimentalRemoteGrpcExecutor test invocations to avoid the ExecutionStatusException interpretation of DEADLINE_EXCEEDED -> FAILED_PRECONDITION in ExecuteResponse.

Closes #18785.

PiperOrigin-RevId: 546925894
Change-Id: I7a489c8bc936a83cfd94e0138437f3fe6d152da8

* Done operations must be reexecuted

Any operation with done == true as reported by the server is not expected to change its result on subsequent waitExecution calls. To properly retry, the action must be reexecuted, if the failure was truly transient, to achieve a definitive result. Submit a transient status for retry, disallow special behaviors for NOT_FOUND as covered by the done observation, and consider the method type when handling operation streams.

Closes #18943.

PiperOrigin-RevId: 548680656
Change-Id: Ib2c9887ead1fbd3de97761db6e8b4077783ad03c

---------

Co-authored-by: Tiago Quelhas <tjgq@google.com>
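
As a sketch of the "Operation stream termination is not an error" behavior referenced in the commit message above (this is not Bazel's GrpcRemoteExecutor; the stub wiring and request construction from the remote-apis Java bindings are assumed and elided), the client resumes with WaitExecution when the Execute stream ends before the operation is done, and treats a done operation as final:

```java
import build.bazel.remote.execution.v2.ExecuteRequest;
import build.bazel.remote.execution.v2.ExecutionGrpc.ExecutionBlockingStub;
import build.bazel.remote.execution.v2.WaitExecutionRequest;
import com.google.longrunning.Operation;
import java.util.Iterator;

final class ExecuteLoopSketch {
  static Operation executeToCompletion(ExecutionBlockingStub stub, ExecuteRequest request) {
    Iterator<Operation> stream = stub.execute(request);
    Operation last = null;
    while (true) {
      while (stream.hasNext()) {
        last = stream.next();
        if (last.getDone()) {
          // A done operation is final; a transient failure in its response means
          // re-executing the action, not calling WaitExecution again.
          return last;
        }
      }
      if (last == null) {
        // Stream ended before any operation was named; start over with Execute.
        stream = stub.execute(request);
        continue;
      }
      // Stream terminated without done == true: not an error, keep waiting.
      stream =
          stub.waitExecution(WaitExecutionRequest.newBuilder().setName(last.getName()).build());
    }
  }
}
```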