
Deadlock in grpc due to recursive grpc call #2515

Closed
garrettjonesgoogle opened this issue Dec 14, 2016 · 9 comments

@garrettjonesgoogle (Contributor) commented Dec 14, 2016

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.0.1

What JVM are you using (java -version)?

openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102)
OpenJDK 64-Bit Server VM (build 25.102-b01, mixed mode)

What did you do?

If possible, provide a recipe for reproducing the error.

  • Turned on FINE logging
  • Used the Logging client in google-cloud-java

What did you expect to see?

Not what I saw below...

What did you see instead?

Deadlock.

Found one Java-level deadlock:
=============================
"grpc-default-worker-ELG-1-1":
  waiting to lock monitor 0x00007fe628006318 (object 0x00000005cf4cead0, a java.lang.Object),
  which is held by "main"
"main":
  waiting to lock monitor 0x00007fe628002d78 (object 0x00000005ce107468, a java.lang.Object),
  which is held by "pool-1-thread-1"
"pool-1-thread-1":
  waiting to lock monitor 0x00007fe628006318 (object 0x00000005cf4cead0, a java.lang.Object),
  which is held by "main"

Java stack information for the threads listed above:
===================================================
"grpc-default-worker-ELG-1-1":
        at io.grpc.internal.DelayedClientTransport.newStream(DelayedClientTransport.java:121)
        - waiting to lock <0x00000005cf4cead0> (a java.lang.Object)
        at io.grpc.internal.ClientCallImpl.start(ClientCallImpl.java:214)
        at io.grpc.auth.ClientAuthInterceptor$1.checkedStart(ClientAuthInterceptor.java:104)
        at io.grpc.ClientInterceptors$CheckedForwardingClientCall.start(ClientInterceptors.java:195)
        at io.grpc.ForwardingClientCall.start(ForwardingClientCall.java:47)
        at com.google.api.gax.grpc.HeaderInterceptor$1.start(HeaderInterceptor.java:62)
        at io.grpc.stub.ClientCalls.startCall(ClientCalls.java:273)
        at io.grpc.stub.ClientCalls.asyncUnaryRequestCall(ClientCalls.java:252)
        at io.grpc.stub.ClientCalls.futureUnaryCall(ClientCalls.java:189)
        at com.google.api.gax.grpc.DirectCallable.futureCall(DirectCallable.java:60)
        at com.google.api.gax.grpc.ReentranceDetectingCallable.futureCall(ReentranceDetectingCallable.java:56)
        at com.google.api.gax.grpc.ExceptionTransformingCallable.futureCall(ExceptionTransformingCallable.java:64)
        at com.google.api.gax.grpc.RetryingCallable$RetryingResultFuture.issueCall(RetryingCallable.java:220)
        - locked <0x00000005cf8472e8> (a java.lang.Object)
        at com.google.api.gax.grpc.RetryingCallable.futureCall(RetryingCallable.java:88)
        at com.google.api.gax.grpc.UnaryCallable.futureCall(UnaryCallable.java:238)
        at com.google.api.gax.grpc.UnaryCallable.futureCall(UnaryCallable.java:249)
        at com.google.cloud.logging.spi.DefaultLoggingRpc.write(DefaultLoggingRpc.java:202)
        at com.google.cloud.logging.LoggingImpl.writeAsync(LoggingImpl.java:504)
        at com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:499)
        at com.google.cloud.logging.LoggingHandler.write(LoggingHandler.java:382)
        at com.google.cloud.logging.LoggingHandler.flush(LoggingHandler.java:407)
        at com.google.cloud.logging.LoggingHandler.publish(LoggingHandler.java:318)
        at java.util.logging.Logger.log(Logger.java:738)
        at io.netty.util.internal.logging.JdkLogger.log(JdkLogger.java:606)
        at io.netty.util.internal.logging.JdkLogger.debug(JdkLogger.java:186)
        at io.netty.util.internal.logging.AbstractInternalLogger.log(AbstractInternalLogger.java:147)
        at io.netty.handler.codec.http2.Http2FrameLogger.log(Http2FrameLogger.java:183)
        at io.netty.handler.codec.http2.Http2FrameLogger.logSettings(Http2FrameLogger.java:111)
        at io.netty.handler.codec.http2.Http2OutboundFrameLogger.writeSettings(Http2OutboundFrameLogger.java:81)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeSettings(DefaultHttp2ConnectionEncoder.java:246)
        at io.netty.handler.codec.http2.DecoratingHttp2FrameWriter.writeSettings(DecoratingHttp2FrameWriter.java:70)
        at io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.sendPreface(Http2ConnectionHandler.java:324)
        at io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.channelActive(Http2ConnectionHandler.java:224)
        at io.netty.handler.codec.http2.Http2ConnectionHandler.channelActive(Http2ConnectionHandler.java:363)
        at io.grpc.netty.AbstractNettyHandler.channelActive(AbstractNettyHandler.java:78)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:223)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:209)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:202)
        at io.netty.handler.ssl.SslHandler.channelActive(SslHandler.java:1398)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:223)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:209)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:202)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelActive(DefaultChannelPipeline.java:1322)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:223)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:209)
        at io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:902)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:305)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:335)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:588)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:512)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:426)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:398)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:877)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:745)
"main":
        at io.grpc.internal.InUseStateAggregator.updateObjectInUse(InUseStateAggregator.java:51)
        - waiting to lock <0x00000005ce107468> (a java.lang.Object)
        at io.grpc.internal.TransportSet$BaseTransportListener.transportInUse(TransportSet.java:373)
        at io.grpc.internal.DelayedClientTransport.newStream(DelayedClientTransport.java:129)
        - locked <0x00000005cf4cead0> (a java.lang.Object)
        at io.grpc.internal.ClientCallImpl.start(ClientCallImpl.java:214)
        at io.grpc.auth.ClientAuthInterceptor$1.checkedStart(ClientAuthInterceptor.java:104)
        at io.grpc.ClientInterceptors$CheckedForwardingClientCall.start(ClientInterceptors.java:195)
        at io.grpc.ForwardingClientCall.start(ForwardingClientCall.java:47)
        at com.google.api.gax.grpc.HeaderInterceptor$1.start(HeaderInterceptor.java:62)
        at io.grpc.stub.ClientCalls.startCall(ClientCalls.java:273)
        at io.grpc.stub.ClientCalls.asyncUnaryRequestCall(ClientCalls.java:252)
        at io.grpc.stub.ClientCalls.futureUnaryCall(ClientCalls.java:189)
        at com.google.api.gax.grpc.DirectCallable.futureCall(DirectCallable.java:60)
        at com.google.api.gax.grpc.ReentranceDetectingCallable.futureCall(ReentranceDetectingCallable.java:56)
        at com.google.api.gax.grpc.ExceptionTransformingCallable.futureCall(ExceptionTransformingCallable.java:64)
        at com.google.api.gax.grpc.RetryingCallable$RetryingResultFuture.issueCall(RetryingCallable.java:220)
        - locked <0x00000005cd8b7740> (a java.lang.Object)
        at com.google.api.gax.grpc.RetryingCallable.futureCall(RetryingCallable.java:88)
        at com.google.api.gax.grpc.UnaryCallable.futureCall(UnaryCallable.java:238)
        at com.google.api.gax.grpc.UnaryCallable.futureCall(UnaryCallable.java:249)
        at com.google.cloud.logging.spi.DefaultLoggingRpc.write(DefaultLoggingRpc.java:202)
        at com.google.cloud.logging.LoggingImpl.writeAsync(LoggingImpl.java:504)
        at com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:499)
        at com.google.cloud.logging.LoggingHandler.write(LoggingHandler.java:382)
        at com.google.cloud.logging.LoggingHandler.flush(LoggingHandler.java:407)
        at com.google.cloud.logging.LoggingHandler.publish(LoggingHandler.java:318)
        at java.util.logging.Logger.log(Logger.java:738)
        at java.util.logging.Logger.doLog(Logger.java:765)
        at java.util.logging.Logger.log(Logger.java:788)
        at java.util.logging.Logger.info(Logger.java:1490)
        at com.google.cloud.runtimes.jetty9.StackDriverLogging.init(StackDriverLogging.java:37)
        at com.google.cloud.runtimes.jetty9.StackDriverLogging.main(StackDriverLogging.java:64)
"pool-1-thread-1":
        at io.grpc.internal.DelayedClientTransport.newStream(DelayedClientTransport.java:121)
        - waiting to lock <0x00000005cf4cead0> (a java.lang.Object)
        at io.grpc.internal.ClientCallImpl.start(ClientCallImpl.java:214)
        at io.grpc.auth.ClientAuthInterceptor$1.checkedStart(ClientAuthInterceptor.java:104)
        at io.grpc.ClientInterceptors$CheckedForwardingClientCall.start(ClientInterceptors.java:195)
        at io.grpc.ForwardingClientCall.start(ForwardingClientCall.java:47)
        at com.google.api.gax.grpc.HeaderInterceptor$1.start(HeaderInterceptor.java:62)
        at io.grpc.stub.ClientCalls.startCall(ClientCalls.java:273)
        at io.grpc.stub.ClientCalls.asyncUnaryRequestCall(ClientCalls.java:252)
        at io.grpc.stub.ClientCalls.futureUnaryCall(ClientCalls.java:189)
        at com.google.api.gax.grpc.DirectCallable.futureCall(DirectCallable.java:60)
        at com.google.api.gax.grpc.ReentranceDetectingCallable.futureCall(ReentranceDetectingCallable.java:56)
        at com.google.api.gax.grpc.ExceptionTransformingCallable.futureCall(ExceptionTransformingCallable.java:64)
        at com.google.api.gax.grpc.RetryingCallable$RetryingResultFuture.issueCall(RetryingCallable.java:220)
        - locked <0x00000005cfbce9e0> (a java.lang.Object)
        at com.google.api.gax.grpc.RetryingCallable.futureCall(RetryingCallable.java:88)
        at com.google.api.gax.grpc.UnaryCallable.futureCall(UnaryCallable.java:238)
        at com.google.api.gax.grpc.UnaryCallable.futureCall(UnaryCallable.java:249)
        at com.google.cloud.logging.spi.DefaultLoggingRpc.write(DefaultLoggingRpc.java:202)
        at com.google.cloud.logging.LoggingImpl.writeAsync(LoggingImpl.java:504)
        at com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:499)
        at com.google.cloud.logging.LoggingHandler.write(LoggingHandler.java:382)
        at com.google.cloud.logging.LoggingHandler.flush(LoggingHandler.java:407)
        at com.google.cloud.logging.LoggingHandler.publish(LoggingHandler.java:318)
        at java.util.logging.Logger.log(Logger.java:738)
        at java.util.logging.Logger.doLog(Logger.java:765)
        at java.util.logging.Logger.log(Logger.java:851)
        at io.grpc.internal.TransportSet.startNewTransport(TransportSet.java:217)
        at io.grpc.internal.TransportSet.obtainActiveTransport(TransportSet.java:192)
        - locked <0x00000005ce107468> (a java.lang.Object)
        at io.grpc.internal.ManagedChannelImpl$3.getTransport(ManagedChannelImpl.java:637)
        at io.grpc.internal.ManagedChannelImpl$3.getTransport(ManagedChannelImpl.java:579)
        at io.grpc.DummyLoadBalancerFactory$DummyLoadBalancer$1.get(DummyLoadBalancerFactory.java:135)
        at io.grpc.internal.DelayedClientTransport$2.run(DelayedClientTransport.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:295)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Background: I have been trying various strategies to resolve googleapis/google-cloud-java#1386 , where using the Logging service at level FINE results in grpc logging to the Logging service in a recursive way. I tried using a ThreadLocal to prevent this, but this doesn't work with grpc because the actual call is executed on a worker thread. Essentially I think I need some way to bail out of the LoggingHandler.publish call if I can detect that this is in the scope of a grpc worker thread sending a request to the Logging service.
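The failure mode of the ThreadLocal attempt can be reduced to a self-contained sketch (class and field names here are hypothetical, not the actual google-cloud-java code): a guard set on the application thread is simply invisible to the worker thread that ends up issuing the RPC, so the re-entrance check never fires where it matters.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadLocalGuardDemo {
    // Hypothetical re-entrance guard, set around the logging call.
    static final ThreadLocal<Boolean> IN_PUBLISH = new ThreadLocal<>();

    // Returns {visibleOnCaller, visibleOnWorker}.
    static boolean[] checkVisibility() throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            IN_PUBLISH.set(true); // guard set on the "application" thread
            boolean onCaller = Boolean.TRUE.equals(IN_PUBLISH.get());
            // The worker thread has its own ThreadLocal slot, so it sees null.
            Future<Boolean> onWorker =
                worker.submit(() -> Boolean.TRUE.equals(IN_PUBLISH.get()));
            return new boolean[] { onCaller, onWorker.get() };
        } finally {
            IN_PUBLISH.remove();
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] v = checkVisibility();
        System.out.println("caller sees guard: " + v[0]); // true
        System.out.println("worker sees guard: " + v[1]); // false
    }
}
```

This is why a cross-thread propagation mechanism (such as io.grpc.Context, discussed below in the thread) is needed rather than a bare ThreadLocal.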

@carl-mastrangelo (Contributor)

This may be fixed at head; this code has traditionally been deadlock-prone and is currently being rewritten.

cc: @zhangkun83

@garrettjonesgoogle (Contributor, Author)

Well, the deadlock is one issue, but for my purposes, it's just a symptom. My core need is the ability to avoid sending a log to the logging service when LoggingHandler.publish is called in a grpc worker thread. Do you have any suggestions for that?

@zhangkun83 (Contributor)

Symptom-wise, pool-1-thread-1 first acquires the channel lock 0x00000005ce107468, then acquires the delayed transport lock 0x00000005cf4cead0, which violates the lock order transport lock > delayed transport lock > channel lock, thus causing the deadlock.

The reason for the reversed lock order is that the channel calls the Java logger while holding the channel lock, and the logger unexpectedly starts an RPC. IIUC, even if there were no deadlock, it would be an infinite recursion instead. Off the top of my head, you probably want to remove the RPC-based log handlers from io.grpc and io.netty.
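The lock-order inversion described above can be reduced to a self-contained sketch (the lock names are illustrative, not gRPC's actual fields). Two threads each hold one lock and try to take the other in the opposite order; tryLock with a timeout stands in for the real monitors so the demo reports the circular wait instead of hanging:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {
    static final ReentrantLock channelLock = new ReentrantLock();
    static final ReentrantLock delayedTransportLock = new ReentrantLock();

    // Returns true if the inverted acquisition order produced a circular wait.
    static boolean demonstrateInversion() throws Exception {
        CountDownLatch bothHeld = new CountDownLatch(2);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Like "main": delayed-transport lock first, then channel lock.
            Callable<Boolean> first = () -> {
                delayedTransportLock.lock();
                try {
                    bothHeld.countDown();
                    bothHeld.await();
                    boolean got = channelLock.tryLock(200, TimeUnit.MILLISECONDS);
                    if (got) channelLock.unlock();
                    return !got; // true => blocked by the other thread
                } finally {
                    delayedTransportLock.unlock();
                }
            };
            // Like "pool-1-thread-1": channel lock first, then delayed-transport lock.
            Callable<Boolean> second = () -> {
                channelLock.lock();
                try {
                    bothHeld.countDown();
                    bothHeld.await();
                    boolean got = delayedTransportLock.tryLock(200, TimeUnit.MILLISECONDS);
                    if (got) delayedTransportLock.unlock();
                    return !got;
                } finally {
                    channelLock.unlock();
                }
            };
            Future<Boolean> f1 = pool.submit(first);
            Future<Boolean> f2 = pool.submit(second);
            return f1.get() && f2.get(); // both blocked => circular wait
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("circular wait: " + demonstrateInversion());
    }
}
```

With real monitors (no timeout), both threads would block forever, exactly as in the thread dump above.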

@garrettjonesgoogle (Contributor, Author)

Right, the infinite recursion is what I want to prevent. Simply removing the RPC-based log handlers from io.grpc and io.netty isn't really what we want to do though, because we want to be capable of logging grpc and netty logs to non-Logging services.

@zhangkun83 (Contributor)

What we need is a per-(gRPC)channel log configuration, so that the channel used by the logger would not use RPC-based logging. That doesn't seem possible with the Java logging API.

We could create a custom logging API for gRPC that can be configured on a per-channel basis, but that doesn't seem feasible for netty.

@garrettjonesgoogle (Contributor, Author)

One alternative is for gRPC to set a ThreadLocal that google-cloud-java code could read in order to determine if it is in the context of a gRPC worker thread.

@zhangkun83 (Contributor) commented Dec 15, 2016

> because we want to be capable of logging grpc and netty logs to non-Logging services.

I don't fully understand this. Do you mean you want grpc and netty to also log to the RPC-based handler?

> One alternative is for gRPC to set a ThreadLocal that google-cloud-java code could read in order to determine if it is in the context of a gRPC worker thread.

How would it work? I was thinking the other way: the RPC-based logger could set the ThreadLocal context, like this:

static final Context.Key<Boolean> IN_CLOUD_LOGGER_KEY = Context.key("in-cloud-logger");

@Override
public void publish(LogRecord record) {
  Boolean inCloudLogger = IN_CLOUD_LOGGER_KEY.get();
  if (inCloudLogger == null) {
    Context newCtx = Context.current().withValue(IN_CLOUD_LOGGER_KEY, true);
    Context origCtx = newCtx.attach();
    try {
      // Do the RPC-based logging
      ...
    } finally {
      newCtx.detach(origCtx);
    }
  } else {
    // Do the non-RPC-based logging
  }
}

@garrettjonesgoogle (Contributor, Author)

>> because we want to be capable of logging grpc and netty logs to non-Logging services.
>
> I don't fully understand this. Do you mean you want grpc and netty to also log to the RPC-based handler?

We want to publish logs from grpc calls, but we wouldn't want the flush() call to run on the grpc worker thread (because it makes the call to the logging service).
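A minimal sketch of that separation (the class and method names here are hypothetical, not google-cloud-java's actual API): publish() only enqueues the record, and a dedicated single-thread executor performs the RPC-backed delivery, so the network call never runs on a gRPC/Netty worker thread.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Hypothetical handler: enqueue on the caller's thread, deliver on our own.
public class AsyncRpcLogHandler extends Handler {
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private final Queue<LogRecord> pending = new ConcurrentLinkedQueue<>();
    private final AtomicInteger sent = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        pending.add(record);
        flusher.execute(this::drain); // hop off the caller's (worker) thread
    }

    private void drain() {
        LogRecord r;
        while ((r = pending.poll()) != null) {
            // Here the real handler would send r to the logging service.
            sent.incrementAndGet();
        }
    }

    public int sentCount() {
        return sent.get();
    }

    @Override
    public void flush() {
        // Delivery is driven by the flusher thread; nothing to do here.
    }

    @Override
    public void close() {
        flusher.shutdown();
        try {
            flusher.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

This avoids the worker-thread re-entrance, though it does not by itself solve the recursion problem when the delivery itself logs; the Context-based guard below addresses that.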

>> One alternative is for gRPC to set a ThreadLocal that google-cloud-java code could read in order to determine if it is in the context of a gRPC worker thread.
>
> How would it work? I was thinking the other way: the RPC-based logger could set the ThreadLocal context, as in the snippet above.

I dug into the docs for Context and Context.Key and I think I understand it. It basically looks like this feature will work for me and I can already put it to use.

@zhangkun83 (Contributor)

Closing this issue as it's not a bug of gRPC per se, and a solution has been accepted.

lock bot locked as resolved and limited conversation to collaborators Sep 22, 2018