need guidance for detecting client stream close in interceptor #5324
Comments
Possibly related to #5323?
Thanks @hexfusion! It certainly seems to me like that patch could fix our panic. @dfawley what do you think?
#5323 should only fix cases related to retry and a race that happens in a very small window. Do you have retries enabled (through service config either specified by the client to …
I do not believe we explicitly enable any sort of retries. But, from what I can gather, "transparent retries" are always enabled; can those not trigger the #5323 race? I'm also confused by this comment on the … Anyway, if you don't believe #5323 is my answer, I'm curious to understand if our use of …
"Retry support" is always enabled unless that dial option is set (and that environment variable no longer does anything; sorry). However, retries are not configured to happen by default except for transparent retries. It requires explicit configuration to declare the retryable status codes.
They are. And they could under very tight races (it might take 2 races hitting at once). Basically, your first attempt has to create a transport stream successfully, and then that stream has to get a REFUSED_STREAM response or a GOAWAY that was already in flight when the stream was created, which enables a transparent retry. Then, to get the nil pointer in the attempt, you would need to select the transport and then fail to create the stream but not transparently retry it. (Calling …
This change should make this whole class of error impossible. So I think it will fix your panics. If you have a way to test that theory, it would be a nice data point to see if the PR works for you.
It's okay to call …
I took a look at that stack overflow question and added a comment about clients not receiving the io.EOF (they are broken unless they cancel the RPC context or close the clientconn). OpenTelemetry seems fine to be using the user's context, as cancelation of that will cancel and clean up the stream correctly. The only reason to use the other context is to get the peer address / etc that grpc adds into the context. Generally, interceptors should be able to assume clients are following the rules listed here: https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream Not doing so leaks resources in gRPC-Go.
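As an illustration of that contract, here is a minimal sketch written generically against `grpc.ClientStream` rather than generated stubs (the `newMsg` allocator is a placeholder for whatever the generated code would provide): either drain the stream until `RecvMsg` returns a non-nil error, or cancel the RPC's context.

```go
package example

import (
	"context"
	"io"

	"google.golang.org/grpc"
)

// drainStream consumes a client stream until it terminates, satisfying the
// ClientConn.NewStream contract: drain until RecvMsg returns an error, or
// cancel the RPC's context. Doing neither leaks gRPC-Go resources.
func drainStream(cs grpc.ClientStream, cancel context.CancelFunc, newMsg func() interface{}) error {
	defer cancel() // always safe; releases the stream if we bail out before io.EOF
	for {
		m := newMsg()
		err := cs.RecvMsg(m)
		if err == io.EOF {
			return nil // the server closed the stream cleanly
		}
		if err != nil {
			return err // the RPC failed; no further cleanup is needed
		}
		// ... process m ...
	}
}
```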
Thanks Doug!
Great! I will do some experiments to see how easily I can reproduce the panic (it's definitely not frequent). If I can reproduce it, I will try your patch and report back.
I see; so the comment on …
Thank you; that's what I thought. So the finalizer thing that we and OpenTracing do is nonsense (assuming that the RPC callers are well behaved), right?
Ack. Then I will also switch our code to listen on the user's context, which should fix the panic independently of #5323.
Ack.
Before this patch, our client-side tracing interceptor for streaming RPC calls was exposed to gRPC bugs currently being fixed in github.com/grpc/grpc-go/pull/5323. This had to do with calls to clientStream.Context() panicking with an NPE under certain races with RPCs failing. We've recently gotten two crashes seemingly because of this. It's unclear why this hasn't shown up before, as nothing seems new (either on our side or on the grpc side). In 22.2 we do use more streaming RPCs than before (for example for span configs), so maybe that does it.

This patch eliminates the problem by eliminating the problematic call into ClientStream.Context(). The background is that our interceptor needs to watch for ctx cancelation and consider the RPC done at that point. But there was no reason for that call; we can more simply use the RPC caller's ctx for the purpose of figuring out whether the caller cancels the RPC. In fact, calling ClientStream.Context() is bad for other reasons, beyond exposing us to the bug:

1) ClientStream.Context() pins the RPC attempt to a lower-level connection and inhibits gRPC's ability to sometimes transparently retry failed calls. In fact, there's a comment on ClientStream.Context() that tells you not to call it before using the stream through other methods like Recv(), which implies that the RPC is already "pinned" and transparent retries are no longer possible anyway. We were breaking this.

2) One of the grpc-go maintainers suggested that, due to the bugs referenced above, this call could actually sometimes give us "the wrong context", although how wrong exactly is not clear to me (i.e. could we have gotten a ctx that doesn't inherit from the caller's ctx? Or a ctx that's canceled independently from the caller?).

This patch also removes a paranoid catch-all finalizer in the interceptor that assured that the RPC client span is always eventually closed (at a random GC time), regardless of what the caller does - i.e. even if the caller forgets about interacting with the response stream or canceling the ctx. This kind of paranoia is not needed. The code in question was copied from grpc-opentracing[1], which quoted a StackOverflow answer[2] about whether or not a client is allowed to simply walk away from a streaming call. As a result of conversations triggered by this patch [3], that SO answer was updated to reflect the fact that it is not, in fact, OK for a client to do so, as it will leak gRPC resources. The client's contract is specified in [4] (although arguably that place is not the easiest to find by a casual gRPC user). In any case, this patch gets rid of the finalizer. This could in theory result in leaked spans if our own code is buggy in the way that the paranoia prevented, but all our TestServers check that spans don't leak like that, so we are pretty protected. FWIW, a newer re-implementation of the OpenTracing interceptor[5] doesn't have the finalizer (although it also doesn't listen for ctx cancellation, so I think it's buggy), and neither does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in google.golang.org/grpc/internal/transport.(*Stream).Context(...) was fixed.
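For illustration, the shape of the fix described above is roughly the following (a minimal sketch, not the actual CockroachDB patch; `startSpan` stands in for whatever tracing library is used). The wrapper watches the caller's ctx and the result of `RecvMsg`, and never calls `cs.Context()`:

```go
package example

import (
	"context"
	"sync"

	"google.golang.org/grpc"
)

// StreamClientInterceptor opens a span per streaming RPC and closes it when
// the caller's ctx is canceled or the stream terminates, without ever calling
// ClientStream.Context(). startSpan returns a function that finishes the span.
func StreamClientInterceptor(startSpan func(method string) func()) grpc.StreamClientInterceptor {
	return func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption,
	) (grpc.ClientStream, error) {
		finish := startSpan(method)
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			finish()
			return nil, err
		}
		ws := &wrappedStream{ClientStream: cs, done: make(chan struct{})}
		go func() {
			select {
			case <-ctx.Done(): // the caller's ctx, not cs.Context()
			case <-ws.done: // RecvMsg returned io.EOF or another error
			}
			finish()
		}()
		return ws, nil
	}
}

type wrappedStream struct {
	grpc.ClientStream
	done chan struct{}
	once sync.Once
}

func (w *wrappedStream) RecvMsg(m interface{}) error {
	err := w.ClientStream.RecvMsg(m)
	if err != nil { // io.EOF on clean termination, or an RPC failure
		w.once.Do(func() { close(w.done) })
	}
	return err
}
```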
80476: security: update the TLS cipher suite list r=bdarnell a=knz

Fixes #80483

This does not really change the list; it merely explains more clearly how it was built.

Release note: None

80635: release: use preflight tool for RedHat images r=celiala a=rail

RedHat Connect has introduced a new certification workflow, which requires running some tests locally and submitting the results back to RedHat. This patch runs the `preflight` tool to address the new workflow changes.

Fixes #80633

Release note: None

80878: util/tracing: fix crash in StreamClientInterceptor r=andreimatei a=andreimatei

(Same commit message and release note as quoted above.)

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Yes, that sounds right.
Yes, I don't think that should be required, assuming your application is following the requirements that the gRPC documentation lists. The finalizer idea may originate from this comment.
Yes, it is a child of the application's context passed to …
If you are installing a …
Yes, this would work if you have an interceptor and not a stats handler.
Let us know if you have any more questions on this.
Hello gRPC friends,
We (CockroachDB) are seeing rare nil-pointer crashes on calls to `ClientStream.Context()` from inside a streaming interceptor. I'd like to kindly ask for your guidance in dealing with it, because the situation is not clear to me and the struggle is real. The code in question is copied from an opentracing library, so the issue is potentially affecting others too.

The background is that we have a client streaming interceptor that creates tracing spans (think OpenTracing/OpenTelemetry) for every streaming RPC call. We want the span to live for "the life of the stream". The interceptor in question is here; it calls the wrapped `Streamer` to get a `ClientStream`, then creates a tracing span, and then wraps the inner `ClientStream` in our own implementation that deals with signaling the end of the stream so that we can close the span.

The crash I'm talking about is here. This code waits for the "stream to end" through either `Recv()` receiving an `EOF`, or through the stream's ctx being canceled. And it's this `cs.Context()` call that seems to sometimes (rarely) crash, with an NPE from gRPC (from here). I'm not sure why it crashes, and I'm also not sure that we're doing the right thing in the first place.

The contract of `ClientStream` says:

I believe we're breaking this contract, as we're calling `Context()` before `RecvMsg()`. At least, before our code calls `RecvMsg()`; I also don't see a call inside gRPC happening at `ClientStream` creation time. So, perhaps this is our problem. Except it seems surprising to me that, if the answer is this clear, we're not crashing all the time.

I've looked a bit deeper into the gRPC code to try to understand where `Context()`'s contract is coming from. I couldn't figure it out. The only interesting thing I found was a comment on the `cs.attempt` field, which says:

I believe the issue that this comment hints at is not relevant to us, because our `Context()` call comes after `newClientStream()`. The panicking line is `cs.attempt.s.Context()`, and `attempt.s` and `attempt.s.ctx` are initialized early, as far as I can tell. So, I'm not sure why `ClientStream.Context()` seems to insist on not being called before `RecvMsg()`. Perhaps you could clarify?

The second part of my question is about whether our code is generally sane and how it should look. What's the recommended way to be notified of a streaming RPC's end? We're doing two things that I find dubious:

- We listen for cancelation on `ClientStream.Context()`, as opposed to the caller's context.
- We set a finalizer on our stream wrapper as a catch-all for closing the span.

Do these look right to you?
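To make the two dubious points concrete, here is a simplified reconstruction of the pattern in question (not our actual code; `finishSpan` stands in for the tracing library and is assumed to be idempotent, e.g. guarded by a sync.Once). The watcher selects on `cs.Context().Done()`, i.e. the stream's own context, and a finalizer is set as a catch-all for callers that walk away from the stream:

```go
package example

import (
	"runtime"

	"google.golang.org/grpc"
)

type tracedStream struct {
	grpc.ClientStream
	finished chan struct{} // closed by our RecvMsg wrapper on io.EOF or error
}

func watchStream(cs grpc.ClientStream, ts *tracedStream, finishSpan func(error)) {
	finished := ts.finished // avoid capturing ts, so the finalizer below can fire
	go func() {
		select {
		case <-cs.Context().Done(): // the call that (rarely) panics with the NPE
			finishSpan(cs.Context().Err())
		case <-finished:
			finishSpan(nil)
		}
	}()
	// Paranoid catch-all: close the span at some arbitrary GC time if the
	// caller abandons the stream without canceling the ctx or draining it.
	runtime.SetFinalizer(ts, func(s *tracedStream) { finishSpan(nil) })
}
```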
We've copied this code from grpc-opentracing, which is, or at least used to be, a fairly high-profile library. In particular, they motivate the finalizer with a reference to a stackoverflow topic which seems dubious to me. Do you agree with it?
I've checked how other similar libraries do this, and the answer always seems to be different:
Thank you very much for your help! Please let me know if I should clarify anything.