When converting to some zaptest loggers in some internal tests, I started getting occasional test panics like:
After digging around a bit, I can see we are using some single-peer-choosers with grpc, and:
So if shutdown calls `peer.Single.Stop()` and then `grpc.Transport.Stop()`, the peer will be removed after having only been told to `stop()`, and the transport's `Stop()` will not wait for it to stop its background goroutine.

I'm not 100% certain that shutdown occurs in this order (fx logs don't make that explicit), but it seems like it probably has to, as peers are used in outbounds. Stop RPC == stop outbounds -> stop peers -> stop transports, right?
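To make the suspected gap concrete, here's a minimal sketch of that shutdown sequence. These are made-up stand-in types, not yarpc's actual grpc transport internals: the peer's `Stop()` only signals its background goroutine, `ReleasePeer` forgets the peer, and the transport's `Stop()` then has nothing left to wait on.

```go
// Hypothetical model of the suspected gap -- illustrative stand-ins,
// not yarpc's actual grpc transport code.
package grpcsketch

import "sync"

type peer struct {
	stopping chan struct{} // closed by Stop() to signal the goroutine
	done     chan struct{} // closed by the goroutine once it actually exits
}

func (p *peer) run() {
	defer close(p.done)
	<-p.stopping
	// ... tear down the connection; any logging here (e.g. via a zaptest
	// logger) panics if the test has already returned.
}

// Stop only signals the background goroutine; it does not wait for it.
func (p *peer) Stop() { close(p.stopping) }

type transport struct {
	mu    sync.Mutex
	peers map[string]*peer
}

// ReleasePeer signals the peer and forgets it, done chan included, so a
// later transport Stop() has no record of the still-running goroutine.
func (t *transport) ReleasePeer(addr string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if p, ok := t.peers[addr]; ok {
		p.Stop()
		delete(t.peers, addr)
	}
}

// Stop waits only on peers still in the map: a peer released during
// shutdown can still have its goroutine running after Stop() returns.
func (t *transport) Stop() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, p := range t.peers {
		p.Stop()
		<-p.done
	}
}
```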
I'm not seeing any way to patch this from the outside, as the peer instance and its API don't seem to be exposed anywhere. Which is probably a good thing. So I think this has to be fixed internally.
As a possibly simple option: maybe `grpc.Transport` should just keep all stop-chans (remove the peer but not the chan in `ReleasePeer`) and wait on all of them during `Stop()`? It would leak empty chans unless some cleanup process was run, but if that's an issue then closed chans could probably be cleared out in `ReleasePeer` as a garbage collector. A rough sketch of this idea follows below.

Or should `ReleasePeer` just wait too? I'm not sure what the semantics are here, but it seems like it may be intentional that it doesn't wait.
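Here's a rough sketch of the first option, again with hypothetical names (`retired`, `peerEntry`) rather than yarpc's real fields: `ReleasePeer` keeps the released peer's done chan, `Stop()` waits on both live and released peers, and already-closed chans are pruned in `ReleasePeer` so the slice doesn't grow without bound.

```go
// Sketch of "keep all stop-chans and wait in Stop()" -- illustrative only,
// not a patch against yarpc's actual grpc.Transport.
package grpcsketch

import "sync"

type peerEntry struct {
	stop func()        // signals the peer's background goroutine
	done chan struct{} // closed when that goroutine exits
}

type transport struct {
	mu      sync.Mutex
	peers   map[string]*peerEntry // live peers
	retired []chan struct{}       // done chans of peers already released
}

// ReleasePeer removes the peer but remembers its done chan so Stop can wait
// on it. Already-closed chans are pruned here, acting as a small garbage
// collector so the slice doesn't grow forever on long-lived transports.
func (t *transport) ReleasePeer(addr string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	p, ok := t.peers[addr]
	if !ok {
		return
	}
	delete(t.peers, addr)
	p.stop()

	kept := t.retired[:0]
	for _, done := range t.retired {
		select {
		case <-done: // goroutine already exited; drop the chan
		default:
			kept = append(kept, done)
		}
	}
	t.retired = append(kept, p.done)
}

// Stop signals every remaining peer, then waits for the goroutines of both
// live and previously released peers before returning.
func (t *transport) Stop() {
	t.mu.Lock()
	pending := make([]chan struct{}, 0, len(t.peers)+len(t.retired))
	for _, p := range t.peers {
		p.stop()
		pending = append(pending, p.done)
	}
	pending = append(pending, t.retired...)
	t.mu.Unlock()

	for _, done := range pending {
		<-done
	}
}
```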
I haven't carefully checked the other transports to see if they have similar issues, but e.g. http is sufficiently different that it doesn't obviously have the same problem.