client: fix race between connection error and subconn shutdown #6494
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is something I noticed when adding the
SubConn.Shutdown
method in #6493, but never observed. If we fail to checkac.ctx
while holdingac.mu
, then the following race is technically possible:resetTransport
,tryAllAddrs
is called and returns an error.acCtx
(AKAac.ctx
).Err() == nil
as theaddrConn
is still in use.RemoveSubConn
(orSubConn.Shutdown
) is called. This sets the state toShutdown
and cancelsac.ctx
.resetTransport
, we grabac.mu
and update the state toTransientFailure
.The net result is the LB policy's state listener (or
UpdateSubConnState
) will receiveShutdown
followed byTransientFailure
. All LB policies should be ignoring updates afterShutdown
anyway, but this is still an incorrect transition.I don't believe this case is worth trying to stimulate with a test. The race window is extremely tiny, and such bugs could have easily existed in several other places in our code. I confirmed via manual inspection that everything either checks
ac.ctx
orac.transport
while holdingac.mu
before callingac.updateConnectivityState
.RELEASE NOTES: none