Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientconn: Subconn state may transition from CONNECTING directly to IDLE #7503

Closed
arjan-bal opened this issue Aug 12, 2024 · 3 comments
Closed

Comments

@arjan-bal
Copy link
Contributor

In this test run it is seen that the subconn transitioned directly from CONNECTING to IDLE when the server was shut down. The current pickfirst implementation handles this by transitioning the channel to IDLE itself. However, the new pickfirst for Dual Stack gets stuck when this happens because it behaves differently based on the state after CONNECTING:

  1. If the next state is READY, it selects the subconn and transitions the channel to READY.
  2. If the next state is TRANSIENT FAILURE, it tries to connect to the next subconn. The channel is transitioned to TF if all subconns have been tried once.

Based on the above, if the new PF sees IDLE directly, it doesn't know how to handle it. If the missing state b/w CONNECTING and IDLE was READY, then the channel would also need to be sent into IDLE. If the missing state b/w CONNECTING and IDLE was TF, then the channel state would remain unchanged and the subconn may be asked to reconnect.

To fix this, we should ensure that the transition to IDLE due to server sent GOAWAY should be reported only once the subconn has exited CONNECTING.

@arjan-bal arjan-bal self-assigned this Aug 12, 2024
@arjan-bal
Copy link
Contributor Author

arjan-bal commented Aug 12, 2024

Once Happy Eyeballs is implemented, the new PF will move on to the next subconn after some time, so this unexpected transition will not block the connection attempt.

@arjan-bal
Copy link
Contributor Author

I tried to repro this issue by adding delays in the code and realised that the code correctly handles the cases in which a GoAway (or any other error) is received before the subconn state changes to READY because it locks ac.mu assigns ac.transport to the new transport, puts the subconn in READY and only then release the ac.mu. If onClose is called before ac.mu is locked, the subconn state is not set to IDLE because of this early return block in onClose.

The problem is in this block that has the comment:
// onClose was already called for this connection, but the connection
// was successfully established first. Consider it a success and set
// the new state to Idle.

This block is there to handle the exact scenario that I suspected was causing the issue, but it sets the state from CONNECTING directly to IDLE. There doesn't seem to be any test that covers that block, but I was able to call hcancel() just before it and verify that that the subconn does go from CONNECTING to IDLE.

I need to check if we should be transitioning to READY before IDLE or expect LB policies to handle a direct transition to IDLE.

@arjan-bal
Copy link
Contributor Author

From discussing with other maintainers, the suggestion was to block the onClose callback so it can't do stuff until after NewClientTransport returns and we assign the transport.

I tried to make onClose block, but some unit tests failed due to context timeout because transport.NewClientTransport calls onClose from the same go routine, causing a deadlock.
I then changed onClose to spawn a new go routing to avoid this deadlock and some other tests failed because they relied on onClose to be executed synchronously.

Reporting READY before IDLE should theoretically work. We have locked ac.mu while sending the IDLE notification. This means that no one can read the transport until the lock is released. If we send an update for READY just before IDLE, its guaranteed that no one reads the transport b/w the two updates. addrConn.getTransport() is the method used by balancer_wrapper.NewStream(). getTransport needs a lock on ac.mu as well, so it will never see the addrConn in READY. This will cause churn in the LB policy since it will generate a picker update just to enter IDLE once the IDLE notification is received.

We decided to handle the CONNECTING to IDLE in the new pickfirst instead of trying to change this transition. The existing pickfirst correctly enters IDLE on seeing such an update.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant