Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ErrNoSubConnAvailable reason to pick ctx timeout #7143

Closed
wants to merge 3 commits into from

Conversation

holdno
Copy link
Contributor

@holdno holdno commented Apr 18, 2024

If the server-side balancer picks balancer.ErrNoSubConnAvailable, the gRPC picking logic will cause the context (ctx) to block until it times out. As a result, from the client's perspective, it appears as if there is a server response timeout, without knowing that it is due to NoSubConnAvailable.

#7135

@dfawley dfawley assigned aranjans and unassigned dfawley Apr 19, 2024
@aranjans aranjans added this to the 1.64 Release milestone Apr 25, 2024
@aranjans
Copy link
Contributor

aranjans commented Apr 25, 2024

@holdno There could be other errors due to which picker would be blocked, like no connections available or context deadline error here or ErrNoSubConnAvailable. Probably consider those cases as well.

@aranjans aranjans assigned holdno and unassigned aranjans Apr 25, 2024
@@ -151,6 +151,7 @@ func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.
pickResult, err := p.Pick(info)
if err != nil {
if err == balancer.ErrNoSubConnAvailable {
lastPickErr = err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right way to do this. ErrNoSubConnAvailable is not a meaningful error to a user.

Instead, we should probably change line 120's error to something like fmt.Sprintf("received context error while waiting for new LB policy update: %v", ctx.Err())

Copy link
Contributor Author

@holdno holdno Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.Err() could either indicate an issue with the implementation of a certain load balancer, or it might be caused by the absence of a corresponding server. However, balancer.ErrNoSubConnAvailable might give the client a clearer understanding of the reason for the error. Often, when clients encounter timeouts, they may assume there are issues with the network connection.
I believe that it would be more user-friendly for the client if the specific reason for the timeout were included along with the ctx.Err().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.Err means the RPC was timed out or canceled - we only know which by calling it.

If the RPC queues while it's queued waiting for an updated picker from the LB policy, that means the LB policy should still be attempting connections. If connections have failed, then it should return those connection errors instead.

What LB policy have you configured? Are you making your own? You may be misunderstanding how they are supposed to behave.

I believe that it would be more user-friendly for the client if the specific reason for the timeout were included along with the ctx.Err().

ErrNoSubConnAvailable is not a reason for a timeout. The reason for the timeout is that the LB policy isn't producing a picker that works for the RPC. That only should happen because it's still attempting a connection for that address. Once that address succeeds, a new picker will be produced that will work for that RPC. Or if it fails, a new picker will be produced that returns an appropriate error for the RPC. I think the proposal in my previous comment should help users understand where in the system the RPC timed out, and I don't think there's much more we can do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your response, I hope we can have further discussions.

In my current scenario, during the gray release process, the methods corresponding to the new and old versions of the service are different. In the load balancer, I retrieve the service with the new method by using the fullMethodName. When there are no new version services online, I return an ErrNoSubConnAvailable error, which clearly indicates that there is indeed no effective connection available for the method the client is trying to access. However, the client only receives a timeout error and is not aware of the actual reason for the timeout.

I could indeed use a custom error to convey this scenario, but I feel that the semantics of the ErrNoSubConnAvailable error match my current situation well, and since it is a built-in error of the balancer, the client could also use this error to identify the type of error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could indeed use a custom error to convey this scenario, but I feel that the semantics of the ErrNoSubConnAvailable error match my current situation well

I don't think so. You really should be returning an error in this case. Also, things generally don't work well if you have multiple versions of a service deployed. All clients should use the old version of the service until you know all servers have been upgraded, and only then should you update your clients to use the new version. Otherwise you will run into trouble with things like this.

ErrNoSubConnAvailable is not intended for an end user to see. It is just part of the API between the LB policy and the ClientConn.

Copy link
Contributor Author

@holdno holdno May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I got it. However, I think it is still necessary to clarify to the client the reason for the ctx.Error, such as a timeout in updating the client-side load balancer. This is user-friendly for those who are new to gRPC.

New PR for this. #7206

@aranjans aranjans added the P2 label Apr 30, 2024
@dfawley dfawley closed this May 6, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants