Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ErrNoSubConnAvailable reason to pick ctx timeout #7143

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions picker_wrapper.go
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,7 @@ func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.
pickResult, err := p.Pick(info)
if err != nil {
if err == balancer.ErrNoSubConnAvailable {
lastPickErr = err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right way to do this. ErrNoSubConnAvailable is not a meaningful error to a user.

Instead, we should probably change line 120's error to something like fmt.Sprintf("received context error while waiting for new LB policy update: %v", ctx.Err())

Copy link
Contributor Author

@holdno holdno Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.Err() could either indicate an issue with the implementation of a certain load balancer, or it might be caused by the absence of a corresponding server. However, balancer.ErrNoSubConnAvailable might give the client a clearer understanding of the reason for the error. Often, when clients encounter timeouts, they may assume there are issues with the network connection.
I believe that it would be more user-friendly for the client if the specific reason for the timeout were included along with the ctx.Err().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.Err means the RPC was timed out or canceled - we only know which by calling it.

If the RPC queues while it's queued waiting for an updated picker from the LB policy, that means the LB policy should still be attempting connections. If connections have failed, then it should return those connection errors instead.

What LB policy have you configured? Are you making your own? You may be misunderstanding how they are supposed to behave.

I believe that it would be more user-friendly for the client if the specific reason for the timeout were included along with the ctx.Err().

ErrNoSubConnAvailable is not a reason for a timeout. The reason for the timeout is that the LB policy isn't producing a picker that works for the RPC. That only should happen because it's still attempting a connection for that address. Once that address succeeds, a new picker will be produced that will work for that RPC. Or if it fails, a new picker will be produced that returns an appropriate error for the RPC. I think the proposal in my previous comment should help users understand where in the system the RPC timed out, and I don't think there's much more we can do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your response, I hope we can have further discussions.

In my current scenario, during the gray release process, the methods corresponding to the new and old versions of the service are different. In the load balancer, I retrieve the service with the new method by using the fullMethodName. When there are no new version services online, I return an ErrNoSubConnAvailable error, which clearly indicates that there is indeed no effective connection available for the method the client is trying to access. However, the client only receives a timeout error and is not aware of the actual reason for the timeout.

I could indeed use a custom error to convey this scenario, but I feel that the semantics of the ErrNoSubConnAvailable error match my current situation well, and since it is a built-in error of the balancer, the client could also use this error to identify the type of error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could indeed use a custom error to convey this scenario, but I feel that the semantics of the ErrNoSubConnAvailable error match my current situation well

I don't think so. You really should be returning an error in this case. Also, things generally don't work well if you have multiple versions of a service deployed. All clients should use the old version of the service until you know all servers have been upgraded, and only then should you update your clients to use the new version. Otherwise you will run into trouble with things like this.

ErrNoSubConnAvailable is not intended for an end user to see. It is just part of the API between the LB policy and the ClientConn.

Copy link
Contributor Author

@holdno holdno May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I got it. However, I think it is still necessary to clarify to the client the reason for the ctx.Error, such as a timeout in updating the client-side load balancer. This is user-friendly for those who are new to gRPC.

New PR for this. #7206

continue
}
if st, ok := status.FromError(err); ok {
Expand Down
18 changes: 18 additions & 0 deletions picker_wrapper_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ import (
"time"

"google.golang.org/grpc/balancer"
"google.golang.org/grpc/balancer/base"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/connectivity"
"google.golang.org/grpc/internal/transport"
Expand Down Expand Up @@ -74,6 +75,23 @@ func (s) TestBlockingPickTimeout(t *testing.T) {
}
}

func (s) TestBlockingPickErrNoSubConnAvailableTimeout(t *testing.T) {
bp := newPickerWrapper(nil)
ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
defer cancel()

bp.updatePicker(base.NewErrPicker(balancer.ErrNoSubConnAvailable))
var err error
if _, _, err = bp.pick(ctx, true, balancer.PickInfo{}); status.Code(err) != codes.DeadlineExceeded {
t.Errorf("bp.pick returned error %v, want DeadlineExceeded", err)
}

serr, _ := status.FromError(err)
if serr.Message() != "latest balancer error: "+balancer.ErrNoSubConnAvailable.Error() {
t.Errorf("bp.pick returned error %v, want balancer.ErrNoSubConnAvailable", err)
}
}

func (s) TestBlockingPick(t *testing.T) {
bp := newPickerWrapper(nil)
// All goroutines should block because picker is nil in bp.
Expand Down
Loading