pickfirst: Implement Happy Eyeballs #7725

arjan-bal · 2024-10-10T10:04:04Z

As part of the Dualstack design, the pickfirst policy should implement the happy eyeballs algorithm while connecting to multiple backends.

The timeout for the happy eyeballs connection timer is NOT configurable as that's an optional requirement in the gRFC.

RELEASE NOTES:

The new experimental pickfirst LB policy (disabled by default) supports Happy Eyeballs to attempt connections to multiple backends concurrently. The experimental pickfirst policy can be enabled by setting the environment variable GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST to true.

codecov · 2024-10-10T10:13:01Z

Codecov Report

Attention: Patch coverage is 86.81319% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.74%. Comparing base (18d218d) to head (5c4ff49).
Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go	86.36%	9 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7725      +/-   ##
==========================================
- Coverage   82.00%   81.74%   -0.27%     
==========================================
  Files         373      374       +1     
  Lines       37735    37930     +195     
==========================================
+ Hits        30945    31004      +59     
- Misses       5512     5615     +103     
- Partials     1278     1311      +33

Files with missing lines	Coverage Δ
balancer/pickfirst/internal/internal.go	`100.00% <100.00%> (ø)`
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go	`88.83% <86.36%> (+0.06%)`	⬆️

... and 39 files with indirect coverage changes

easwars · 2024-10-10T19:02:20Z

Should we mention the environment variables in the release note? Or at least in the PR description?

easwars

I couldn't complete a full pass, but here are some comments to get started.

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

internal/envconfig/envconfig.go

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

arjan-bal · 2024-10-11T06:41:10Z

Should we mention the environment variables in the release note? Or at least in the PR description?

Updated the release notes.

.github/workflows/testing.yml

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

easwars

LGTM. Some minor nits in the tests.

balancer/pickfirst/pickfirstleaf/pickfirstleaf_ext_test.go

easwars · 2024-10-22T22:17:50Z

balancer/pickfirst/pickfirstleaf/pickfirstleaf_ext_test.go

+	// Replace the timer channel so that the old timers don't attempt to read
+	// messages pushed next.


Old timers should get canceled when subsequent subchannels are created, right? Why do we need to do this?

arjan-bal · 2024-10-23

This is required because pickfirst stops the timer, but the fake TimeAfterFunc keeps waiting on the timer channel until the context is cancelled. If there are multiple listeners on the timer channel, they will race to read from it.

This could be avoided by introducing an interface for time.Timer so that the test can intercept calls to Timer.Stop().

easwars · 2024-10-23

I see what you are saying. That seems better to me, unless it is too much work.

arjan-bal · 2024-10-23

Refactored internal.TimeAfterFunc to return a cancelFunc() instead of a timer. This allows the test to stop the fake timer when pickfirst cancels it. I also created a helper function that returns a timer function together with a function to trigger the timer manually, instead of having the tests write to a channel.
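
For readers following the refactor described above, here is a minimal sketch of what such a fake timer helper could look like. The names `timeAfterFunc` and `newFakeTimer`, and the exact signature, are assumptions made for illustration; they are not taken from the PR's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timeAfterFunc schedules f and returns a function that cancels the pending
// call. The name and signature are assumptions for this sketch.
type timeAfterFunc func(d time.Duration, f func()) (cancel func())

// newFakeTimer returns a fake timeAfterFunc and a trigger function. trigger
// runs the callback of the most recently armed timer and reports whether a
// callback ran. It does nothing if that timer was cancelled or already fired.
func newFakeTimer() (timeAfterFunc, func() bool) {
	var (
		mu      sync.Mutex
		gen     int    // incremented every time a timer is armed
		pending func() // callback of the armed timer, nil if none
	)
	after := func(_ time.Duration, f func()) func() {
		mu.Lock()
		defer mu.Unlock()
		gen++
		myGen := gen
		pending = f
		return func() {
			mu.Lock()
			defer mu.Unlock()
			// A stale cancel must not disarm a newer timer.
			if gen == myGen {
				pending = nil
			}
		}
	}
	trigger := func() bool {
		mu.Lock()
		f := pending
		pending = nil
		mu.Unlock()
		if f == nil {
			return false
		}
		f()
		return true
	}
	return after, trigger
}

func main() {
	after, trigger := newFakeTimer()
	fired := 0

	// The duration is ignored by the fake; 250ms is an arbitrary value.
	cancel := after(250*time.Millisecond, func() { fired++ })
	cancel()
	fmt.Println(trigger(), fired) // false 0: a cancelled timer never fires

	after(250*time.Millisecond, func() { fired++ })
	fmt.Println(trigger(), fired) // true 1
}
```

Because cancellation is observed by the fake itself, no goroutine is left waiting on a shared channel, which is the race the thread describes.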

easwars · 2024-10-24T12:45:40Z

balancer/pickfirst/pickfirstleaf/pickfirstleaf_ext_test.go

+	testutils.AwaitNotState(shortCtx, t, cc, connectivity.TransientFailure)
+
+	// Third SubConn fails.
+	shortCancel()


Do we need this? Won't testutils.AwaitNotState fail the test if the specified state is reached before the context expires?

arjan-bal · 2024-10-24

It's not required, because of the way testutils.AwaitNotState works. When I tried to ignore the first cancel function as follows:

shortCtx, _ := context.WithTimeout(ctx, defaultTestShortTimeout)

govet complained about a possible context leak, because it can't verify at compile time that the context will be cancelled. If we re-assign the cancel func later, govet doesn't complain, but I still called cancel just to be consistent. Removed the call now.
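
As a rough illustration of the pattern that avoids the vet warning, the sketch below keeps the cancel function and defers it. The helper `signalled` and the `demo` function are hypothetical stand-ins written for this example; this is not how testutils.AwaitNotState is implemented.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// signalled reports whether ch delivers a value (or is closed) before a
// short timeout derived from ctx expires. It is a hypothetical helper for
// this sketch, not part of grpc-go.
func signalled(ctx context.Context, ch <-chan struct{}, timeout time.Duration) bool {
	shortCtx, cancel := context.WithTimeout(ctx, timeout)
	// Deferring cancel releases the timer on every return path and keeps
	// go vet's lostcancel check quiet.
	defer cancel()
	select {
	case <-ch:
		return true
	case <-shortCtx.Done():
		return false
	}
}

// demo exercises both outcomes with arbitrary timeouts.
func demo() (closedSeen, openSeen bool) {
	ctx := context.Background()
	closed := make(chan struct{})
	close(closed)
	open := make(chan struct{}) // never written to
	return signalled(ctx, closed, time.Second), signalled(ctx, open, 10*time.Millisecond)
}

func main() {
	closedSeen, openSeen := demo()
	fmt.Println(closedSeen, openSeen) // true false
}
```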

easwars · 2024-10-24T12:54:10Z

balancer/pickfirst/pickfirstleaf/pickfirstleaf_ext_test.go

+	// The happy eyeballs timer expires, skipping server[1] and requesting the creation
+	// of a third SubConn.


Why do you say we are skipping server[1] here? IIUC:

we first started a connection to server[0]

connection to server[0] failed before the HE timer fired

so, we started a connection to server[1]

now, the HE timer has fired

so, we would start a connection to server[2]

I don't see where we are skipping server[1].

arjan-bal · 2024-10-24

The test doesn't skip the server; it skips waiting for the SubConn to report success or failure and moves on to the next SubConn. The comment was taken from Java's test case. I've improved the wording now.
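
The sequence discussed in this thread can be replayed with a small simulation. This is only a sketch under simplifying assumptions: every event advances to the next address, and a failure is always attributed to the most recently started attempt. The names `event`, `attemptFailed`, `timerFired`, and `replay` are invented for the example and do not come from pickfirstleaf.go.

```go
package main

import "fmt"

type event int

const (
	attemptFailed event = iota // the most recently started attempt failed
	timerFired                 // the Happy Eyeballs timer expired
)

// replay walks an address list of size numAddrs. It returns the indices that
// were dialed, in order, and the indices whose attempts are still in flight.
func replay(numAddrs int, events []event) (dialed, inFlight []int) {
	start := func(i int) {
		dialed = append(dialed, i)
		inFlight = append(inFlight, i)
	}
	if numAddrs == 0 {
		return nil, nil
	}
	start(0)
	for _, ev := range events {
		if ev == attemptFailed && len(inFlight) > 0 {
			// Simplification: the newest in-flight attempt is the one that failed.
			inFlight = inFlight[:len(inFlight)-1]
		}
		// Both a failure and a timer expiry move on to the next address.
		if next := len(dialed); next < numAddrs {
			start(next)
		}
	}
	return dialed, inFlight
}

func main() {
	// server[0] fails before the timer fires, then the timer fires while
	// server[1] is still connecting.
	dialed, inFlight := replay(3, []event{attemptFailed, timerFired})
	fmt.Println(dialed, inFlight) // [0 1 2] [1 2]
}
```

The output shows that server[1] is still being attempted when the connection to server[2] starts: the timer only ends the wait for server[1], it does not drop it.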

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

dfawley · 2024-11-07T23:34:13Z

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

+			// The SubConn is being re-used and failed during a previous pass
+			// over the addressList. It has not completed backoff yet.
+			// Mark it as having failed and try the next address.
+			scd.connectionFailed = true


connectionFailed is a bit like lastErr != nil. Do we need both?

arjan-bal · 2024-11-08

lastErr is used to update the picker at the end of the first pass. In the case where the last address in the list hasn't completed its backoff from a previous attempt, scd.lastErr would store a non-nil error. This is why scd.lastErr is not reset when starting the first pass over a new address list.

scd.connectionFailed indicates whether the subchannel has failed with the latest address list from the resolver. It is reset before starting the first pass.

Consider a subchannel that is being re-used after a resolver update because its address is present in the new address list. The subchannel has already failed, so it has scd.lastErr set and scd.connectionFailed set to true. When the first pass starts, scd.connectionFailed is set to false.

If the subchannel has completed backoff when the iteration over the address list reaches it, the subchannel will be connected since its state is IDLE. If it fails again, scd.connectionFailed will be set to true and scd.lastErr will be updated.

If the subchannel is in backoff when the iteration over the address list reaches it, the subchannel will not be re-tried. scd.lastErr will be retained and scd.connectionFailed will be set to true.

The above steps ensure that the subchannel always has a non-nil error to update the picker.
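
The bookkeeping described above can be sketched as follows. This is a simplified model, not the code from pickfirstleaf.go: the `scData` fields mirror the names used in the thread, while `state`, `firstPass`, the `dial` hook, and `demo` are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

type state int

const (
	idle state = iota
	inBackoff
)

// scData is a simplified stand-in for the per-SubConn data in the thread.
type scData struct {
	addr             string
	st               state
	lastErr          error // retained across resolver updates
	connectionFailed bool  // reset at the start of every first pass
}

// firstPass walks the address list once. dial is a hypothetical hook that
// returns the result of a connection attempt. The returned error is what
// would be reported to the picker if every address failed; nil means one
// address connected.
func firstPass(list []*scData, dial func(addr string) error) error {
	for _, scd := range list {
		scd.connectionFailed = false // lastErr is deliberately kept
	}
	var err error
	for _, scd := range list {
		if scd.st == inBackoff {
			// Not retried: mark as failed and keep the earlier error.
			scd.connectionFailed = true
			err = scd.lastErr
			continue
		}
		if e := dial(scd.addr); e != nil {
			scd.lastErr = e
			scd.connectionFailed = true
			err = e
			continue
		}
		return nil
	}
	return err
}

// demo models a re-used SubConn "b" that is last in the list and still in
// backoff, so its retained lastErr is what reaches the picker.
func demo() error {
	list := []*scData{
		{addr: "a", st: idle},
		{addr: "b", st: inBackoff, lastErr: errors.New("b: connection refused"), connectionFailed: true},
	}
	return firstPass(list, func(addr string) error {
		return errors.New(addr + ": timeout")
	})
}

func main() {
	fmt.Println(demo()) // b: connection refused
}
```

If lastErr were reset together with connectionFailed, the pass in `demo` would end with no error available for the last address, which is the situation the two separate fields avoid.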

dfawley · 2024-11-11

OK, I see what's happening here, thanks for the explanation.

Maybe name it connectionFailed(In/During)FirstPass?

arjan-bal · 2024-11-12

Renamed to connectionFailedInFirstPass.

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

internal/envconfig/envconfig.go

dfawley

LGTM modulo the one request to change connectionFailed to be a little more specific.

Thanks!!

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

dfawley · 2024-11-11T19:23:50Z

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go

+			// The SubConn is being re-used and failed during a previous pass
+			// over the addressList. It has not completed backoff yet.
+			// Mark it as having failed and try the next address.
+			scd.connectionFailed = true


OK, I see what's happening here, thanks for the explanation.

Maybe name it connectionFailed(In/During)FirstPass?

arjan-bal added the Type: Feature New features or improvements in behavior label Oct 10, 2024

arjan-bal added this to the 1.68 Release milestone Oct 10, 2024

arjan-bal requested a review from easwars October 10, 2024 10:04

arjan-bal assigned easwars Oct 10, 2024

Implement happy eyeballs

db0dda7

arjan-bal force-pushed the grpc-go-happy-eyeballs branch from 7cb88fe to db0dda7 Compare October 10, 2024 10:08

easwars reviewed Oct 10, 2024

easwars assigned arjan-bal and unassigned easwars Oct 10, 2024

Use timeAfterFunc

826bb03

Address review comments

4e68e58

arjan-bal assigned easwars and unassigned arjan-bal Oct 11, 2024

zasweq reviewed Oct 11, 2024

.github/workflows/testing.yml (outdated)

easwars reviewed Oct 15, 2024

purnesh42H modified the milestones: 1.68 Release, 1.69 Release Oct 16, 2024

arjan-bal added 5 commits October 16, 2024 12:59

Move timer func to internal, improve log statement and address review comments

fe69816

Remove env var

3022304

Change to e2e style test

0a3ffd3

Fix vet

6697267

Fix vet

67f7a1a

arjan-bal force-pushed the grpc-go-happy-eyeballs branch from 7f3065d to 67f7a1a Compare October 16, 2024 11:09

refactor test

9712ec5

arjan-bal force-pushed the grpc-go-happy-eyeballs branch from af38951 to 9712ec5 Compare October 16, 2024 11:40

arjan-bal assigned dfawley Oct 16, 2024

arjan-bal requested a review from dfawley October 16, 2024 19:18

easwars approved these changes Oct 22, 2024

easwars assigned arjan-bal and unassigned easwars Oct 22, 2024

arjan-bal added 2 commits October 23, 2024 15:30

Improve whitespaces and comments

8f63d8e

Merge branch 'master' of github.com:grpc/grpc-go into grpc-go-happy-eyeballs

09f27c6

arjan-bal removed their assignment Oct 23, 2024

easwars assigned arjan-bal Oct 23, 2024

arjan-bal added 3 commits October 23, 2024 23:11

Refactor fake timer

6610516

Don't use expired context

d6bc007

Remove unnecessary timer in test

19a3165

arjan-bal removed their assignment Oct 23, 2024

easwars reviewed Oct 24, 2024

arjan-bal added 3 commits October 24, 2024 18:52

Address review comments

598fdd0

Merge remote-tracking branch 'source/master' into grpc-go-happy-eyeballs

d3bde50

Remove stale comment

8b4b28e

arjan-bal added the Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. label Nov 7, 2024

Use rand/v2

6c16943

dfawley reviewed Nov 7, 2024

dfawley assigned arjan-bal and unassigned dfawley Nov 7, 2024

Address review comments

11fe515

arjan-bal assigned dfawley and unassigned arjan-bal Nov 8, 2024

dfawley approved these changes Nov 11, 2024

dfawley assigned arjan-bal and unassigned dfawley Nov 11, 2024

Rename to connectionFailedInFirstPass

5c4ff49

arjan-bal merged commit e2b98f9 into grpc:master Nov 12, 2024
15 checks passed

arjan-bal deleted the grpc-go-happy-eyeballs branch November 12, 2024 09:04

arjan-bal mentioned this pull request Feb 6, 2025

NOTICE: Upcoming API changes to experimental Name Resolver (resolver) and LB Policy (balancer) packages #6472
