
xdsclient: new Transport interface and LRS stream implementation #7717

Merged
7 commits merged into grpc:master on Oct 17, 2024

Conversation

@easwars (Contributor) commented Oct 9, 2024

#a71-xds-fallback
#xdsclient-refactor

The existing structure of the xDS client is as follows:

  • the xDS client has an authority type for each authority configuration in the bootstrap (ignoring authority sharing)
  • each authority has a Transport which contains a grpc.ClientConn to the xDS management server
  • the Transport type provides the following functionality
    • runs an ADS stream, and allows the authority to trigger a DiscoveryRequest to be sent
    • runs an LRS stream, and allows the authority to start the load reporting

The new structure for the xDS client will be as follows:

  • the xDS client will have one authority type for each authority configuration in the bootstrap (even if the authority configurations are the same or share the same server configuration)
  • the xDS client will own a set of xdsChannels, one for each server configuration specified in the bootstrap
  • each authority will acquire references to one or more xdsChannel instances
  • each xdsChannel will contain the following
    • a Transport to the xDS management server. This will be an interface, allowing non-gRPC transports to be used.
    • an ADS stream instance that runs an ADS stream and supports resource subscription/unsubscription.
    • an LRS stream instance that runs an LRS stream and supports starting and stopping the stream.
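The new structure above can be sketched roughly as follows. The interface shape and field names are illustrative guesses for discussion, not the actual types added by this PR:

```go
package main

import "fmt"

// Transport abstracts the connection to an xDS management server. A gRPC
// implementation would wrap a grpc.ClientConn; the interface form allows
// non-gRPC transports to be plugged in. (Hypothetical shape.)
type Transport interface {
	// NewStream creates a bidirectional stream for the given method.
	NewStream(method string) (Stream, error)
	Close() error
}

// Stream is a minimal bidirectional stream abstraction.
type Stream interface {
	Send(msg []byte) error
	Recv() ([]byte, error)
}

// xdsChannel owns a Transport plus ADS/LRS stream state for one server
// configuration. Authorities hold counted references to xdsChannels.
type xdsChannel struct {
	transport Transport
	refCount  int // number of authorities referencing this channel
}

func (c *xdsChannel) acquire() { c.refCount++ }

// release drops one reference and returns the remaining count; at zero the
// channel (and its transport) can be torn down.
func (c *xdsChannel) release() int {
	c.refCount--
	return c.refCount
}

func main() {
	ch := &xdsChannel{}
	ch.acquire()
	ch.acquire()
	fmt.Println(ch.release()) // 1: one authority still holds a reference
	fmt.Println(ch.release()) // 0: channel can be torn down
}
```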

This PR introduces the following functionality:

  • Defines the Transport interface and provides a gRPC transport implementation.
  • Provides an LRS stream implementation.

The current LRS implementation can be found in https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/transport/loadreport.go, and this PR's implementation is heavily based on it.

Subsequent PRs will add more functionality.

Addresses #6902

RELEASE NOTES: none

@easwars easwars requested review from zasweq and purnesh42H October 9, 2024 19:01
@easwars easwars added the Type: Internal Cleanup Refactors, etc label Oct 9, 2024
@easwars easwars added this to the 1.68 Release milestone Oct 9, 2024
codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 7.96020% with 185 lines in your changes missing coverage. Please review.

Project coverage is 81.64%. Comparing base (ec10e73) to head (434d43b).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
xds/internal/xdsclient/transport/lrs/lrs_stream.go 0.00% 154 Missing ⚠️
...xdsclient/transport/grpctransport/grpctransport.go 34.04% 29 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7717      +/-   ##
==========================================
- Coverage   82.05%   81.64%   -0.42%     
==========================================
  Files         362      364       +2     
  Lines       28111    28312     +201     
==========================================
+ Hits        23067    23114      +47     
- Misses       3845     4014     +169     
+ Partials     1199     1184      -15     
Files with missing lines Coverage Δ
...xdsclient/transport/grpctransport/grpctransport.go 34.04% <34.04%> (ø)
xds/internal/xdsclient/transport/lrs/lrs_stream.go 0.00% <0.00%> (ø)

... and 23 files with indirect coverage changes

@easwars easwars force-pushed the lrs_stream_implementation branch from 3118c9e to 8e32c21 Compare October 9, 2024 23:11
@zasweq zasweq assigned easwars and unassigned zasweq Oct 11, 2024
@zasweq (Contributor) left a comment

LGTM with some minor nits.

grpctest.RunSubTests(t, s{})
}

// Tests that the grpctransport.Builder creates a new grpc.ClientConn every time
Contributor:

I don't really get the point of this test. You're hooking into new client to make sure it gets called, and then getting rid of it and then you make a transport and then immediately close it. To me it verifies only two things:

  1. The function internal.GRPCNewClient is called from transport creation.
  2. You successfully overwrote internal.GRPCNewClient in the first Dial, and the second one did not hit your overwritten function.

Contributor Author:

Modified the test to not reset the new client hook after the first call. This changes the logic of the test such that it verifies that every time Build is called, the new client hook is called.

Also, I have to call grpc.NewClient from the overridden function, because I need to return a *grpc.ClientConn. If I simply set customDialerCalled to true and returned nil for the first return value, that would lead to a panic, since the code actually calls cc.Connect once grpc.NewClient returns with a nil error.

@zasweq (Contributor) commented Oct 15, 2024
Oh, I guess it would nil-panic if not successfully connected, so it's testing that too. So it's testing that Build and NewClient calls come 1:1?

Contributor Author:

No, why would it panic if it is not successfully connected? In fact, I don't even have a real management server running as part of this test. What I meant was if I passed nil for the first return value in the overridden client hook, that would cause the code to panic because it calls cc.Connect on it.

So it's testing Build/NewClient come 1:1?

Sort of. But it does not test that NewClient actually ends up establishing a connection, because that would mean that we are testing grpc.NewClient instead of this Build function.
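The hook-override pattern discussed in this thread can be sketched in isolation with a package-level creation hook. newClientHook and Build below are simplified stand-ins, not the real internal.GRPCNewClient or the grpctransport builder:

```go
package main

import "fmt"

// newClientHook is the function used to create a new client connection.
// Tests can override it; production code leaves the default in place. This
// mirrors (hypothetically) the internal.GRPCNewClient override discussed
// above. The test must still delegate to the original hook, because the
// caller uses the returned connection (a nil return would panic).
var newClientHook = func(target string) (string, error) {
	return "conn:" + target, nil
}

// Build creates a transport by invoking the hook. The property under test
// is that Build and the hook come 1:1: every Build call goes through
// newClientHook.
func Build(target string) (string, error) {
	return newClientHook(target)
}

func main() {
	calls := 0
	orig := newClientHook
	newClientHook = func(target string) (string, error) {
		calls++
		return orig(target) // still delegate, so a usable conn is returned
	}
	Build("server-1")
	Build("server-2")
	fmt.Println(calls) // 2: the hook was called on every Build
}
```

Note that, as the author says, this only verifies the hook is invoked per Build; it deliberately does not verify that a real connection is established, which would be testing grpc.NewClient itself.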

Comment on lines 239 to 243
rInterval := resp.GetLoadReportingInterval()
if err := rInterval.CheckValid(); err != nil {
return nil, 0, fmt.Errorf("lrs: invalid load_reporting_interval: %v", err)
}
interval := rInterval.AsDuration()
Contributor:

I always wonder what to do here when I get a variable from a methodA(), and then convert it to the type I'll eventually use, which is semantically the same thing.

Contributor Author:

I don't understand your comment. Do you want me to get rid of the local interval and inline rInterval.AsDuration() in the return statement?

Contributor:

No, I just don't know what the best practices in Go are here, since I see this come up too: data := getData(), data2 := data.(Data), and then use data2 for the rest of the function. I don't know what to call data/data2.

Contributor Author:

Ah I see. I don't think there is any specific guidance around this in the style guide, or at least, I haven't seen it before.

Personally, if I'm doing

data := getData()
data2, ok := data.(Data)

I would make the variable name for data as short as possible, since its scope is only a couple of lines, and I would make the variable name for data2 more meaningful, since that will probably have a bigger scope. I had left the names as they were in the previous code, but have now changed the second one to be more descriptive. I couldn't use int for interval since that is a predeclared identifier, and I didn't want to use i for interval since that is usually reserved for indices.
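The naming convention described above can be illustrated with a tiny example; getData and the value it returns are invented for illustration:

```go
package main

import "fmt"

// getData returns an untyped value, as from a generic API.
func getData() any { return 42 }

// typedAttempts shows the convention: a terse name for the short-lived
// untyped value, and a descriptive name for the typed value that the rest
// of the function uses.
func typedAttempts() int {
	v := getData() // terse: its scope is only the next two lines
	retryAttempts, ok := v.(int) // descriptive: used from here on
	if !ok {
		return -1
	}
	return retryAttempts
}

func main() {
	fmt.Println(typedAttempts()) // 42
}
```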


// recvFirstLoadStatsResponse receives the first LoadStatsResponse from the LRS
// server. Returns the following:
// - a list of cluster names requested by the server or an empty slice if the
Contributor:

Nit: I feel like an empty slice is distinct from nil, which is currently being returned. len() of a nil slice returns 0, but I think an empty slice is []type{}.

Contributor Author:

Looks like the method on the load.Store, to which the clusters returned from here are passed, handles empty slices correctly by checking for len() == 0 instead of checking for nil.

See:

func (s *Store) Stats(clusterNames []string) []*Data {

But I think it also makes sense for me to return an empty slice here when the server requests load from all clusters, because that is semantically different from returning a nil slice for other error conditions.

Contributor:

Yeah I just meant that in the case where you want all clusters, it states it was returning an empty slice but it was returning nil instead (which happened to also be what was being returned in error cases).

@zasweq zasweq assigned easwars and unassigned easwars Oct 15, 2024
@easwars easwars removed their assignment Oct 15, 2024
@purnesh42H purnesh42H modified the milestones: 1.68 Release, 1.69 Release Oct 16, 2024

if lrs.refCount != 0 {
lrs.refCount++
return lrs.lrsStore, cleanup
Contributor:

Let me know if I understand this correctly:

  • Multiple grpc clients can report load on the same load store through the xds client
  • Only the first ReportLoad() call creates the LRS stream, and all reported stats go through it irrespective of how many grpc clients are reporting
  • Each grpc client, when it is done reporting, calls cleanup, which decrements the refCount
  • Only when the last grpc client is done reporting and calls cleanup is the LRS stream destroyed

Contributor Author:

Multiple grpc clients can report load on the same load store through xds client

A single xds client is shared across grpc channels (with the same target URI) and grpc servers. Load reporting is a client-side feature, so let's forget about servers for now. Load is reported currently by the clusterimpl LB policy, which is a per-cluster LB policy. Load reports for all clusters within a single grpc client go through the same xDS client. They all share the same load store, which supports recording loads for multiple clusters.

Only the first ReportLoad() call creates the stream for LRS and all reported stats go through that irrespective of how many grpc clients are reporting

More specifically, the first call to ReportLoad that causes the ref count to become 1.

Each grpc client, when it is done reporting, calls cleanup which decrements the refCount

Again, this is not per grpc client. This is per clusterimpl policy (or whichever entity is responsible for reporting load)

Only the last grpc client, when it's done reporting, calls cleanup and the lrs stream is destroyed

The call to cleanup that causes the ref count to go to 0 will result in the underlying stream being cleaned up.
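The ref-counting behavior described in this thread can be sketched as follows. StreamImpl's real fields differ; lrsState and its boolean stream flag are simplified stand-ins:

```go
package main

import (
	"fmt"
	"sync"
)

// lrsState sketches the ref-counting: the ReportLoad call that takes
// refCount from 0 to 1 starts the stream; the cleanup call that drops it
// back to 0 stops it.
type lrsState struct {
	mu            sync.Mutex
	refCount      int
	streamRunning bool
}

// ReportLoad returns a cleanup func. Callers (e.g. clusterimpl LB
// policies, which are the entities actually reporting load) must call
// cleanup exactly once when done reporting.
func (l *lrsState) ReportLoad() (cleanup func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.refCount == 0 {
		l.streamRunning = true // first caller: start the LRS stream
	}
	l.refCount++
	return func() {
		l.mu.Lock()
		defer l.mu.Unlock()
		l.refCount--
		if l.refCount == 0 {
			l.streamRunning = false // last caller: stop the stream
		}
	}
}

func main() {
	l := &lrsState{}
	c1 := l.ReportLoad()
	c2 := l.ReportLoad()
	fmt.Println(l.streamRunning) // true
	c1()
	fmt.Println(l.streamRunning) // true: one reporter still active
	c2()
	fmt.Println(l.streamRunning) // false: last cleanup stopped the stream
}
```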

func (lrs *StreamImpl) runner(ctx context.Context) {
defer close(lrs.doneCh)

// This feature indicates that the client supports the
Contributor:

Does supports_send_all_clusters mean the client should report load statistics for all clusters it's aware of, even if they weren't explicitly requested?

Contributor Author:

This is a client feature, i.e. something that the client supports. See: https://www.envoyproxy.io/docs/envoy/latest/api/client_features.
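The effect of this client feature on report scoping can be sketched as plain logic; the struct below only mimics the relevant LoadStatsResponse fields and is a local stand-in, not the generated proto type:

```go
package main

import "fmt"

// loadStatsResponse mimics the two ways an LRS server can scope a report:
// an explicit cluster list, or SendAllClusters, which the server may only
// set because the client advertised the supports_send_all_clusters client
// feature in its node metadata.
type loadStatsResponse struct {
	SendAllClusters bool
	Clusters        []string
}

// clustersForReport picks which clusters to include in the next load
// report, given the clusters this client currently knows about.
func clustersForReport(resp loadStatsResponse, known []string) []string {
	if resp.SendAllClusters {
		return known // report every cluster the client is aware of
	}
	return resp.Clusters // report only what the server asked for
}

func main() {
	known := []string{"cluster-a", "cluster-b"}
	fmt.Println(clustersForReport(loadStatsResponse{SendAllClusters: true}, known))
	fmt.Println(clustersForReport(loadStatsResponse{Clusters: []string{"cluster-a"}}, known))
}
```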

// - any error encountered
//
// If the server requests for endpoint-level load reporting, an error is
// returned, since this is not yet supported.
Contributor:

What is the meaning of endpoint-level load reporting not being supported? Does it mean LoadStat doesn't support that yet? Is that something we will have to support in the future?

Contributor Author:

Got rid of this, since the other languages don't support this and we don't have any plans of supporting this at this point as a cross-language feature.

@easwars easwars force-pushed the lrs_stream_implementation branch from bba916f to 434d43b Compare October 17, 2024 03:31
@easwars easwars merged commit d2ded4b into grpc:master Oct 17, 2024
15 checks passed
@easwars easwars deleted the lrs_stream_implementation branch October 17, 2024 03:40