
xdsclient: new Transport interface and LRS stream implementation #7717

Merged
7 commits merged into grpc:master on Oct 17, 2024

Conversation

@easwars (Contributor) commented Oct 9, 2024

#a71-xds-fallback
#xdsclient-refactor

The existing structure of the xDS client is as follows:

  • the xDS client has an authority type for each authority configuration in the bootstrap (ignoring authority sharing)
  • each authority has a Transport which contains a grpc.ClientConn to the xDS management server
  • the Transport type provides the following functionality
    • runs an ADS stream, and allows the authority to trigger a DiscoveryRequest to be sent
    • runs an LRS stream, and allows the authority to start the load reporting

The new structure for the xDS client will be as follows:

  • the xDS client will have one authority type for each authority configuration in the bootstrap (even if the authority configurations are the same or share the same server configuration)
  • the xDS client will own a set of xdsChannels, one for each server configuration specified in the bootstrap
  • each authority will acquire references to one or more xdsChannel instances
  • each xdsChannel will contain the following
    • a Transport to the xDS management server. This will be an interface, allowing non-gRPC transports to be used.
    • an ADS stream instance that runs an ADS stream and supports resource subscription/unsubscription.
    • an LRS stream instance that runs an LRS stream and supports starting and stopping the stream.
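The new structure above can be sketched roughly as follows. The interface shape and field names are illustrative guesses for discussion, not the actual types added by this PR:

```go
package main

import "fmt"

// Transport abstracts the connection to an xDS management server. A gRPC
// implementation would wrap a grpc.ClientConn; the interface form allows
// non-gRPC transports to be plugged in. (Hypothetical shape.)
type Transport interface {
	// NewStream creates a bidirectional stream for the given method.
	NewStream(method string) (Stream, error)
	Close() error
}

// Stream is a minimal bidirectional stream abstraction.
type Stream interface {
	Send(msg []byte) error
	Recv() ([]byte, error)
}

// xdsChannel owns a Transport plus ADS/LRS stream state for one server
// configuration. Authorities hold counted references to xdsChannels.
type xdsChannel struct {
	transport Transport
	refCount  int // number of authorities referencing this channel
}

func (c *xdsChannel) acquire() { c.refCount++ }

// release drops one reference and returns the remaining count; at zero the
// channel (and its transport) can be torn down.
func (c *xdsChannel) release() int {
	c.refCount--
	return c.refCount
}

func main() {
	ch := &xdsChannel{}
	ch.acquire()
	ch.acquire()
	fmt.Println(ch.release()) // 1: one authority still holds a reference
	fmt.Println(ch.release()) // 0: channel can be torn down
}
```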

This PR introduces the following functionality:

  • Defines the Transport interface and provides a gRPC transport implementation.
  • Provides an LRS stream implementation.

The current LRS implementation can be found in https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/transport/loadreport.go, and this PR's implementation is heavily based on it.

Subsequent PRs will add more functionality.

Addresses #6902

RELEASE NOTES: none

@easwars easwars requested review from zasweq and purnesh42H October 9, 2024 19:01
@easwars easwars added the Type: Internal Cleanup Refactors, etc label Oct 9, 2024
@easwars easwars added this to the 1.68 Release milestone Oct 9, 2024
codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 7.96020% with 185 lines in your changes missing coverage. Please review.

Project coverage is 81.64%. Comparing base (ec10e73) to head (434d43b).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
xds/internal/xdsclient/transport/lrs/lrs_stream.go 0.00% 154 Missing ⚠️
...xdsclient/transport/grpctransport/grpctransport.go 34.04% 29 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7717      +/-   ##
==========================================
- Coverage   82.05%   81.64%   -0.42%     
==========================================
  Files         362      364       +2     
  Lines       28111    28312     +201     
==========================================
+ Hits        23067    23114      +47     
- Misses       3845     4014     +169     
+ Partials     1199     1184      -15     
Files with missing lines Coverage Δ
...xdsclient/transport/grpctransport/grpctransport.go 34.04% <34.04%> (ø)
xds/internal/xdsclient/transport/lrs/lrs_stream.go 0.00% <0.00%> (ø)

... and 23 files with indirect coverage changes

@easwars easwars force-pushed the lrs_stream_implementation branch from 3118c9e to 8e32c21 Compare October 9, 2024 23:11
@zasweq zasweq assigned easwars and unassigned zasweq Oct 11, 2024
@zasweq (Contributor) left a comment

LGTM with some minor nits.

grpctest.RunSubTests(t, s{})
}

// Tests that the grpctransport.Builder creates a new grpc.ClientConn every time
Contributor:

I don't really get the point of this test. You're hooking into new client to make sure it gets called, and then getting rid of it and then you make a transport and then immediately close it. To me it verifies only two things:

  1. The function internal.GRPCNewClient is called from transport creation.
  2. You successfully overwrote internal.GRPCNewClient in the first Dial, and the second one did not hit your overwritten function.

Contributor Author:

Modified the test to not reset the new client hook after the first call. This changes the logic of the test such that it verifies that every time Build is called, the new client hook is called.

Also, I have to call grpc.NewClient from the overridden function, because I need to return a *grpc.ClientConn. If I simply set customDialerCalled to true and returned nil for the first return value, that would lead to a panic, since the code actually calls cc.Connect once grpc.NewClient returns with a nil error.

@zasweq (Contributor) commented Oct 15, 2024
Oh, I guess it would nil-panic if not successfully connected, so it's testing that too. So it's testing that Build and NewClient calls come 1:1?

Contributor Author:

No, why would it panic if it is not successfully connected? In fact, I don't even have a real management server running as part of this test. What I meant was if I passed nil for the first return value in the overridden client hook, that would cause the code to panic because it calls cc.Connect on it.

So it's testing Build/NewClient come 1:1?

Sort of. But it does not test that NewClient actually ends up establishing a connection, because that would mean that we are testing grpc.NewClient instead of this Build function.
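The hook-override pattern discussed in this thread can be sketched in isolation with a package-level creation hook. newClientHook and Build below are simplified stand-ins, not the real internal.GRPCNewClient or the grpctransport builder:

```go
package main

import "fmt"

// newClientHook is the function used to create a new client connection.
// Tests can override it; production code leaves the default in place. This
// mirrors (hypothetically) the internal.GRPCNewClient override discussed
// above. The test must still delegate to the original hook, because the
// caller uses the returned connection (a nil return would panic).
var newClientHook = func(target string) (string, error) {
	return "conn:" + target, nil
}

// Build creates a transport by invoking the hook. The property under test
// is that Build and the hook come 1:1: every Build call goes through
// newClientHook.
func Build(target string) (string, error) {
	return newClientHook(target)
}

func main() {
	calls := 0
	orig := newClientHook
	newClientHook = func(target string) (string, error) {
		calls++
		return orig(target) // still delegate, so a usable conn is returned
	}
	Build("server-1")
	Build("server-2")
	fmt.Println(calls) // 2: the hook was called on every Build
}
```

Note that, as the author says, this only verifies the hook is invoked per Build; it deliberately does not verify that a real connection is established, which would be testing grpc.NewClient itself.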

Comment on lines 239 to 243
rInterval := resp.GetLoadReportingInterval()
if err := rInterval.CheckValid(); err != nil {
return nil, 0, fmt.Errorf("lrs: invalid load_reporting_interval: %v", err)
}
interval := rInterval.AsDuration()
Contributor:

I always wonder what to do here when I get a variable from a methodA(), and then convert it to the type I'll eventually use, which is semantically the same thing.

Contributor Author:

I don't understand your comment. Do you want me to get rid of the local interval and inline rInterval.AsDuration() in the return statement?

Contributor:

No, I just don't know what the best practices in Go are here, since I see this come up too: data := getData(), data2 := data.(Data), and then use data2 for the rest of the function. I don't know what to call data/data2.

Contributor Author:

Ah I see. I don't think there is any specific guidance around this in the style guide, or at least, I haven't seen it before.

Personally, if I'm doing

data := getData()
data2, ok := data.(Data)

I would make the variable name for data as short as possible, since its scope is only a couple of lines, and I would make the variable name for data2 more meaningful, since that will probably have a bigger scope. I had left the names as they were in the previous code, but have now changed the second one to be more descriptive. I couldn't use int for interval since that is a predeclared identifier, and I didn't want to use i for interval since that is usually reserved for indices.
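The naming convention described above can be illustrated with a tiny example; getData and the value it returns are invented for illustration:

```go
package main

import "fmt"

// getData returns an untyped value, as from a generic API.
func getData() any { return 42 }

// typedAttempts shows the convention: a terse name for the short-lived
// untyped value, and a descriptive name for the typed value that the rest
// of the function uses.
func typedAttempts() int {
	v := getData() // terse: its scope is only the next two lines
	retryAttempts, ok := v.(int) // descriptive: used from here on
	if !ok {
		return -1
	}
	return retryAttempts
}

func main() {
	fmt.Println(typedAttempts()) // 42
}
```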


// recvFirstLoadStatsResponse receives the first LoadStatsResponse from the LRS
// server. Returns the following:
// - a list of cluster names requested by the server or an empty slice if the
Contributor:

Nit: I feel like an empty slice is distinct from nil, which is currently being returned. len() of a nil slice returns 0, but I think an empty slice is []type{}.

Contributor Author:

Looks like the method on the load.Store, to which the clusters returned from here are passed, handles empty slices correctly by checking for len() == 0 instead of checking for nil.

See:

func (s *Store) Stats(clusterNames []string) []*Data {

But I think it also makes sense for me to return an empty slice here when the server requests load from all clusters, because that is semantically different from returning a nil slice for other error conditions.

Contributor:

Yeah I just meant that in the case where you want all clusters, it states it was returning an empty slice but it was returning nil instead (which happened to also be what was being returned in error cases).

@zasweq zasweq assigned easwars and unassigned easwars Oct 15, 2024
@easwars easwars removed their assignment Oct 15, 2024
@purnesh42H purnesh42H modified the milestones: 1.68 Release, 1.69 Release Oct 16, 2024

if lrs.refCount != 0 {
lrs.refCount++
return lrs.lrsStore, cleanup
Contributor:

Let me know if I understand this correctly:

  • Multiple grpc clients can report load on the same load store through the xds client
  • Only the first ReportLoad() call creates the LRS stream, and all reported stats go through it irrespective of how many grpc clients are reporting
  • Each grpc client, when it is done reporting, calls cleanup, which decrements the refCount
  • Only when the last grpc client is done reporting and calls cleanup is the LRS stream destroyed

Contributor Author:

Multiple grpc clients can report load on the same load store through xds client

A single xds client is shared across grpc channels (with the same target URI) and grpc servers. Load reporting is a client-side feature, so let's forget about servers for now. Load is reported currently by the clusterimpl LB policy, which is a per-cluster LB policy. Load reports for all clusters within a single grpc client go through the same xDS client. They all share the same load store, which supports recording loads for multiple clusters.

Only the first ReportLoad() call creates the stream for LRS and all reported stats go through that irrespective of how many grpc clients are reporting

More specifically, the first call to ReportLoad that causes the ref count to become 1.

Each grpc client, when it is done reporting, calls cleanup which decrements the refCount

Again, this is not per grpc client. This is per clusterimpl policy (or whichever entity is responsible for reporting load)

Only the last grpc client, when it's done reporting, calls cleanup and the lrs stream is destroyed

The call to cleanup that causes the ref count to go to 0 will result in the underlying stream being cleaned up.
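The ref-counting behavior described in this thread can be sketched as follows. StreamImpl's real fields differ; lrsState and its boolean stream flag are simplified stand-ins:

```go
package main

import (
	"fmt"
	"sync"
)

// lrsState sketches the ref-counting: the ReportLoad call that takes
// refCount from 0 to 1 starts the stream; the cleanup call that drops it
// back to 0 stops it.
type lrsState struct {
	mu            sync.Mutex
	refCount      int
	streamRunning bool
}

// ReportLoad returns a cleanup func. Callers (e.g. clusterimpl LB
// policies, which are the entities actually reporting load) must call
// cleanup exactly once when done reporting.
func (l *lrsState) ReportLoad() (cleanup func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.refCount == 0 {
		l.streamRunning = true // first caller: start the LRS stream
	}
	l.refCount++
	return func() {
		l.mu.Lock()
		defer l.mu.Unlock()
		l.refCount--
		if l.refCount == 0 {
			l.streamRunning = false // last caller: stop the stream
		}
	}
}

func main() {
	l := &lrsState{}
	c1 := l.ReportLoad()
	c2 := l.ReportLoad()
	fmt.Println(l.streamRunning) // true
	c1()
	fmt.Println(l.streamRunning) // true: one reporter still active
	c2()
	fmt.Println(l.streamRunning) // false: last cleanup stopped the stream
}
```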

func (lrs *StreamImpl) runner(ctx context.Context) {
defer close(lrs.doneCh)

// This feature indicates that the client supports the
Contributor:

Does supports_send_all_clusters mean the client should report load statistics for all clusters it's aware of, even if they weren't explicitly requested?

Contributor Author:

This is a client feature, i.e. something that the client supports. See: https://www.envoyproxy.io/docs/envoy/latest/api/client_features.
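The effect of this client feature on report scoping can be sketched as plain logic; the struct below only mimics the relevant LoadStatsResponse fields and is a local stand-in, not the generated proto type:

```go
package main

import "fmt"

// loadStatsResponse mimics the two ways an LRS server can scope a report:
// an explicit cluster list, or SendAllClusters, which the server may only
// set because the client advertised the supports_send_all_clusters client
// feature in its node metadata.
type loadStatsResponse struct {
	SendAllClusters bool
	Clusters        []string
}

// clustersForReport picks which clusters to include in the next load
// report, given the clusters this client currently knows about.
func clustersForReport(resp loadStatsResponse, known []string) []string {
	if resp.SendAllClusters {
		return known // report every cluster the client is aware of
	}
	return resp.Clusters // report only what the server asked for
}

func main() {
	known := []string{"cluster-a", "cluster-b"}
	fmt.Println(clustersForReport(loadStatsResponse{SendAllClusters: true}, known))
	fmt.Println(clustersForReport(loadStatsResponse{Clusters: []string{"cluster-a"}}, known))
}
```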

// - any error encountered
//
// If the server requests for endpoint-level load reporting, an error is
// returned, since this is not yet supported.
Contributor:

What is the meaning of endpoint-level load reporting not being supported? Does it mean LoadStat doesn't support that yet? Is that something we will have to support in the future?

Contributor Author:

Got rid of this, since the other languages don't support this and we don't have any plans of supporting this at this point as a cross-language feature.

@easwars easwars force-pushed the lrs_stream_implementation branch from bba916f to 434d43b Compare October 17, 2024 03:31
@easwars easwars merged commit d2ded4b into grpc:master Oct 17, 2024
15 checks passed
@easwars easwars deleted the lrs_stream_implementation branch October 17, 2024 03:40