Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clusterresolver: switch a couple of tests to e2e style #6394

Merged
merged 3 commits into from
Jun 23, 2023

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Jun 23, 2023

This change helps with an upcoming change where I plan to switch the EDS watch to use the generic xDS client API.

#resource-agnostic-xdsclient-api

RELEASE NOTES: none

@easwars easwars requested a review from zasweq June 23, 2023 15:43
@easwars easwars added this to the 1.57 Release milestone Jun 23, 2023
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments.


// testChildPolicyBalancerBuilder wraps the priority LB policy (which is the
// child policy for the cluster resolver LB policy), and makes certain events
// available to the test (subConn state changes and config pushes).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: instead of config pushes (which I assume to come from UpdateClientConnState) can you instead say ParseConfig(). I know it should come 1:1 (ParseConfig than UpdateClientConnState) but I had to fix an issue in cds where we were bypassing this API here: https://github.com/grpc/grpc-go/pull/6361/files#diff-20fc9eddb97bc2324ff2b3ca5bd2cada8c7f4ad23776d547ad08e09087699bb0R376.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up. I had initially thought about overriding UpdateClientConnState because that seems to be the more appropriate thing to do, but decided on getting the config from ParseConfig, purely because of laziness. Now, I switched the code to rely on UpdateClientConnState instead. This seems more correct to me.

case <-ctx.Done():
doneCh <- fmt.Errorf("test timeout expired when waiting for child policy to see a READY subConn")
return
case state := <-testChildPolicy.scStateCh:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not racy? I.e. after spawning goroutine for this check, we're not guaranteed the processory actually hits this select, and the other operations from lines 421 - 428 can execute without this select running in a separate goroutine, and the line 362 hits, and we lose the Update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find. Switched to using a buffer.Unbounded because I want two properties here:

  • no state updates should be lost (as you pointed out)
  • the UpdateSubConnState method should not block when writing on the channel because the test goroutine isn't reading anymore. This can be lead to test failures as well with the leakchecker.

I didn't switch the other channel to a buffer.Unbounded because we don't have that issue there (at least with the current usage).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was actually going to suggest this pattern. In order to not lose state, we need an arbitrary length buffer, and keep reading until READY received. So this change lgtm.

Comment on lines 517 to 529
select {
case jsonCfg := <-testChildPolicy.cfgCh:
gotCfg := &priority.LBConfig{}
if err := json.Unmarshal(jsonCfg, gotCfg); err != nil {
t.Fatalf("Failed to unmarshal LB config received by child policy: %v", err)
}
if diff := cmp.Diff(wantCfg, gotCfg); diff != "" {
t.Fatalf("Child policy received unexpected diff in config (-want +got):\n%s", diff)
}
case <-ctx.Done():
t.Fatalf("Timeout when waiting for child policy to receive its configuration")
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the test above, this check/recv is not in a goroutine. Is there a reason for the divergence? (is it this is balancer.Balancer API vs. balancer.ClientConn in test above) Also, similar question here, I don't get how this isn't racy. Are we guaranteed this select is running by the time we send on cfgCh on line 348? Or can that send come before this is running losing this check and making test flaky?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need a goroutine here because I'm not performing an RPC here. Technically, we don't need to perform an RPC in the previous test as well (to get the subChannel to READY), but I thought it was more appropriate to do the RPC there, because we also want to ensure that a READY subchannel also means successful RPCs.

This is not racy here because we expect only one config update to be pushed to the child policy, and the channel used by the wrapped balancer has a buffer of 1 (I thought I had a buffer of 1, but didn't have it, so I added it now). This means that the writer will not be blocked even if the reading goroutine has not started running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant racy with the send being lost do the default in the case of the channel not having buffer of 1. Now that is has buffer of 1 and we expect 1 config update, this check sgtm. Either it sends before or after it gets to this select, and either case works :).

@zasweq zasweq assigned easwars and unassigned zasweq Jun 23, 2023
Copy link
Contributor Author

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review.


// testChildPolicyBalancerBuilder wraps the priority LB policy (which is the
// child policy for the cluster resolver LB policy), and makes certain events
// available to the test (subConn state changes and config pushes).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up. I had initially thought about overriding UpdateClientConnState because that seems to be the more appropriate thing to do, but decided on getting the config from ParseConfig, purely because of laziness. Now, I switched the code to rely on UpdateClientConnState instead. This seems more correct to me.

case <-ctx.Done():
doneCh <- fmt.Errorf("test timeout expired when waiting for child policy to see a READY subConn")
return
case state := <-testChildPolicy.scStateCh:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find. Switched to using a buffer.Unbounded because I want two properties here:

  • no state updates should be lost (as you pointed out)
  • the UpdateSubConnState method should not block when writing on the channel because the test goroutine isn't reading anymore. This can be lead to test failures as well with the leakchecker.

I didn't switch the other channel to a buffer.Unbounded because we don't have that issue there (at least with the current usage).

Comment on lines 517 to 529
select {
case jsonCfg := <-testChildPolicy.cfgCh:
gotCfg := &priority.LBConfig{}
if err := json.Unmarshal(jsonCfg, gotCfg); err != nil {
t.Fatalf("Failed to unmarshal LB config received by child policy: %v", err)
}
if diff := cmp.Diff(wantCfg, gotCfg); diff != "" {
t.Fatalf("Child policy received unexpected diff in config (-want +got):\n%s", diff)
}
case <-ctx.Done():
t.Fatalf("Timeout when waiting for child policy to receive its configuration")
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need a goroutine here because I'm not performing an RPC here. Technically, we don't need to perform an RPC in the previous test as well (to get the subChannel to READY), but I thought it was more appropriate to do the RPC there, because we also want to ensure that a READY subchannel also means successful RPCs.

This is not racy here because we expect only one config update to be pushed to the child policy, and the channel used by the wrapped balancer has a buffer of 1 (I thought I had a buffer of 1, but didn't have it, so I added it now). This means that the writer will not be blocked even if the reading goroutine has not started running.

Copy link
Contributor Author

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review.

@easwars easwars assigned zasweq and unassigned easwars Jun 23, 2023
@easwars
Copy link
Contributor Author

easwars commented Jun 23, 2023

I saw a weird race reported here as part of accessing the balancer registry. It's not entirely related to this test and can actually happen in any test. Maybe we should rethink our rationale for not having a mutex to protect the map in all of our registries (resolver/balancer etc).

@zasweq
Copy link
Contributor

zasweq commented Jun 23, 2023

That race is interesting, in my opinion. Thinking about it, the global map is protected because the registry is called on init() time, which has a deterministic happens before relationship wrt init() called when you build a binary, thus protecting the map that way. It's weird though that there's races in tests unless tests are running concurrently, because tests run after init() I think.

// to the child policy. We *only* wait for READY state since we expect to
// get to that state before the below RPC succeeds.
doneCh := make(chan error, 1)
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, now that it's a buffer.Unbounded and the write doesn't block, you could technically do this after line 434 in main test goroutine right? The sc updates get pushed to buffer in a non blocking way then you read after?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@easwars
Copy link
Contributor Author

easwars commented Jun 23, 2023

That race is interesting, in my opinion. Thinking about it, the global map is protected because the registry is called on init() time, which has a deterministic happens before relationship wrt init() called when you build a binary, thus protecting the map that way. It's weird though that there's races in tests unless tests are running concurrently, because tests run after init() I think.

There are races in tests because we unregister and register things to the registry from tests (not necessarily from the init function in tests), and other non-test code read from the registry (again in non-init functions). I don't see any strong reason for why we shouldn't protect the map in the registry with a mutex. I've filed #6408 for us to discuss this in more detail (and also to ensure that we don't lose track of it).

@zasweq
Copy link
Contributor

zasweq commented Jun 23, 2023

Ok, I see why there would be a race. Doesn't the go test guarantee one at a time not in parallel, so within a single test, if you have a write and read async. Issue sgtm since it's not on rpc critical path so no harm with protecting mutex and adding some instructions in init() time and tests.

@easwars
Copy link
Contributor Author

easwars commented Jun 23, 2023

Ok, I see why there would be a race. Doesn't the go test guarantee one at a time not in parallel, so within a single test, if you have a write and read async. Issue sgtm since it's not on rpc critical path so no harm with protecting mutex and adding some instructions in init() time and tests.

In a test we might do the following:

  • unregister foo from the registry
  • register a dummy foo for the duration of the test
  • defer a register of the original foo

In the intervening timeframe (when the test is running), the code might also read from the registry. For example:

bb := balancer.Get(newSubConfig.Config.Name)

Hence the race, even if Go guarantees that it will only run one test at a time.

@easwars easwars merged commit 0673105 into grpc:master Jun 23, 2023
1 check passed
@easwars easwars deleted the clusterresolver_test_cleanup_part_4 branch June 23, 2023 20:51
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants