
grpc: eliminate panics in server worker implementation #6856

Merged · 5 commits into grpc:master · Dec 15, 2023

Conversation

@easwars (Contributor) commented Dec 14, 2023

This PR eliminates two possible panics:

  • write to a closed channel
    • this was possible when the close of the serverWorkerChannel from Stop/GracefulStop was racing with handling of a new stream
  • close of a closed channel
    • this was possible when Stop/GracefulStop was called more than once

Fixes #6854

RELEASE NOTES:

  • server: fix two bugs that could lead to panics at shutdown when using NumStreamWorkers (experimental feature).

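For illustration only (this is not the actual change made in this PR), here is a minimal Go sketch of one common pattern that rules out both panics: shutdown is signaled by closing a separate done channel, so the work channel is never closed underneath a racing sender, and the close itself is wrapped in sync.Once so that calling stop from both Stop and GracefulStop is harmless. All names below are hypothetical.

package workerpool

import "sync"

// streamWorkerPool illustrates the pattern: work items are sent on work,
// shutdown is signaled by closing done, and done is closed exactly once.
type streamWorkerPool struct {
	work      chan func()   // work items handed to server workers
	done      chan struct{} // closed to signal shutdown
	closeOnce sync.Once     // guards against close of a closed channel
}

func newStreamWorkerPool(buf int) *streamWorkerPool {
	return &streamWorkerPool{
		work: make(chan func(), buf),
		done: make(chan struct{}),
	}
}

// dispatch hands f to a worker, or reports false if the pool has been
// stopped. Because work is never closed, there is no send-on-closed-channel
// race with stop.
func (p *streamWorkerPool) dispatch(f func()) bool {
	select {
	case <-p.done:
		return false
	case p.work <- f:
		return true
	}
}

// stop is safe to call multiple times (e.g. from both Stop and GracefulStop).
func (p *streamWorkerPool) stop() {
	p.closeOnce.Do(func() { close(p.done) })
}
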
@easwars requested a review from dfawley December 14, 2023 19:24
@easwars added this to the 1.61 Release milestone Dec 14, 2023

codecov bot commented Dec 14, 2023

Codecov Report

Merging #6856 (afc5e88) into master (43e4461) will decrease coverage by 0.03%.
Report is 2 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6856      +/-   ##
==========================================
- Coverage   83.64%   83.61%   -0.03%     
==========================================
  Files         286      286              
  Lines       30801    30786      -15     
==========================================
- Hits        25763    25743      -20     
- Misses       3978     3979       +1     
- Partials     1060     1064       +4     
Files Coverage Δ
server.go 82.00% <100.00%> (+2.68%) ⬆️

... and 16 files with indirect coverage changes

defer wg.Done()
for {
	if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil {
		t.Logf("EmptyCall() failed: %v", err)
Member

We expect some to fail with UNAVAILABLE, right? So maybe silently ignore those and t.Error for other errors?
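
A hedged sketch of what that check could look like inside the polling goroutine, assuming the test's existing client, ctx, t, and testpb identifiers, plus the status and codes packages from google.golang.org/grpc/status and google.golang.org/grpc/codes:

for {
	_, err := client.EmptyCall(ctx, &testpb.Empty{})
	if err == nil {
		continue
	}
	if status.Code(err) == codes.Unavailable {
		continue // expected once the server has shut down
	}
	if ctx.Err() != nil {
		return // test context expired; stop issuing RPCs
	}
	t.Errorf("EmptyCall() failed with unexpected error: %v", err)
	return
}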

Contributor Author

What I see is that once the server listener is closed, the client channel moves through IDLE --> CONNECTING --> TRANSIENT_FAILURE, so the RPCs fail with deadline exceeded.
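
For reference, a hedged sketch (not part of this PR) of how those transitions could be observed on a client connection cc (*grpc.ClientConn), using google.golang.org/grpc/connectivity:

// Block until the channel reaches TRANSIENT_FAILURE or ctx expires.
for state := cc.GetState(); state != connectivity.TransientFailure; state = cc.GetState() {
	if !cc.WaitForStateChange(ctx, state) {
		break // ctx expired before reaching TRANSIENT_FAILURE
	}
}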

Contributor Author

Actually, some fail with UNAVAILABLE as well.

Member

That seems wrong, since they're not wait-for-ready RPCs. They should all fail immediately with UNAVAILABLE if there is no working connection to a server.
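
A small hedged illustration of the distinction being drawn, reusing the test's client, ctx, and testpb identifiers:

// Fail-fast (the default): fails immediately with codes.Unavailable when no
// connection to a server is ready.
_, errFailFast := client.EmptyCall(ctx, &testpb.Empty{})

// Wait-for-ready: blocks until the channel becomes ready or ctx expires.
_, errWaitReady := client.EmptyCall(ctx, &testpb.Empty{}, grpc.WaitForReady(true))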

Contributor Author

Makes sense now. Switched to the test deadline for the RPCs, which ensures there is enough time for the LB policy to kick the subConn into CONNECTING (after it goes IDLE on transport closure) and then into TRANSIENT_FAILURE.

Member

These probably truly belong in /server_ext_test.go instead.

Contributor Author

What is the concept of these _ext_test.go files?

Member

The main point is that they're not in the grpc package, but they're adjacent to the code they're testing in the grpc package.

The test package is too big. It has many tests of many sub-components, and we should be looking to move tests closer to the code they're testing. Also, continuing to grow that one package doesn't scale, since the tests within it can't run in parallel.
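
A hedged sketch of the layout being described: a *_ext_test.go file sits in the same directory as the code it exercises but declares the external test package, so it can only touch the exported grpc API, and tests spread across packages this way can run in parallel with each other instead of all living in the one big test package. The test name and body below are hypothetical.

// server_ext_test.go, next to server.go in the grpc package directory.
package grpc_test

import (
	"testing"

	"google.golang.org/grpc"
)

// TestServerWorkers is a hypothetical example; only the exported surface of
// the grpc package is visible from the external grpc_test package.
func TestServerWorkers(t *testing.T) {
	s := grpc.NewServer(grpc.NumStreamWorkers(4))
	defer s.Stop()
}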

@easwars (Contributor Author) commented Dec 14, 2023

Handled the comments. PTAL. Thanks.

@dfawley assigned easwars and unassigned dfawley Dec 14, 2023
Comment on lines 148 to 149
sCtx, sCancel := context.WithTimeout(ctx, defaultTestShortTimeout)
defer sCancel()
Member

This is probably why you're getting the deadline exceeded errors. Just leave these using the base ctx and it should be fine I think?
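
A hedged before/after sketch of that suggestion, assuming sCtx was being passed to the RPCs; ctx, client, testpb, and defaultTestShortTimeout come from the test:

// Before: a short per-RPC deadline that can expire while the LB policy is
// still moving the subchannel through IDLE --> CONNECTING -->
// TRANSIENT_FAILURE, so the RPC surfaces DEADLINE_EXCEEDED.
sCtx, sCancel := context.WithTimeout(ctx, defaultTestShortTimeout)
defer sCancel()
_, err := client.EmptyCall(sCtx, &testpb.Empty{})

// After: reuse the base test context so the RPC has time to fail with
// UNAVAILABLE once the connection is torn down.
_, err = client.EmptyCall(ctx, &testpb.Empty{})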

Contributor Author

Oh crap, didn't see this comment. But I figured out the same thing anyway :). Could have saved 30 minutes if I had looked earlier, but I think the investigation helped.

@easwars assigned dfawley and unassigned easwars Dec 14, 2023
}
defer ccs[i].Close()
client := testgrpc.NewTestServiceClient(ccs[i])
if _, err := client.EmptyCall(ctx, &testpb.Empty{}, grpc.WaitForReady(true)); err != nil {
Member

Nit: delete this WaitForReady as it should not be necessary and could persist a mistaken understanding that it might be necessary. (We used to have a bug waaay back when that would have required it.)

Contributor Author

Done.

@easwars merged commit 45624f0 into grpc:master Dec 15, 2023
14 checks passed
@easwars deleted the panic_6854 branch December 15, 2023 17:47
easwars added a commit to easwars/grpc-go that referenced this pull request Dec 18, 2023
easwars added a commit that referenced this pull request Dec 18, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Jun 13, 2024

Successfully merging this pull request may close these issues.

panic: send on closed channel