Streaming RPC handlers are left running after client connections closed and server is stopped #6921

sunjayBhatia · 2024-01-12T23:51:04Z

What version of gRPC are you using?

v1.60.1

What version of Go are you using (`go version`)?

1.21.6

What operating system (Linux, Windows, …) and version?

Ubuntu 22.04.3, macOS Ventura

What did you do/what did you see?

See reproduction here: https://github.com/sunjayBhatia/grpc-go-issue-repro/tree/main/shutdown-cleanup

Clone the repo above, change to the shutdown-cleanup directory, run: TEST_COUNT=10 make test or TEST_COUNT=10 make test-race (using a higher number of test runs ensure the issue shows up, usually it doesn't take more than a couple runs however)

The repro starts a simple server with a streaming RPC, starts a client connection, sends/receives a message, closes the client connection, stops the server, and exits. The server handler logs when connections are opened/closed. It is hooked up to the test logger, which panics if logs are written after a test ends. We can see in this test that the streaming handler is still running after the test ends: we see the panics with the log lines StreamEcho error in Recv ... and StreamEcho ended.

Leaving the repro code as-is, should see a failure like:

% TEST_COUNT=10 make test-race
go test -v -race -count=10 .
=== RUN   TestShutdownCleanupStream
    repro_test.go:64: test started
    repro_test.go:72: server starting
    repro_test.go:88: sending request
    repro_test.go:20: StreamEcho started
    repro_test.go:93: receiving response
    repro_test.go:98: response: text:"foo"
    repro_test.go:100: closing client connection
    repro_test.go:103: stopping server
    repro_test.go:106: server stop finished
    repro_test.go:77: server exited
    repro_test.go:109: test ended
--- PASS: TestShutdownCleanupStream (0.01s)
=== RUN   TestShutdownCleanupStream
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho error in Recv error: rpc error: code = Canceled desc = context canceled
        panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho ended


goroutine 50 [running]:
testing.(*common).logDepth(0xc000218680, {0xc0000f0078, 0x11}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Log(0xc000218680, {0xc0000f93d8, 0x1, 0x1})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1045 +0x74
panic({0x105068640?, 0xc0000b0580?})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/panic.go:920 +0x26c
testing.(*common).logDepth(0xc000218680, {0xc0000a0780, 0x52}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Logf(0xc000218680, {0x104f36bc7, 0xc}, {0xc0000f9748, 0x2, 0x2})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1055 +0x84
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho.func1(...)
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:24
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho(0xc00003ae40, {0x105135ed8, 0xc0002861c0})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:37 +0x330
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo._EchoService_StreamEcho_Handler({0x10509a540?, 0xc00003ae40}, {0x1051343c0?, 0xc0002be000})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/echo/echo_grpc.pb.go:100 +0xb0
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000174200, {0x1051333e8, 0xc000282270}, {0x1051360a8, 0xc0003081a0}, 0xc0002ae000, 0xc00021a8d0, 0x1054ab720, 0x0)
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1666 +0x16f4
google.golang.org/grpc.(*Server).handleStream(0xc000174200, {0x1051360a8, 0xc0003081a0}, 0xc0002ae000)
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1016 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 37
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1027 +0x1d8
FAIL    github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup    0.663s
FAIL
make: *** [test-race] Error 1

Looks like the goroutine that spawns the stream handler is not waited for so it may still be running even after we think the server has shut down:

grpc-go/server.go

Line 1027 in 953d12a

go f()

Similarly, when using a set number of server stream workers (uncomment/comment the relevant lines in the repro code):

% TEST_COUNT=10 make test-race
go test -v -race -count=10 .
=== RUN   TestShutdownCleanupStream
    repro_test.go:64: test started
    repro_test.go:72: server starting
    repro_test.go:88: sending request
    repro_test.go:93: receiving response
    repro_test.go:20: StreamEcho started
    repro_test.go:98: response: text:"foo"
    repro_test.go:100: closing client connection
    repro_test.go:103: stopping server
    repro_test.go:106: server stop finished
    repro_test.go:109: test ended
--- PASS: TestShutdownCleanupStream (0.00s)
=== RUN   TestShutdownCleanupStream
    repro_test.go:64: test started
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho error in Recv error: rpc error: code = Canceled desc = context canceled
        panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho ended


goroutine 20 [running]:
testing.(*common).logDepth(0xc000183520, {0xc00028e120, 0x11}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Log(0xc000183520, {0xc00022d378, 0x1, 0x1})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1045 +0x74
panic({0x104d5c660?, 0xc00028a8b0?})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/panic.go:920 +0x26c
testing.(*common).logDepth(0xc000183520, {0xc000296960, 0x52}, 0x3)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Logf(0xc000183520, {0x104c2acc7, 0xc}, {0xc00022d6e8, 0x2, 0x2})
        /opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1055 +0x84
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho.func1(...)
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:24
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho(0xc0001a2e20, {0x104e29f78, 0xc0003821c0})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:37 +0x330
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo._EchoService_StreamEcho_Handler({0x104d8e560?, 0xc0001a2e20}, {0x104e28460?, 0xc0003b4000})
        /Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/echo/echo_grpc.pb.go:100 +0xb0
google.golang.org/grpc.(*Server).processStreamingRPC(0xc00025c200, {0x104e27488, 0xc00038c210}, {0x104e2a148, 0xc00008a9c0}, 0xc0003ae000, 0xc0002688d0, 0x10519f720, 0x0)
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1666 +0x16f4
google.golang.org/grpc.(*Server).handleStream(0xc00025c200, {0x104e2a148, 0xc00008a9c0}, 0xc0003ae000)
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1016 +0xa0
google.golang.org/grpc.(*Server).serverWorker(0xc00025c200)
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:618 +0x38
created by google.golang.org/grpc.(*Server).initServerWorkers in goroutine 19
        /Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:631 +0x1f0
FAIL    github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup    0.511s
FAIL
make: *** [test-race] Error 1

It looks like server workers are left running without full coordination on shutdown as well:

grpc-go/server.go

Line 631 in 953d12a

go s.serverWorker()

This issue showed up in Contour CI after we bumped from grpc-go v1.59.0 to v1.60.0+

I've tracked this issue via git bisect down to this PR: #6489 which looks like it changed how coordination of stream handlers is set up. Looks like connection draining/closing might be working as expected, but stream handlers are not

What did you expect to see?

When server is stopped (gracefully or otherwise), it waits until all stream handlers finish to exit.

The text was updated successfully, but these errors were encountered:

dfawley · 2024-01-16T19:08:42Z

This would seem to be caused by this suggestion from that PR.

I wonder if there was a different root cause for the failures that were the reason we removed the waitgroup.

I agree this was a behavior change, and it would be good to bring the old behavior back (along with a test) since it was done that way before and is more ideal.

dfawley · 2024-01-16T19:30:48Z

(Still debugging...)

It seems like there must have been a bug with (or some other exception in) the previous implementation that did allow Stop to return without all handlers completing. E.g. this test performed a blocking Stop call explicitly with a handler running:

grpc-go/test/end2end_test.go

Line 1038 in 953d12a

ss.S.Stop()

So it does seem this would be a behavior change to block if handlers were still running.

Maybe only GracefulStop did this blocking, and Stop just returned immediately.

dfawley · 2024-01-16T19:34:57Z

I believe this (legacy) code in GracefulStop that was not present in Stop is what was causing us to wait for the method handlers to exit. We'll have to think about this a bit to see if we want to continue with this behavior or make Stop block until method handlers return.

grpc-go/server.go

Lines 1908 to 1910 in 82df321

    
           for len(s.conns) != 0 { 
        
           	s.cv.Wait() 
        
           }

It's arguably most correct to block until they exit, and I can't think of any valid reason for handlers to not respect the ctx given to them, which is the only way they would not exit. But it's still a breaking behavior change.

sunjayBhatia added the Type: Bug label Jan 12, 2024

sunjayBhatia mentioned this issue Jan 12, 2024

Featuretests test data race/panic due to Testing.Log called after test finishes projectcontour/contour#6055

Closed

dfawley self-assigned this Jan 16, 2024

dfawley mentioned this issue Jan 16, 2024

server: block GracefulStop on method handlers and make blocking optional for Stop #6922

Merged

dfawley closed this as completed in #6922 Jan 18, 2024

github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streaming RPC handlers are left running after client connections closed and server is stopped #6921

Streaming RPC handlers are left running after client connections closed and server is stopped #6921

sunjayBhatia commented Jan 12, 2024 •

edited

Loading

dfawley commented Jan 16, 2024

dfawley commented Jan 16, 2024

dfawley commented Jan 16, 2024

Streaming RPC handlers are left running after client connections closed and server is stopped #6921

Streaming RPC handlers are left running after client connections closed and server is stopped #6921

Comments

sunjayBhatia commented Jan 12, 2024 • edited Loading

What version of gRPC are you using?

What version of Go are you using (go version)?

What operating system (Linux, Windows, …) and version?

What did you do/what did you see?

What did you expect to see?

dfawley commented Jan 16, 2024

dfawley commented Jan 16, 2024

dfawley commented Jan 16, 2024

sunjayBhatia commented Jan 12, 2024 •

edited

Loading

What version of Go are you using (`go version`)?