badly-behaved client can deadlock server stopping #76

Open
jgraettinger opened this issue Jun 11, 2020 · 3 comments
Comments

@jgraettinger

Documenting my findings debugging a production issue:

tl;dr: a client can interfere with stopping a server, because the sniffing mechanism has no notion of draining connections that have yet to be matched to a sub-listener. The specific scenario I encountered is:

  • grpc.Server.GracefulStop() is called; it closes the wrapped net.Listener and waits on the grpc Server's WaitGroup (note grpc.Server.Stop() would exhibit the same behavior, since it waits on the same WG).
  • That grpc WG can't finish until grpc.Serve finishes, which won't happen until its listener's Accept() returns (a cmux listener).
  • That Accept doesn't return until the cmux WG is done / cmux.Serve() returns.
  • cmux.Serve() has identified that the wrapped listener errored and wants to return, but is blocked in a defer waiting on its WG.
  • The cmux WG is held hostage by a client which has connected but hasn't transmitted an opening frame for cmux to sniff.

Net effect is that grpc.Server.Stop/GracefulStop() & cmux.Serve() can't return until the client connection is remotely closed.
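For context, a minimal sketch of the serving arrangement that exhibits this chain (the address, matcher choice, and shutdown sequencing below are illustrative, not taken from the production code):

package main

import (
	"net"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func main() {
	l, err := net.Listen("tcp", ":8080") // address is illustrative
	if err != nil {
		panic(err)
	}

	// cmux sniffs each new connection and routes it to a matched sub-listener.
	m := cmux.New(l)
	grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))

	srv := grpc.NewServer()
	go srv.Serve(grpcL) // blocks in grpcL.Accept(), i.e. on cmux
	go m.Serve()        // accepts and sniffs new connections

	// ... later, on shutdown. GracefulStop closes the listeners and waits
	// on the gRPC WaitGroup, which (per the chain above) can't finish
	// while cmux.Serve is still waiting on a connection that never sends
	// its first bytes.
	srv.GracefulStop()
}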

Not entirely sure what the right behavior here is. My gut take is that cmux Accept() should preserve the exit semantics of the wrapped listener's Accept, and return its error even though there are outstanding, still-to-be-sniffed connections.
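To illustrate the idea, a hedged sketch of what that could look like (the connc field approximates cmux's internals, and donec is a hypothetical channel that the root Serve loop would close as soon as its Accept errors, before waiting on the sniffing WaitGroup):

// Sketch only, not a drop-in patch: a matched sub-listener's Accept
// returns once the root listener has failed, rather than blocking
// until every in-flight sniff completes.
func (l muxListener) Accept() (net.Conn, error) {
	select {
	case c, ok := <-l.connc:
		if !ok {
			return nil, ErrListenerClosed
		}
		return c, nil
	case <-l.donec: // hypothetical: closed when the root Accept errors
		return nil, ErrListenerClosed
	}
}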

Collected traces:

cmux.Serve has found that the wrapped listener's Accept has errored.
It's trying to return, but is blocked on its own WG within a defer:

goroutine 1798 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc00066c008)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc00066c000)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/soheilhy/cmux.(*cMux).Serve.func1(0xc00012a4b0, 0xc00066c000)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:150 +0x55
github.com/soheilhy/cmux.(*cMux).Serve(0xc00012a4b0, 0x1497b00, 0xc006278640)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:165 +0x120
go.gazette.dev/core/server.(*Server).QueueTasks.func1(0x20, 0x20)
        /gazette/server/server.go:124 +0x40
go.gazette.dev/core/task.(*Group).GoRun.func1(0x0, 0x0)
        /gazette/task/group.go:72 +0x43
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000530a80, 0xc001390080)
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:54 +0x66

That WG can't finish because a per-connection goroutine is stuck waiting to sniff an HTTP/2 header:

goroutine 491775429 [IO wait, 31 minutes]:
internal/poll.runtime_pollWait(0x7f7a20ed50f8, 0x72, 0xffffffffffffffff)
        /usr/local/go/src/runtime/netpoll.go:203 +0x55
internal/poll.(*pollDesc).wait(0xc0065afb98, 0x72, 0x0, 0x9, 0xffffffffffffffff)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0065afb80, 0xc001e66498, 0x9, 0x9, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x19b
net.(*netFD).Read(0xc0065afb80, 0xc001e66498, 0x9, 0x9, 0x865b8e, 0x10401, 0xc000000000)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc0013b8020, 0xc001e66498, 0x9, 0x9, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:184 +0x8e
github.com/soheilhy/cmux.(*bufferedReader).Read(0xc004b59820, 0xc001e66498, 0x9, 0x9, 0xc000184a80, 0x7f7a9f3327d0, 0x0)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/buffer.go:53 +0x12c
io.ReadAtLeast(0x1496d40, 0xc004b59820, 0xc001e66498, 0x9, 0x9, 0x9, 0x85db95, 0xc0091bf440, 0xc0091b0004)
        /usr/local/go/src/io/io.go:310 +0x87
io.ReadFull(...)
        /usr/local/go/src/io/io.go:329
golang.org/x/net/http2.readFrameHeader(0xc001e66498, 0x9, 0x9, 0x1496d40, 0xc004b59820, 0x0, 0x0, 0xc0091bf440, 0x0)
        /gazette/.build/go-path/pkg/mod/golang.org/x/net@v0.0.0-20200324143707-d3edc9973b7e/http2/frame.go:237 +0x87
golang.org/x/net/http2.(*Framer).ReadFrame(0xc001e66460, 0x14a4d60, 0xc0091bf440, 0x0, 0x0)
        /gazette/.build/go-path/pkg/mod/golang.org/x/net@v0.0.0-20200324143707-d3edc9973b7e/http2/frame.go:492 +0xa1
github.com/soheilhy/cmux.matchHTTP2Field(0x1497b80, 0xc0013b8020, 0x1496d40, 0xc004b59820, 0x121d9ea, 0xc, 0xc0054e9ec0, 0x415285)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/matchers.go:236 +0x148
github.com/soheilhy/cmux.HTTP2MatchHeaderFieldSendSettings.func1(0x1497b80, 0xc0013b8020, 0x1496d40, 0xc004b59820, 0xc0013b8020)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/matchers.go:173 +0xca
github.com/soheilhy/cmux.(*cMux).serve(0xc00012a4b0, 0x14c1ea0, 0xc0013b8020, 0xc00015c120, 0xc00066c000)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:184 +0x1d3
created by github.com/soheilhy/cmux.(*cMux).Serve
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:171 +0x19d

Meanwhile, gRPC Serve() is blocked waiting for Accept to return. It must do so before it can notify the gRPC server’s own WG, which is a prerequisite for GracefulStop or Stop to return:

goroutine 1800 [chan receive, 5 minutes]:
github.com/soheilhy/cmux.muxListener.Accept(...)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:229
google.golang.org/grpc.(*Server).Serve(0xc0001661a0, 0x14b2060, 0xc000113820, 0x0, 0x0)
        /gazette/.build/go-path/pkg/mod/google.golang.org/grpc@v1.28.0/server.go:621 +0x210
go.gazette.dev/core/server.(*Server).QueueTasks.func3(0x0, 0x1)
        /gazette/server/server.go:136 +0x44
go.gazette.dev/core/task.(*Group).GoRun.func1(0x0, 0x0)
        /gazette/task/group.go:72 +0x43
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000530a80, 0xc0013900c0)
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:54 +0x66

For completeness, here's where GracefulStop is wedged waiting on its WG, held hostage by grpc.Serve:

goroutine 1802 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc00016631c)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc00016631c)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
google.golang.org/grpc.(*Server).GracefulStop(0xc0001661a0)
        /gazette/.build/go-path/pkg/mod/google.golang.org/grpc@v1.28.0/server.go:1551 +0x1b1
go.gazette.dev/core/broker.(*Service).QueueTasks.func2(0xc00030eb90, 0x50)
        /gazette/broker/service.go:71 +0xa7
go.gazette.dev/core/task.(*Group).GoRun.func1(0x14b6660, 0xc0004b03c0)
        /gazette/task/group.go:72 +0x43
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000530a80, 0xc001390100)
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:54 +0x66
@jgraettinger
Author

An effective mitigation of this behavior is to set a non-zero read timeout via cmux's SetReadTimeout. That timeout applies only while sniffing for an appropriate matcher, just after the initial connection.
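For example (the listener variable and the 10-second value are illustrative):

// Bound how long cmux will wait for a new connection's first bytes while
// sniffing; connections that never transmit are dropped after the timeout.
m := cmux.New(listener)
m.SetReadTimeout(10 * time.Second)
grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))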

jgraettinger added a commit to jgraettinger/gazette that referenced this issue Jun 15, 2020
Setting a read timeout mitigates the impact of
soheilhy/cmux#76

Close GRPCLoopback on hard stop to guard against the possibility of
a remote peer which responds to TCP keep-alive but isn't reading
from the connection.
@brentyates

Would it be reasonable to add a wg.Done() before returning the error on line https://github.com/soheilhy/cmux/blob/master/cmux.go#L165? For me this solution seems to work, but I do not know what other potential problems it might cause.
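For reference, a hedged paraphrase of the accept loop in question (this approximates the shape of cmux v0.1.4's Serve, it is not a verbatim copy), with the suggested call marked:

for {
	c, err := m.root.Accept()
	if err != nil {
		if !m.handleErr(err) {
			// The suggestion: wg.Done() here offsets one outstanding
			// sniffing goroutine so the deferred wg.Wait() can pass.
			return err
		}
		continue
	}

	wg.Add(1)
	go m.serve(c, m.donec, &wg)
}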

@brentyates

Adding the wg.Done() only helps in my case, I suspect, because I know the for loop inside Serve has run at least once first.
