Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlproxyccl: potential deadlock in TestDirectoryServer #71365

Closed
erikgrinaker opened this issue Oct 9, 2021 · 6 comments
Closed

sqlproxyccl: potential deadlock in TestDirectoryServer #71365

erikgrinaker opened this issue Oct 9, 2021 · 6 comments
Assignees
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-serverless

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Oct 9, 2021

A number of the sqlproxyccl tests trigger the deadlock detector for TestDirectoryServer.

To reproduce, remove the skip.UnderDeadlock calls from the tests and run:

$ make testdeadlock PKG=./pkg/ccl/sqlproxyccl/...
[...]
POTENTIAL DEADLOCK: Inconsistent locking. saw this ordering in one goroutine:
happened before
tenantdirsvr/test_directory_svr.go:297 tenantdirsvr.(*TestDirectoryServer).EnsurePod { s.proc.Lock() } <<<<<
tenantdirsvr/test_directory_svr.go:296 tenantdirsvr.(*TestDirectoryServer).EnsurePod {  }
tenant/directory.pb.go:646 tenant._Directory_EnsurePod_Handler { if interceptor == nil { }
../../../vendor/google.golang.org/grpc/server.go:1297 grpc.(*Server).processUnaryRPC { reply, appErr := md.Handler(info.serviceImpl, ctx, df, s.opts.unaryInt) }
../../../vendor/google.golang.org/grpc/server.go:1626 grpc.(*Server).handleStream { s.processUnaryRPC(t, stream, srv, md, trInfo) }
../../../vendor/google.golang.org/grpc/server.go:941 grpc.(*Server).serveStreams.func1.2 { s.handleStream(st, stream, s.traceInfo(st, stream)) }

happened after
tenantdirsvr/test_directory_svr.go:58 tenantdirsvr.NewSubStopper { mu.Lock() } <<<<<
../../../vendor/github.com/sasha-s/go-deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock { func (m *Mutex) Lock() { }
proxy_handler_test.go:1142 sqlproxyccl.newDirectoryServer.func2 { log.TestingClearServerIdentifiers() }
tenantdirsvr/test_directory_svr.go:304 tenantdirsvr.(*TestDirectoryServer).EnsurePod { if len(lst.Pods) == 0 { }
tenant/directory.pb.go:646 tenant._Directory_EnsurePod_Handler { if interceptor == nil { }
../../../vendor/google.golang.org/grpc/server.go:1297 grpc.(*Server).processUnaryRPC { reply, appErr := md.Handler(info.serviceImpl, ctx, df, s.opts.unaryInt) }
../../../vendor/google.golang.org/grpc/server.go:1626 grpc.(*Server).handleStream { s.processUnaryRPC(t, stream, srv, md, trInfo) }
../../../vendor/google.golang.org/grpc/server.go:941 grpc.(*Server).serveStreams.func1.2 { s.handleStream(st, stream, s.traceInfo(st, stream)) }

in another goroutine: happened before
tenantdirsvr/test_directory_svr.go:51 tenantdirsvr.NewSubStopper.func1 { mu.Lock() } <<<<<
tenantdirsvr/test_directory_svr.go:50 tenantdirsvr.NewSubStopper.func1 { parentStopper.AddCloser(stop.CloserFn(func() { }
../../util/stop/stopper.go:108 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:509 stop.(*Stopper).Stop { // any attempts to do so will be refused. }
proxy_handler_test.go:680 sqlproxyccl.TestDirectoryConnect { // since it's not allowed to jump to a different address. }

happened after
tenantdirsvr/test_directory_svr.go:363 tenantdirsvr.(*TestDirectoryServer).deregisterInstance { s.proc.Lock() } <<<<<
tenantdirsvr/test_directory_svr.go:362 tenantdirsvr.(*TestDirectoryServer).deregisterInstance { func (s *TestDirectoryServer) deregisterInstance(tenantID uint64, sql net.Addr) { }
tenantdirsvr/test_directory_svr.go:311 tenantdirsvr.(*TestDirectoryServer).EnsurePod.func1 { s.deregisterInstance(req.TenantID, process.SQL) }
../../util/stop/stopper.go:108 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:509 stop.(*Stopper).Stop { // any attempts to do so will be refused. }
tenantdirsvr/test_directory_svr.go:56 tenantdirsvr.NewSubStopper.func1 { subStopper.Stop(context.Background()) }
../../util/stop/stopper.go:108 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:509 stop.(*Stopper).Stop { // any attempts to do so will be refused. }
proxy_handler_test.go:680 sqlproxyccl.TestDirectoryConnect { // since it's not allowed to jump to a different address. }

Other goroutines holding locks:
goroutine 316642 lock 0xc000ae34a8
tenantdirsvr/test_directory_svr.go:51 tenantdirsvr.NewSubStopper.func1 { mu.Lock() } <<<<<
tenantdirsvr/test_directory_svr.go:50 tenantdirsvr.NewSubStopper.func1 { parentStopper.AddCloser(stop.CloserFn(func() { }
../../util/stop/stopper.go:108 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:509 stop.(*Stopper).Stop { // any attempts to do so will be refused. }
proxy_handler_test.go:680 sqlproxyccl.TestDirectoryConnect { // since it's not allowed to jump to a different address. }

Jira issue: CRDB-10530

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test A-cc-enablement Pertains to current CC production issues or short-term projects labels Oct 9, 2021
@erikgrinaker
Copy link
Contributor Author

@darinpp Pinging for visibility, since you show up in git blame for much of this code. Doesn't seem to be any way to mention or tag any of the CC teams from this repo.

@blathers-crl
Copy link

blathers-crl bot commented Jul 19, 2022

cc @cockroachdb/cdc

@shermanCRL
Copy link
Contributor

Do we see CDC in here somewhere?

@amruss
Copy link
Contributor

amruss commented Jul 27, 2022

Moving off of the CDC backlog for now, as it doesn't seem relevant to our team. Please add it back if we're wrong.

@knz knz added T-serverless and removed T-cdc labels Jul 28, 2022
@knz
Copy link
Contributor

knz commented Jul 28, 2022

definitely not CDC.

@jaylim-crl is this issue still relevant? Can we close it?

@jaylim-crl
Copy link
Collaborator

It is still relevant, but the tests have been skipped. We are aware that the TestDirectoryServer needs to be rewritten. I don't like the idea of calling out to the CRDB executable to start up tenant servers as well.

Closing this for now. The issue will still be around for future reference.

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue Apr 20, 2023
Fixes cockroachdb#76839 and cockroachdb#71365.

This commit rewrites the test to use the static directory server instead of
the dynamic one. The old one is susceptible to GRPC port reuse and deadlock
issues. With this change, only directory_cache_test.go is relying on the
dynamic directory server. Once we rewrite that, we can remove the entire test
directory implementation.

Release note: None
craig bot pushed a commit that referenced this issue Apr 21, 2023
101864: ccl/sqlproxyccl: fixes flake on TestDirectoryConnect r=JeffSwenson a=jaylim-crl

Fixes #76839, #71365.

This commit rewrites the test to use the static directory server instead of the dynamic one. The old one is susceptible to GRPC port reuse and deadlock issues. With this change, only directory_cache_test.go is relying on the dynamic directory server. Once we rewrite that, we can remove the entire test directory implementation.

Release note: None

Co-authored-by: Jay <jay@cockroachlabs.com>
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this issue May 22, 2023
Fixes cockroachdb#76839 and cockroachdb#71365.

This commit rewrites the test to use the static directory server instead of
the dynamic one. The old one is susceptible to GRPC port reuse and deadlock
issues. With this change, only directory_cache_test.go is relying on the
dynamic directory server. Once we rewrite that, we can remove the entire test
directory implementation.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-serverless
Projects
None yet
Development

No branches or pull requests

6 participants