sql: deadlock when shutting down SQL instance #71292
Comments
One solution that was discussed is to run callbacks here asynchronously in goroutines. I'm not super familiar with the code, and I don't know if that is safe. Running callbacks in goroutines will solve the issue above because the root caller will terminate, so the …
We should just have the callback launch a goroutine.
I think if we put a …
@ajwerner I was going to add a …
Do you think it's better to just put it inside the …
71140: sql: NFC-normalization on double-quoted identifiers r=rafiss a=e-mbrown

Resolves: #55396

There was a lack of NFC-normalization on double-quoted identifiers, which allowed for ambiguous table/db/column names. This change adds normalization and prevents this case.

Release note: None

71175: ui/cluster-ui: make app name a query search parameter in stmts page r=maryliag,matthewtodd a=xinhaoz

Fixes: #70790

Previously, the selected app was derived from a route param on the statements page, while all other filters are derived from query search parameters on the page. This commit makes the app name a query search parameter, as is the case on the transactions page.

Release note: None

71405: vendor: bump Pebble to 59007c613a66 r=jbowens a=nicktrav

```
59007c61 sstable: consolidate checksum types
17635b0a *: address staticcheck lints
11823273 *: address staticcheck lint check U1000
807abfe8 *: address staticcheck lint check S1039
d24dd342 all: remove some unused code
3bdca93a sstable: clarify comment on checksum input
926d23e9 pebble: remove comment related to batching from the table cache
0a6177ae db: tweak ErrClosed documentation
ecc685b2 db: expose facility for retrieving batch count
b2eb88a7 sstable: remove unused rawBlockIter err field
```

Release note: None.

71419: sqlliveness: session expiry callbacks must be async r=ajwerner,jaylim-crl a=dhartunian

Previously, the sqlliveness session expiry callbacks were called in the `heartbeatLoop` thread, which executed the renew/expiry logic. This could cause a deadlock, since session expiration is used to trigger a shutdown of the SQL instance via `stopper.Stop()`: the stopper would wait for all async tasks to quiesce, but the `heartbeatLoop` task would continue, waiting for the callbacks to finish running. In addition, this task would hold a lock on `l.mu` while waiting for the callbacks to run, causing other threads to block if they needed to retrieve the session. This change invokes each callback in its own goroutine to prevent this deadlock.

Resolves #71292

Release note: None

Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
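For illustration, here is a minimal Go sketch of the pattern #71419 describes — not the actual slinstance code; the `Instance` type and its methods are illustrative stand-ins. The callbacks are copied out under the lock and each one is fired in its own goroutine, so the heartbeat loop never blocks on them while holding the mutex:

```go
// Sketch of the #71419 approach with illustrative stand-in types: fire each
// registered session expiry callback in its own goroutine so the heartbeat
// loop never blocks on them while holding the lock.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Instance struct {
	mu struct {
		sync.Mutex
		callbacks []func()
	}
}

func (i *Instance) RegisterCallback(f func()) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.mu.callbacks = append(i.mu.callbacks, f)
}

// onSessionExpiry copies the callback slice under the lock, releases the
// lock, and then runs each callback asynchronously. The caller (the
// heartbeat loop) returns immediately, so a callback that stops the
// stopper cannot deadlock against it.
func (i *Instance) onSessionExpiry() {
	i.mu.Lock()
	cbs := append([]func(){}, i.mu.callbacks...)
	i.mu.Unlock()
	for _, f := range cbs {
		go f() // each callback runs outside the lock and outside this task
	}
}

func main() {
	var i Instance
	i.RegisterCallback(func() { fmt.Println("shutting down SQL instance") })
	i.onSessionExpiry()
	time.Sleep(100 * time.Millisecond) // give the goroutine a chance to run
}
```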
Thanks @dhartunian!
Clients can register callbacks with a sqlliveness instance, to be run on instance shutdown. These callbacks started off as sync, but were made async in cockroachdb#71419 in order to fix the deadlock described in cockroachdb#71292: one of the callbacks wants to stop the stopper, but the callbacks were being run inside a stopper task. Running all callbacks in goroutines was a less-than-ideal fix, because the async callbacks were getting access to the caller's tracing span, and that span might be finished by the time the callback runs (use-after-Finish). Generally speaking, spawning a goroutine is not something to be done lightly; usually, some form of structured concurrency should be used to prevent exactly this kind of problem. Spawning goroutines through stopper tasks is one example of such a concurrency structure, but it cannot be used in this particular case. Modules that don't really need concurrency are better off avoiding it, and the sqlliveness instance does not really need concurrency. Instead, this patch moves the burden of avoiding the deadlock to the one callback that needs it. That callback now spawns a goroutine, and it takes care not to capture the caller's span. This patch also moves the testing against the deadlock closer to where the deadlock was occurring, in sqlinstance/instanceprovider. Release note: None
73693: sqlliveness,sqlinstance: move back to sync callbacks r=andreimatei a=andreimatei

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
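As a rough illustration of the shape of that follow-up fix, the sketch below uses simplified stand-in types (`stopper`, `shutdownSQLInstance`, and `onSessionExpiry` are illustrative names, not CockroachDB's actual APIs): callbacks run synchronously again, and only the shutdown callback detaches into its own goroutine, using `context.Background()` so it cannot touch a tracing span the caller may have already finished:

```go
// Sketch of the #73693 approach with simplified stand-in types: callbacks
// run synchronously; only the shutdown callback goes async, and it
// deliberately avoids capturing the caller's context (and thus its span).
package main

import (
	"context"
	"fmt"
	"sync"
)

// stopper is a stand-in for CockroachDB's stop.Stopper.
type stopper struct {
	tasks sync.WaitGroup
}

// Stop waits for all registered tasks to finish (quiesce).
func (s *stopper) Stop(ctx context.Context) {
	s.tasks.Wait()
	fmt.Println("stopper stopped")
}

// shutdownSQLInstance is the one callback that needs to escape the caller's
// task: it spawns its own goroutine and uses context.Background() rather
// than the caller's ctx, so a span finished by the caller is never used
// after Finish.
func shutdownSQLInstance(s *stopper, done chan<- struct{}) {
	go func() {
		s.Stop(context.Background())
		close(done)
	}()
}

// onSessionExpiry runs callbacks synchronously again; any callback that
// needs concurrency arranges it itself.
func onSessionExpiry(callbacks []func()) {
	for _, f := range callbacks {
		f()
	}
}

func main() {
	var s stopper
	done := make(chan struct{})
	onSessionExpiry([]func(){func() { shutdownSQLInstance(&s, done) }})
	<-done // wait for the detached shutdown goroutine in this demo
}
```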
Describe the problem
Related to #66600.
We currently terminate the SQL process whenever a SQL instance session expires. At the moment when that happens, the stopper gets stuck in the quiescing state, and the process will appear to be stuck until it is forcibly terminated through a signal.

The root cause of this problem is that we're calling `stopper.Stop` within an async task. During the quiescing phase, the stopper waits for all tasks to terminate before returning, but that will never happen, because the goroutine that called `stopper.Stop` is an async task itself. `heartbeatLoop` is registered as an async task: `cockroach/pkg/sql/sqlliveness/slinstance/slinstance.go`, line 294 in 617beac.
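To make the shape of this deadlock concrete, here is a minimal, self-contained Go sketch, assuming a simplified stand-in for the real `stop.Stopper` (`miniStopper` and its methods are illustrative, not CockroachDB's API):

```go
// Illustrative analogue of the deadlock: Stop waits for all tasks to
// finish, but the goroutine calling Stop is itself one of those tasks,
// so Stop never returns. Running this trips Go's built-in detector:
// "fatal error: all goroutines are asleep - deadlock!".
package main

import "sync"

type miniStopper struct {
	tasks sync.WaitGroup
}

// RunAsyncTask tracks f as a task that Stop must wait for.
func (s *miniStopper) RunAsyncTask(f func()) {
	s.tasks.Add(1)
	go func() {
		defer s.tasks.Done()
		f()
	}()
}

// Stop blocks until every task has finished (quiesced).
func (s *miniStopper) Stop() {
	s.tasks.Wait()
}

func main() {
	var s miniStopper
	s.RunAsyncTask(func() {
		// This task decides the process must shut down (e.g. the session
		// expired) and calls Stop from inside itself. Stop waits for this
		// very task to finish, and the task waits for Stop: deadlock.
		s.Stop()
	})
	s.Stop() // main also blocks forever
}
```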
When a session expires, we invoke all the registered callbacks: `cockroach/pkg/sql/sqlliveness/slinstance/slinstance.go`, lines 137 to 147 in 617beac.
The registered callbacks, which happen to consist only of `shutdownSQLInstance`, invoke `stopper.Stop` to tell the process to terminate: `cockroach/pkg/sql/sqlinstance/instanceprovider/instanceprovider.go`, line 143 in 9208567.
A side effect of this is that we can also get deadlocks on mutexes. Before invoking the callbacks, we grab `l.mu`, as shown above. But `l.mu` is needed to obtain a live session from the SQL instance: `cockroach/pkg/sql/sqlliveness/slinstance/slinstance.go`, lines 298 to 312 in 9208567.
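The mutex flavor of the deadlock can be sketched the same way, again with made-up names standing in for the slinstance internals: the expiry path holds the mutex while it runs a callback that waits for other work to quiesce, and that work needs the mutex to read the session.

```go
// Sketch of the mutex deadlock with illustrative stand-ins for slinstance
// internals: the expiry path holds mu while it runs the callbacks
// synchronously; the callback waits for a worker that itself needs mu to
// read the session, so nobody makes progress and Go's runtime reports it.
package main

import "sync"

type instance struct {
	mu      sync.Mutex
	session string
}

// Session mimics slinstance's session getter: it needs mu.
func (i *instance) Session() string {
	i.mu.Lock()
	defer i.mu.Unlock()
	return i.session
}

func main() {
	i := &instance{session: "live"}

	i.mu.Lock() // expiry path holds the lock while running callbacks

	var worker sync.WaitGroup
	worker.Add(1)
	go func() {
		defer worker.Done()
		_ = i.Session() // blocks: mu is held by the expiry path
	}()

	// The callback behaves like stopper.Stop: it waits for outstanding
	// work to quiesce. The worker can never finish, because it needs mu.
	callback := func() { worker.Wait() }
	callback() // deadlock; the runtime's detector fires here
	i.mu.Unlock()
}
```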
Async tasks that go into the stopper can come in many forms. Based on my experiments, the problematic ones that I've seen are:

- `cockroach/pkg/sql/catalog/lease/lease.go`, lines 384 to 391 in 617beac
- `cockroach/pkg/jobs/registry.go`, lines 887 to 911 in 617beac
Both of these will eventually try to obtain a live session through `sqlinstance.Session` here and here. These two tasks may be present in the stopper when it stops as a result of the expiry of a SQL instance session. The code to stop the process grabs `l.mu`, but these tasks require `l.mu` to proceed, so we have a deadlock.

To Reproduce
Trying to reproduce a situation where only one task (the root caller) is left in the stopper is difficult. However, a deadlock scenario on the mutexes can easily be reproduced by running TPCH queries with high concurrency for a few minutes on the SQL pod. This assumes a tenant with just one SQL pod, so all queries will be hitting it. The query above is going to cause overload on the SQL pod, which has a memory limit of 16GB: when running that query, current SQL memory ramps up to 4GB, which is 25% of 16GB, per the `--max-sql-memory` flag.

Expected behavior
The SQL pod should terminate properly whenever a SQL instance session has expired.
Additional context
Some Slack conversation can be found here: https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1633564911120200. I did not post logs as they contain CC-specific information. This is causing instability in Serverless within the custom autoscaler logic, because the pods will not shut down. Marking it as a release-blocker as per @andy-kimball.

/cc @rimadeodhar @ajwerner @dhartunian