
sql: deadlock when shutting down SQL instance #71292

Closed
jaylim-crl opened this issue Oct 7, 2021 · 5 comments · Fixed by #71419
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@jaylim-crl
Collaborator

Describe the problem

Related to #66600.

We currently terminate the SQL process whenever a SQL instance session expires.
When that happens, the stopper gets stuck in the quiescing state,
and the process appears to be stuck until it is forcibly terminated through a signal.

The root cause of this problem is that we're calling stopper.Stop within an async
task. During the quiescing phase, the stopper waits for all tasks to terminate before
returning, but that will never happen because the goroutine that called stopper.Stop
is an async task itself.

heartbeatLoop is registered as an async task:

_ = l.stopper.RunAsyncTask(ctx, "slinstance", l.heartbeatLoop)
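To make the failure mode concrete, here is a minimal, self-contained sketch of the same shape of self-deadlock. It deliberately does not use the real util/stop package; miniStopper, runAsyncTask, and stop are illustrative stand-ins for the stopper's task tracking and quiescence.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// miniStopper is a toy stand-in for the stopper: it only tracks async tasks.
type miniStopper struct {
	tasks sync.WaitGroup
}

// runAsyncTask registers f as a tracked task and runs it in a goroutine.
func (s *miniStopper) runAsyncTask(f func()) {
	s.tasks.Add(1)
	go func() {
		defer s.tasks.Done()
		f()
	}()
}

// stop waits for every tracked task to finish, like the quiescing phase.
func (s *miniStopper) stop() {
	s.tasks.Wait()
}

func main() {
	s := &miniStopper{}
	done := make(chan struct{})

	// The "heartbeat loop": an async task that reacts to session expiry by
	// calling stop() inline, the way shutdownSQLInstance ends up being
	// invoked from within heartbeatLoop.
	s.runAsyncTask(func() {
		s.stop() // waits for this very task to finish, so it never returns
		close(done)
	})

	select {
	case <-done:
		fmt.Println("stopped cleanly")
	case <-time.After(time.Second):
		fmt.Println("deadlock: stop() is waiting on the task that called it")
	}
}
```

The stop call can only return once every tracked task has finished, but the task that issued it can only finish once stop returns.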

When a session expires, we invoke all the registered callbacks:

func (l *Instance) clearSession(ctx context.Context) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if expiration := l.mu.s.Expiration(); expiration.Less(l.clock.Now()) {
		// If the session has expired, invoke the session expiry callbacks
		// associated with the session.
		l.mu.s.invokeSessionExpiryCallbacks(ctx)
	}
	l.mu.s = nil
	l.mu.blockCh = make(chan struct{})
}

The only registered callback, shutdownSQLInstance, invokes
stopper.Stop to tell the process to terminate.

A side effect of this is that we can also get deadlocks on mutexes. Before invoking
the callbacks, we grab l.mu as shown above. But l.mu is also needed to obtain a
live session from the SQL instance:

// Session returns a live session id. For each Sqlliveness instance the
// invariant is that there exists at most one live session at any point in time.
// If the current one has expired then a new one is created.
func (l *Instance) Session(ctx context.Context) (sqlliveness.Session, error) {
	if l.testKnobs.SessionOverride != nil {
		if s, err := l.testKnobs.SessionOverride(ctx); s != nil || err != nil {
			return s, err
		}
	}
	l.mu.Lock()
	if !l.mu.started {
		l.mu.Unlock()
		return nil, sqlliveness.NotStartedError
	}
	l.mu.Unlock()

Async tasks that go into the stopper can come in many forms. Based on my
experiments, the problematic ones that I've seen are:

  1. // Release to the store asynchronously, without the descriptorState lock.
     if err := m.stopper.RunAsyncTask(
         ctx, "sql.descriptorState: releasing descriptor lease",
         func(ctx context.Context) {
             m.storage.release(ctx, m.stopper, lease)
         }); err != nil {
         log.Warningf(ctx, "error: %s, not releasing lease: %q", err, lease)
     }

  2. return stopper.RunAsyncTask(context.Background(), "jobs/adopt", func(ctx context.Context) {
         ctx, cancel := stopper.WithCancelOnQuiesce(ctx)
         defer cancel()
         lc, cleanup := makeLoopController(r.settings, adoptIntervalSetting, r.knobs.IntervalOverrides.Adopt)
         defer cleanup()
         for {
             select {
             case <-lc.updated:
                 lc.onUpdate()
             case <-stopper.ShouldQuiesce():
                 return
             case shouldClaim := <-r.adoptionCh:
                 // Try to adopt the most recently created job.
                 if shouldClaim {
                     claimJobs(ctx)
                 }
                 processClaimedJobs(ctx)
             case <-lc.timer.C:
                 lc.timer.Read = true
                 claimJobs(ctx)
                 processClaimedJobs(ctx)
                 lc.onExecute()
             }
         }
     })

Both of these will eventually try to obtain a live session through
sqlinstance.Session, here and here.

These two tasks may be present in the stopper when it stops as a result of the
expiry of a SQL instance session. The code to stop the process grabs l.mu,
but these tasks require l.mu to proceed, so we have a deadlock.
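The mutex half of the deadlock can be sketched in the same hedged way; the names below (mu, workers) are illustrative stand-ins, not the slinstance code. The expiry path holds the lock while waiting for workers to drain, and the workers need that same lock (as Session does) before they can finish:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex // stands in for l.mu
	var workers sync.WaitGroup

	// The expiry path (clearSession) grabs the lock before invoking callbacks.
	mu.Lock()

	// A worker that, like the lease-release and job-adoption tasks above,
	// must acquire the lock (as Session does) before it can finish.
	workers.Add(1)
	go func() {
		defer workers.Done()
		mu.Lock() // blocks forever: the expiry path still holds the lock
		mu.Unlock()
	}()

	// The shutdown callback (like stopper.Stop) waits for the workers to
	// drain, but they are stuck behind the lock held by the expiry path.
	done := make(chan struct{})
	go func() {
		workers.Wait()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("shut down cleanly")
	case <-time.After(time.Second):
		fmt.Println("deadlock: shutdown waits on workers that need the held lock")
	}
}
```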

To Reproduce

Reproducing a situation where only one task (the root caller) is left in the stopper
is difficult. However, a deadlock on the mutexes can easily be reproduced
by running TPCH queries with high concurrency against the SQL pod for a few minutes.
This assumes a tenant with just one SQL pod, so all queries will hit it.

cockroach workload run tpch '<URL>'  --queries 1 --concurrency=1024 --tolerate-errors

The command above overloads a SQL pod that has a memory limit of 16GB.
While it runs, current SQL memory ramps up to 4GB, which is 25% of
16GB per the --max-sql-memory flag.

Expected behavior

The SQL pod should terminate properly whenever a SQL instance session has expired.

Additional context

Some Slack conversation can be found here: https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1633564911120200. I did not post logs as they contain CC-specific information. This is causing instability in Serverless within the custom autoscaler logic because the pods will not shut down. Marking this as a release-blocker per @andy-kimball.

/cc @rimadeodhar @ajwerner @dhartunian

@jaylim-crl jaylim-crl added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-21.2 labels Oct 7, 2021
@jaylim-crl
Collaborator Author

One solution that was discussed is to run the callbacks here asynchronously in goroutines. I'm not super familiar with the code, so I don't know if that is safe. Running the callbacks in goroutines would solve the issue above: the root caller would terminate, so the heartbeatLoop task in the stopper could finish. It also would not cause mutex deadlocks, because returning from the root caller releases l.mu, allowing tasks that require it to proceed.
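For illustration, here is a hedged sketch of what "run the callbacks in goroutines" could look like; the session type and its callbacks field are assumed names for this example only, not the actual slinstance code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// session is a stand-in for the slinstance session; the callbacks field is an
// assumed name used only for this illustration.
type session struct {
	callbacks []func(ctx context.Context)
}

// invokeSessionExpiryCallbacksAsync runs each expiry callback in its own
// goroutine, so the caller (heartbeatLoop, holding l.mu) returns immediately
// instead of blocking on a callback that calls stopper.Stop.
func (s *session) invokeSessionExpiryCallbacksAsync(ctx context.Context) {
	for _, callback := range s.callbacks {
		callback := callback // capture the loop variable for the goroutine
		go callback(ctx)
	}
}

func main() {
	s := &session{callbacks: []func(ctx context.Context){
		func(ctx context.Context) { fmt.Println("shutdownSQLInstance: stopping the stopper") },
	}}
	s.invokeSessionExpiryCallbacksAsync(context.Background())
	time.Sleep(100 * time.Millisecond) // give the detached callback time to run
}
```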

@ajwerner
Contributor

We should just have the callback launch a goroutine.

@ajwerner
Contributor

I think if we put a go here, it'll fix our problem:

@dhartunian
Collaborator

@ajwerner I was going to add a go prior to the callback calls here:

Do you think it's better to just put it inside the shutdownSQLInstance callback only?

dhartunian added a commit to dhartunian/cockroach that referenced this issue Oct 11, 2021
Previously, the sqlliveness session expiry callbacks were called in the
heartbeatLoop thread which executed the renew/expiry logic. This could
cause deadlock since session expiration is used to trigger a shutdown of
the SQL instance via `stopper.Stop()`. The stopper would wait for all
async tasks to quiesce, but the `heartbeatLoop` would continue, waiting
for the callbacks to finish running. In addition, this task would hold a
lock on `l.mu` while waiting for the callbacks to run causing other
threads to wait if they needed to retrieve the session.

This change invokes each callback in its own goroutine to prevent this
deadlock.

Resolves cockroachdb#71292

Release note: None
craig bot pushed a commit that referenced this issue Oct 12, 2021
71140: sql: NFC-normalization on double-quoted identifiers r=rafiss a=e-mbrown

Resolves: #55396

There was a lack of NFC-normalization on double-quoted
identifiers. This allowed for ambiguous table/db/column names.
This change adds normalization and prevents this case.

Release note: None

71175: ui/cluster-ui: make app name a query search parameter in stmts page r=maryliag,matthewtodd a=xinhaoz

Fixes: #70790

Previously, the selected app was derived from a route param on the
statements page. All other filters are derived from query search
parameters on the page. This commit makes the app name a query search
parameter, as is the case in the transactions page.

Release note: None

71405: vendor: bump Pebble to 59007c613a66 r=jbowens a=nicktrav

```
59007c61 sstable: consolidate checksum types
17635b0a *: address staticcheck lints
11823273 *: address staticcheck lint check U1000
807abfe8 *: address staticcheck lint check S1039
d24dd342 all: remove some unused code
3bdca93a sstable: clarify comment on checksum input
926d23e9 pebble: remove comment related to batching from the table cache
0a6177ae db: tweak ErrClosed documentation
ecc685b2 db: expose facility for retrieving batch count
b2eb88a7 sstable: remove unused rawBlockIter err field
```

Release note: None.

71419: sqlliveness: session expiry callbacks must be async r=ajwerner,jaylim-crl a=dhartunian

Previously, the sqlliveness session expiry callbacks were called in the
heartbeatLoop thread which executed the renew/expiry logic. This could
cause deadlock since session expiration is used to trigger a shutdown of
the SQL instance via `stopper.Stop()`. The stopper would wait for all
async tasks to quiesce, but the `heartbeatLoop` would continue, waiting
for the callbacks to finish running. In addition, this task would hold a
lock on `l.mu` while waiting for the callbacks to run causing other
threads to wait if they needed to retrieve the session.

This change invokes each callback in its own goroutine to prevent this
deadlock.

Resolves #71292

Release note: None

Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
@craig craig bot closed this as completed in 3586291 Oct 12, 2021
@jaylim-crl
Collaborator Author

Thanks @dhartunian!

nehageorge pushed a commit to nehageorge/cockroach that referenced this issue Oct 12, 2021
andy-kimball pushed a commit that referenced this issue Oct 12, 2021
ericharmeling pushed a commit to ericharmeling/cockroach that referenced this issue Oct 20, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Dec 10, 2021
Clients can register callbacks with a sqlliveness instance, to be run on
instance shutdown. These callbacks started off as sync, but were made
async in cockroachdb#71419 in order to fix the deadlock described in cockroachdb#71292: one of
the callbacks wants to stop the stopper, but the callbacks were being
run inside a stopper task. Running all callbacks in goroutine was a less
than ideal fix, because the async callbacks were getting access to the
caller's tracing span. That span might get finished by the time the
callback runs (use-after-Finish).
Generally speaking, spawning a goroutine is not something to be done
lightly. Usually, some form of structured concurrency should be used to
prevent exactly this kind of problem. Spawning goroutines through
stopper tasks is one example of a concurrency structure, but it cannot
be used in this particular case. Modules that don't really need
concurrency are better off avoiding it, and the sqlliveness instance does
not really need concurrency.

Instead, this patch moves the burden of avoiding the deadlock to the one
callback that needs it. That guy now spawns a goroutine, and it takes
care to not capture the caller's span.

This patch also moves the testing against the deadlock closer to where
the deadlock was occurring, in sqlinstance/instanceprovider.

Release note: None
craig bot pushed a commit that referenced this issue Dec 14, 2021
73693: sqlliveness,sqlinstance: move back to sync callbacks  r=andreimatei a=andreimatei

Clients can register callbacks with a sqlliveness instance, to be run on
instance shutdown. These callbacks started off as sync, but were made
async in #71419 in order to fix the deadlock described in #71292: one of
the callbacks wants to stop the stopper, but the callbacks were being
run inside a stopper task. Running all callbacks in goroutine was a less
than ideal fix, because the async callbacks were getting access to the
caller's tracing span. That span might get finished by the time the
callback runs (use-after-Finish).
Generally speaking, spawning a goroutine is not something to be done
lightly. Usually, some form of structured concurrency should be used to
prevent exactly this kind of problem. Spawning goroutines through
stopper tasks is one example of a concurrency structure, but it cannot
be used in this particular case. Modules that don't really need
concurrency are better off avoiding it, and the sqlliveness instance does
not really need concurrency.

Instead, this patch moves the burden of avoiding the deadlock to the one
callback that needs it. That guy now spawns a goroutine, and it takes
care to not capture the caller's span.

This patch also moves the testing against the deadlock closer to where
the deadlock was occurring, in sqlinstance/instanceprovider.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
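To round this out, a hedged sketch of the follow-up direction described in #73693: the callbacks stay synchronous, and only the shutdown callback detaches itself into a goroutine that uses a fresh background context rather than the caller's (so it cannot capture the caller's soon-finished tracing span). The types below are stand-ins, not the actual instanceprovider code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stopperLike is a stand-in for *stop.Stopper; only the call shape matters here.
type stopperLike struct{}

func (s *stopperLike) Stop(ctx context.Context) { fmt.Println("stopper stopped") }

// instanceProvider is a stand-in for the sqlinstance instance provider.
type instanceProvider struct {
	stopper *stopperLike
}

// shutdownSQLInstance remains a synchronous callback, but hands the actual
// stop off to a goroutine with a fresh background context, so it neither
// blocks the heartbeat loop nor captures the caller's tracing span.
func (p *instanceProvider) shutdownSQLInstance(_ context.Context) {
	go func() {
		p.stopper.Stop(context.Background())
	}()
}

func main() {
	p := &instanceProvider{stopper: &stopperLike{}}
	p.shutdownSQLInstance(context.Background())
	time.Sleep(100 * time.Millisecond) // let the detached goroutine run
}
```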