
sql: deadlock when shutting down SQL instance #71292

Closed
jaylim-crl opened this issue Oct 7, 2021 · 5 comments · Fixed by #71419
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@jaylim-crl
Collaborator

Describe the problem

Related to #66600.

We currently terminate the SQL process whenever a SQL instance session expires.
When that happens, the stopper gets stuck in the quiescing state,
and the process appears to be stuck until it is forcibly terminated through a signal.

The root cause of this problem is that we're calling stopper.Stop within an async
task. During the quiescing phase, the stopper waits for all tasks to terminate before
returning, but that will never happen because the goroutine that called stopper.Stop
is an async task itself.

heartbeatLoop is registered as an async task:

_ = l.stopper.RunAsyncTask(ctx, "slinstance", l.heartbeatLoop)
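To make the failure mode concrete, here is a minimal, self-contained sketch of the same shape of self-deadlock. It deliberately does not use the real util/stop package; miniStopper, runAsyncTask, and stop are illustrative stand-ins for the stopper's task tracking and quiescence.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// miniStopper is a toy stand-in for the stopper: it only tracks async tasks.
type miniStopper struct {
	tasks sync.WaitGroup
}

// runAsyncTask registers f as a tracked task and runs it in a goroutine.
func (s *miniStopper) runAsyncTask(f func()) {
	s.tasks.Add(1)
	go func() {
		defer s.tasks.Done()
		f()
	}()
}

// stop waits for every tracked task to finish, like the quiescing phase.
func (s *miniStopper) stop() {
	s.tasks.Wait()
}

func main() {
	s := &miniStopper{}
	done := make(chan struct{})

	// The "heartbeat loop": an async task that reacts to session expiry by
	// calling stop() inline, the way shutdownSQLInstance ends up being
	// invoked from within heartbeatLoop.
	s.runAsyncTask(func() {
		s.stop() // waits for this very task to finish, so it never returns
		close(done)
	})

	select {
	case <-done:
		fmt.Println("stopped cleanly")
	case <-time.After(time.Second):
		fmt.Println("deadlock: stop() is waiting on the task that called it")
	}
}
```

The stop call can only return once every tracked task has finished, but the task that issued it can only finish once stop returns.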

When a session expires, we invoke all the registered callbacks:

func (l *Instance) clearSession(ctx context.Context) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if expiration := l.mu.s.Expiration(); expiration.Less(l.clock.Now()) {
		// If the session has expired, invoke the session expiry callbacks
		// associated with the session.
		l.mu.s.invokeSessionExpiryCallbacks(ctx)
	}
	l.mu.s = nil
	l.mu.blockCh = make(chan struct{})
}

The only registered callback, shutdownSQLInstance, invokes
stopper.Stop to tell the process to terminate.

A side effect of this is that we can also get deadlocks on mutexes. Before invoking
the callbacks, we grab l.mu as shown above. But l.mu is also needed to obtain a
live session from the SQL instance:

// Session returns a live session id. For each Sqlliveness instance the
// invariant is that there exists at most one live session at any point in time.
// If the current one has expired then a new one is created.
func (l *Instance) Session(ctx context.Context) (sqlliveness.Session, error) {
	if l.testKnobs.SessionOverride != nil {
		if s, err := l.testKnobs.SessionOverride(ctx); s != nil || err != nil {
			return s, err
		}
	}
	l.mu.Lock()
	if !l.mu.started {
		l.mu.Unlock()
		return nil, sqlliveness.NotStartedError
	}
	l.mu.Unlock()

Async tasks that go into the stopper can come in many forms. Based on my
experiments, the problematic ones that I've seen are:

  1. // Release to the store asynchronously, without the descriptorState lock.
     if err := m.stopper.RunAsyncTask(
         ctx, "sql.descriptorState: releasing descriptor lease",
         func(ctx context.Context) {
             m.storage.release(ctx, m.stopper, lease)
         }); err != nil {
         log.Warningf(ctx, "error: %s, not releasing lease: %q", err, lease)
     }

  2. return stopper.RunAsyncTask(context.Background(), "jobs/adopt", func(ctx context.Context) {
         ctx, cancel := stopper.WithCancelOnQuiesce(ctx)
         defer cancel()
         lc, cleanup := makeLoopController(r.settings, adoptIntervalSetting, r.knobs.IntervalOverrides.Adopt)
         defer cleanup()
         for {
             select {
             case <-lc.updated:
                 lc.onUpdate()
             case <-stopper.ShouldQuiesce():
                 return
             case shouldClaim := <-r.adoptionCh:
                 // Try to adopt the most recently created job.
                 if shouldClaim {
                     claimJobs(ctx)
                 }
                 processClaimedJobs(ctx)
             case <-lc.timer.C:
                 lc.timer.Read = true
                 claimJobs(ctx)
                 processClaimedJobs(ctx)
                 lc.onExecute()
             }
         }
     })

Both of these will eventually try to obtain a live session through
sqlinstance.Session, here and here.

These two tasks may be present in the stopper when it stops as a result of the
expiry of a SQL instance session. The code to stop the process grabs l.mu,
but these tasks require l.mu to proceed, so we have a deadlock.
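The mutex half of the deadlock can be sketched in the same hedged way; the names below (mu, workers) are illustrative stand-ins, not the slinstance code. The expiry path holds the lock while waiting for workers to drain, and the workers need that same lock (as Session does) before they can finish:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex // stands in for l.mu
	var workers sync.WaitGroup

	// The expiry path (clearSession) grabs the lock before invoking callbacks.
	mu.Lock()

	// A worker that, like the lease-release and job-adoption tasks above,
	// must acquire the lock (as Session does) before it can finish.
	workers.Add(1)
	go func() {
		defer workers.Done()
		mu.Lock() // blocks forever: the expiry path still holds the lock
		mu.Unlock()
	}()

	// The shutdown callback (like stopper.Stop) waits for the workers to
	// drain, but they are stuck behind the lock held by the expiry path.
	done := make(chan struct{})
	go func() {
		workers.Wait()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("shut down cleanly")
	case <-time.After(time.Second):
		fmt.Println("deadlock: shutdown waits on workers that need the held lock")
	}
}
```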

To Reproduce

Reproducing a situation where only one task (the root caller) is left in the stopper
is difficult. However, a deadlock on the mutexes can easily be reproduced
by running TPCH queries with high concurrency against the SQL pod for a few minutes.
This assumes a tenant with just one SQL pod, so all queries will hit it.

cockroach workload run tpch '<URL>'  --queries 1 --concurrency=1024 --tolerate-errors

The command above overloads a SQL pod that has a memory limit of 16GB.
While it runs, current SQL memory ramps up to 4GB, which is 25% of
16GB per the --max-sql-memory flag.

Expected behavior

The SQL pod should terminate properly whenever a SQL instance session has expired.

Additional context

Some Slack conversation can be found here: https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1633564911120200. I did not post logs as they contain CC-specific information. This is causing instability in Serverless within the custom autoscaler logic because the pods will not shut down. Marking this as a release-blocker per @andy-kimball.

/cc @rimadeodhar @ajwerner @dhartunian

@jaylim-crl jaylim-crl added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-21.2 labels Oct 7, 2021
@jaylim-crl
Collaborator Author

One solution that was discussed is to run the callbacks here asynchronously in goroutines. I'm not super familiar with the code, so I don't know if that is safe. Running the callbacks in goroutines would solve the issue above: the root caller would terminate, so the heartbeatLoop task in the stopper could finish. It also would not cause mutex deadlocks, because returning from the root caller releases l.mu, allowing tasks that require it to proceed.
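For illustration, here is a hedged sketch of what "run the callbacks in goroutines" could look like; the session type and its callbacks field are assumed names for this example only, not the actual slinstance code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// session is a stand-in for the slinstance session; the callbacks field is an
// assumed name used only for this illustration.
type session struct {
	callbacks []func(ctx context.Context)
}

// invokeSessionExpiryCallbacksAsync runs each expiry callback in its own
// goroutine, so the caller (heartbeatLoop, holding l.mu) returns immediately
// instead of blocking on a callback that calls stopper.Stop.
func (s *session) invokeSessionExpiryCallbacksAsync(ctx context.Context) {
	for _, callback := range s.callbacks {
		callback := callback // capture the loop variable for the goroutine
		go callback(ctx)
	}
}

func main() {
	s := &session{callbacks: []func(ctx context.Context){
		func(ctx context.Context) { fmt.Println("shutdownSQLInstance: stopping the stopper") },
	}}
	s.invokeSessionExpiryCallbacksAsync(context.Background())
	time.Sleep(100 * time.Millisecond) // give the detached callback time to run
}
```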

@ajwerner
Contributor

We should just have the callback launch a goroutine.

@ajwerner
Contributor

I think if we put a go here, it'll fix our problem:

@dhartunian
Collaborator

@ajwerner I was going to add a go prior to the callback calls here:

Do you think it's better to just put it inside the shutdownSQLInstance callback only?

dhartunian added a commit to dhartunian/cockroach that referenced this issue Oct 11, 2021
Previously, the sqlliveness session expiry callbacks were called in the
heartbeatLoop thread which executed the renew/expiry logic. This could
cause deadlock since session expiration is used to trigger a shutdown of
the SQL instance via `stopper.Stop()`. The stopper would wait for all
async tasks to quiesce, but the `heartbeatLoop` would continue, waiting
for the callbacks to finish running. In addition, this task would hold a
lock on `l.mu` while waiting for the callbacks to run causing other
threads to wait if they needed to retrieve the session.

This change invokes each callback in its own goroutine to prevent this
deadlock.

Resolves cockroachdb#71292

Release note: None
craig bot pushed a commit that referenced this issue Oct 12, 2021
71140: sql: NFC-normalization on double-quoted identifiers r=rafiss a=e-mbrown

Resolves: #55396

There was a lack of NFC-normalization on double-quoted
identifiers. This allowed for ambiguous table/db/column names.
This change adds normalization and prevents this case.

Release note: None

71175: ui/cluster-ui: make app name a query search parameter in stmts page r=maryliag,matthewtodd a=xinhaoz

Fixes: #70790

Previously, the selected app was derived from a route param on the
statements page. All other filters are derived from query search
parameters on the page. This commit makes the app name a query search
parameter, as is the case in the transactions page.

Release note: None

71405: vendor: bump Pebble to 59007c613a66 r=jbowens a=nicktrav

```
59007c61 sstable: consolidate checksum types
17635b0a *: address staticcheck lints
11823273 *: address staticcheck lint check U1000
807abfe8 *: address staticcheck lint check S1039
d24dd342 all: remove some unused code
3bdca93a sstable: clarify comment on checksum input
926d23e9 pebble: remove comment related to batching from the table cache
0a6177ae db: tweak ErrClosed documentation
ecc685b2 db: expose facility for retrieving batch count
b2eb88a7 sstable: remove unused rawBlockIter err field
```

Release note: None.

71419: sqlliveness: session expiry callbacks must be async r=ajwerner,jaylim-crl a=dhartunian

Previously, the sqlliveness session expiry callbacks were called in the
heartbeatLoop thread which executed the renew/expiry logic. This could
cause deadlock since session expiration is used to trigger a shutdown of
the SQL instance via `stopper.Stop()`. The stopper would wait for all
async tasks to quiesce, but the `heartbeatLoop` would continue, waiting
for the callbacks to finish running. In addition, this task would hold a
lock on `l.mu` while waiting for the callbacks to run causing other
threads to wait if they needed to retrieve the session.

This change invokes each callback in its own goroutine to prevent this
deadlock.

Resolves #71292

Release note: None

Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
@craig craig bot closed this as completed in 3586291 Oct 12, 2021
@jaylim-crl
Collaborator Author

Thanks @dhartunian!

nehageorge pushed a commit to nehageorge/cockroach that referenced this issue Oct 12, 2021
andy-kimball pushed a commit that referenced this issue Oct 12, 2021
ericharmeling pushed a commit to ericharmeling/cockroach that referenced this issue Oct 20, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Dec 10, 2021
Clients can register callbacks with a sqlliveness instance, to be run on
instance shutdown. These callbacks started off as sync, but were made
async in cockroachdb#71419 in order to fix the deadlock described in cockroachdb#71292: one of
the callbacks wants to stop the stopper, but the callbacks were being
run inside a stopper task. Running all callbacks in goroutine was a less
than ideal fix, because the async callbacks were getting access to the
caller's tracing span. That span might get finished by the time the
callback runs (use-after-Finish).
Generally speaking, spawning a goroutine is not something to be done
lightly. Usually, some form of structured concurrency should be used to
prevent exactly this kind of problem. Spawning goroutines through
stopper tasks is one example of a concurrency structure, but it cannot
be used in this particular case. Modules that don't really need
concurrency are better off avoiding it, and the sqlliveness instance does
not really need concurrency.

Instead, this patch moves the burden of avoiding the deadlock to the one
callback that needs it. That guy now spawns a goroutine, and it takes
care to not capture the caller's span.

This patch also moves the testing against the deadlock closer to where
the deadlock was occurring, in sqlinstance/instanceprovider.

Release note: None
craig bot pushed a commit that referenced this issue Dec 14, 2021
73693: sqlliveness,sqlinstance: move back to sync callbacks  r=andreimatei a=andreimatei

Clients can register callbacks with a sqlliveness instance, to be run on
instance shutdown. These callbacks started off as sync, but were made
async in #71419 in order to fix the deadlock described in #71292: one of
the callbacks wants to stop the stopper, but the callbacks were being
run inside a stopper task. Running all callbacks in goroutine was a less
than ideal fix, because the async callbacks were getting access to the
caller's tracing span. That span might get finished by the time the
callback runs (use-after-Finish).
Generally speaking, spawning a goroutine is not something to be done
lightly. Usually, some form of structured concurrency should be used to
prevent exactly this kind of problem. Spawning goroutines through
stopper tasks is one example of a concurrency structure, but it cannot
be used in this particular case. Modules that don't really need
concurrency are better off avoiding it, and the sqlliveness instance does
not really need concurrency.

Instead, this patch moves the burden of avoiding the deadlock to the one
callback that needs it. That guy now spawns a goroutine, and it takes
care to not capture the caller's span.

This patch also moves the testing against the deadlock closer to where
the deadlock was occurring, in sqlinstance/instanceprovider.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
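To round this out, a hedged sketch of the follow-up direction described in #73693: the callbacks stay synchronous, and only the shutdown callback detaches itself into a goroutine that uses a fresh background context rather than the caller's (so it cannot capture the caller's soon-finished tracing span). The types below are stand-ins, not the actual instanceprovider code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stopperLike is a stand-in for *stop.Stopper; only the call shape matters here.
type stopperLike struct{}

func (s *stopperLike) Stop(ctx context.Context) { fmt.Println("stopper stopped") }

// instanceProvider is a stand-in for the sqlinstance instance provider.
type instanceProvider struct {
	stopper *stopperLike
}

// shutdownSQLInstance remains a synchronous callback, but hands the actual
// stop off to a goroutine with a fresh background context, so it neither
// blocks the heartbeat loop nor captures the caller's tracing span.
func (p *instanceProvider) shutdownSQLInstance(_ context.Context) {
	go func() {
		p.stopper.Stop(context.Background())
	}()
}

func main() {
	p := &instanceProvider{stopper: &stopperLike{}}
	p.shutdownSQLInstance(context.Background())
	time.Sleep(100 * time.Millisecond) // let the detached goroutine run
}
```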