internal/state: avoid deadlock caused by mid-flight job change #818

Merged

radeksimko merged 2 commits into main from b-avoid-duplicate-job-await on Mar 4, 2022

Conversation

radeksimko (Member)

There is a relatively rare case where we may end up processing the same job twice here:

func (js *JobStore) awaitNextJob(ctx context.Context, openDir bool) (job.ID, job.Job, error) {
	// Read-only transaction: look up the first queued job matching
	// this scheduler's directory state (open vs. closed).
	txn := js.db.Txn(false)
	wCh, obj, err := txn.FirstWatch(js.tableName, "is_dir_open_state", openDir, StateQueued)
	if err != nil {
		return "", job.Job{}, err
	}
	if obj == nil {
		// No queued job yet; block until one appears or the context
		// is cancelled, then retry.
		select {
		case <-wCh:
		case <-ctx.Done():
			return "", job.Job{}, ctx.Err()
		}
		return js.awaitNextJob(ctx, openDir)
	}
	sJob := obj.(*ScheduledJob)

	// Race window: between the read above and the mutex-guarded write
	// below, another scheduler may claim the same job.
	err = js.markJobAsRunning(sJob)
	if err != nil {
		return "", job.Job{}, err
	}

	return sJob.ID, sJob.Job, nil
}

  1. a job, e.g. ID 42, is found by the "closed scheduler" (*)
  2. in the split second between finding this job and marking it as running (which itself is correctly guarded by an internal memdb mutex), the involved directory is opened
  3. the same job (ID 42) is then also found by the "open scheduler" (**)
  4. whichever scheduler obtains the memdb mutex first marks the job as running
  5. the other scheduler then waits for the mutex and attempts to do the same
  6. because that fails, we return an error, which effectively shuts the scheduler down and practically causes a deadlock anywhere we wait for jobs to finish, as they would never finish (see the sketch after the footnotes below)

(*) - the scheduler responsible for processing jobs involving directories which are not open
(**) - the scheduler responsible for processing jobs involving directories which are open
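
Judging by the PR titles, the remedy is to treat the lost race as benign: the scheduler that fails to claim the job goes back to awaiting the next one, instead of returning an error and exiting. Below is a minimal, self-contained sketch of that pattern in Go; the names (store, markJobAsRunning, errJobAlreadyRunning) are hypothetical stand-ins for illustration, not the actual job store API:

package main

import (
	"errors"
	"fmt"
	"sync"
)

var errJobAlreadyRunning = errors.New("job already running")

type store struct {
	mu      sync.Mutex
	running map[int]bool
}

// markJobAsRunning claims the job under the mutex, or reports that
// another scheduler already did.
func (s *store) markJobAsRunning(id int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.running[id] {
		return errJobAlreadyRunning
	}
	s.running[id] = true
	return nil
}

func main() {
	s := &store{running: map[int]bool{}}
	var wg sync.WaitGroup
	for _, name := range []string{"closed scheduler", "open scheduler"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// Both schedulers have already "found" job 42 (steps 1
			// and 3 above) before either marked it as running.
			err := s.markJobAsRunning(42)
			if errors.Is(err, errJobAlreadyRunning) {
				// Losing the race is benign: await the next job
				// instead of returning an error and shutting down.
				fmt.Println(name, "lost the race; awaiting next job")
				return
			}
			fmt.Println(name, "claimed job 42")
		}(name)
	}
	wg.Wait()
}

The key point of the pattern is that a duplicate claim is an expected outcome of the read-then-write window, not a failure of the scheduler itself, so it must not propagate as an error.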

While this may be rare, I ran into that bug myself this morning, as can be seen from this log snippet:
https://gist.github.com/radeksimko/7ce4e9a7d0c8bbc65bd65e277dc3bd92

@radeksimko radeksimko added the bug Something isn't working label Mar 3, 2022
@radeksimko radeksimko added this to the v0.26.0 milestone Mar 3, 2022
@radeksimko radeksimko changed the title from "internal/state: avoid shutting down scheduler on duplicate job" to "internal/state: avoid deadlock caused by mid-flight job change" on Mar 3, 2022

Multiple runs may be necessary to trigger the bug; 10 seems to be enough in my environment:

go test ./internal/scheduler -run=TestScheduler_closedAndOpen -count=10 -timeout=3s
@radeksimko radeksimko requested a review from a team March 4, 2022 09:27
@dbanck dbanck (Member) left a comment:
Nice catch and excellent test case!

@radeksimko radeksimko merged commit 4f16943 into main Mar 4, 2022
@radeksimko radeksimko deleted the b-avoid-duplicate-job-await branch March 4, 2022 13:42
@radeksimko radeksimko self-assigned this Mar 4, 2022
@github-actions

This functionality has been released in v0.26.0 of the language server.
If you use the official Terraform VS Code extension, it will prompt you to upgrade to this version automatically upon next launch or within the next 24 hours.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2022