Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3. fix(state): prevent watch channel deadlocks in the state #3870

Merged
merged 6 commits into from
Mar 17, 2022

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Mar 14, 2022

Motivation

This PR makes watch channel deadlocks in the state impossible, which avoids deadlocks. (And time-consuming investigations into deadlock causes.)

It also avoids some livelocks, and some performance issues. (Clones are cheap, lock contention is expensive.)

Specifications

If the watch receiver holds a read lock, then the sender tries to get the write lock, then another receiver tries to get another read lock, the watch channel can deadlock.

See the "potential deadlock example" in:
https://docs.rs/tokio/latest/tokio/sync/watch/struct.Receiver.html#method.borrow

Designs

The WatchReceiver wrapper clones the watched data before allowing access to it, then explicitly drops the read lock as soon as possible. This prevents:

  • read-write-read deadlocks, and
  • read-write livelocks under heavy use,

because the read lock is dropped immediately after the clone.

For more details, see the conversation at #3847 (comment)

Solution

Deadlock prevention:

  • Add a WatchReceiver wrapper that always clones the borrowed watch data
  • Explicitly drop the read lock as quickly as possible

Memory optimisations:

  • Drop the shared Arcs as quickly as possible

Ergonomics:

  • Make read::block more flexible, by accepting any AsRef
  • Make the block method docs consistent

Testing

When testing syncing lightwalletd from a zebrad that is also syncing from peers, I saw pauses of up to 3 minutes. These pauses finished when lightwalletd gave up and exited. The pauses seem to be fixed by commit 3240332.

Review

This is part of an high-priority series of state refactor PRs, because they will cause merge conflicts with other PRs.

Reviewer Checklist

  • Code implements Specs and Designs
  • Tests for Expected Behaviour
  • Tests for Errors

Follow Up Work

@teor2345 teor2345 added C-bug Category: This is a bug P-High 🔥 I-hang A Zebra component stops responding to requests A-state Area: State / database changes labels Mar 14, 2022
@teor2345 teor2345 requested a review from a team as a code owner March 14, 2022 23:13
@teor2345 teor2345 self-assigned this Mar 14, 2022
@teor2345 teor2345 requested review from jvff and conradoplg and removed request for a team and jvff March 14, 2022 23:13
@teor2345 teor2345 changed the title 2. Prevent watch channel deadlocks in the state 2b. Prevent watch channel deadlocks in the state Mar 14, 2022
conradoplg
conradoplg previously approved these changes Mar 15, 2022
Copy link
Collaborator

@conradoplg conradoplg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Base automatically changed from read-only-state-service-impl to main March 15, 2022 19:50
@teor2345 teor2345 force-pushed the fix-state-watch-deadlocks branch from 076ceb5 to c536e0d Compare March 15, 2022 20:00
@teor2345
Copy link
Contributor Author

@Mergifyio refresh

@mergify
Copy link
Contributor

mergify bot commented Mar 15, 2022

refresh

✅ Pull request refreshed

@teor2345 teor2345 changed the title 2b. Prevent watch channel deadlocks in the state 3b. Prevent watch channel deadlocks in the state Mar 15, 2022
@teor2345 teor2345 changed the title 3b. Prevent watch channel deadlocks in the state 3b. fix(state): prevent watch channel deadlocks in the state Mar 15, 2022
@codecov
Copy link

codecov bot commented Mar 15, 2022

Codecov Report

Merging #3870 (50fcb7d) into main (641f488) will increase coverage by 0.09%.
The diff coverage is 92.10%.

❗ Current head 50fcb7d differs from pull request most recent head 3240332. Consider uploading reports for the commit 3240332 to get more accurate results

@@            Coverage Diff             @@
##             main    #3870      +/-   ##
==========================================
+ Coverage   78.68%   78.78%   +0.09%     
==========================================
  Files         296      297       +1     
  Lines       33833    33879      +46     
==========================================
+ Hits        26622    26690      +68     
+ Misses       7211     7189      -22     

@dconnolly dconnolly requested a review from jvff March 15, 2022 22:30
@dconnolly
Copy link
Contributor

Looks ok to me but I kinda would like @jvff to look at it.

@teor2345 teor2345 force-pushed the fix-state-watch-deadlocks branch from c536e0d to 50fcb7d Compare March 16, 2022 00:03
@teor2345 teor2345 changed the title 3b. fix(state): prevent watch channel deadlocks in the state 3. fix(state): prevent watch channel deadlocks in the state Mar 16, 2022
@teor2345 teor2345 force-pushed the fix-state-watch-deadlocks branch from 50fcb7d to 3240332 Compare March 16, 2022 04:03
Copy link
Contributor

@dconnolly dconnolly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comments, since this is High 🔥 let's merge and hopefully @jvff can give a look if they are available

mergify bot added a commit that referenced this pull request Mar 16, 2022
@teor2345 teor2345 added the lightwalletd any work associated with lightwalletd label Mar 16, 2022
@mergify mergify bot merged commit 67b3679 into main Mar 17, 2022
@mergify mergify bot deleted the fix-state-watch-deadlocks branch March 17, 2022 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-state Area: State / database changes C-bug Category: This is a bug I-hang A Zebra component stops responding to requests lightwalletd any work associated with lightwalletd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants