Skip to content
This repository has been archived by the owner on Oct 28, 2021. It is now read-only.

Fix race condition which permanently pauses sync #5865

Merged
merged 3 commits into from
Dec 12, 2019
Merged

Conversation

halfalicious
Copy link
Contributor

Fix #5312

The race condition occurs when the block queue is full so syncing is paused, and the function which detects when the block queue has room (BlockQueue::drain) does the first of the two-part check when a block queue verifier thread has temporarily removed block data from the verification queues.
As such, the "room available" detection fails and drain removes more blocks, which means that the block queue enters a permanently not full state and syncing is paused indefinitely.

This is fixed by also performing the "room available" detection check in the body of the block queue verifier threads.

@codecov-io
Copy link

codecov-io commented Dec 6, 2019

Codecov Report

Merging #5865 into master will decrease coverage by 0.07%.
The diff coverage is 25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5865      +/-   ##
==========================================
- Coverage   64.05%   63.97%   -0.08%     
==========================================
  Files         364      364              
  Lines       30994    30992       -2     
  Branches     3435     3435              
==========================================
- Hits        19852    19826      -26     
- Misses       9917     9933      +16     
- Partials     1225     1233       +8

The problem is that the knownFull -> !knownFull check in
BlockQueue::drain() doesn't catch cases where the block queue is
modified on a verifier thread - these cases are when a bad block is
detected and the block and all of its children are removed from
m_verifying, or when a block (with bytes) is moved from m_unverified to
m_verifying (without bytes).

The fix is to always fire the onRoomAvailable handler (renamed to onBlocksDrained) at the end of
doneDrain, and in the handler to only resume syncing if it's paused and
the block queue isn't full.
Copy link
Member

@gumb0 gumb0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, did you try running the sync?

@halfalicious
Copy link
Contributor Author

halfalicious commented Dec 11, 2019

@gumb0: Yes kicked it off last night, will let it run for a bit before merging.

@halfalicious
Copy link
Contributor Author

Ran sync for ~12 hours and sync'd from genesis to block 1.4M without issues.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync gets stuck
3 participants