Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Fix ancient blocks sync #9531

Merged
merged 47 commits into from
Oct 9, 2018
Merged

Fix ancient blocks sync #9531

merged 47 commits into from
Oct 9, 2018

Conversation

ascjones
Copy link
Contributor

@ascjones ascjones commented Sep 11, 2018

Fixes #9300, fixes #7008 and fixes #9306. Rel #9407.

Edit: there were several issues here

TLDR;

  • Ancient blocks queue full, download resets, multiple stale responses cause download rounds with 0 blocks imported, retracting to block #0
  • After retracting to #0, can download a non canonical set of blocks
  • It can get stuck on the head of a non canonical chain, when it should retract to find a common block and resume the sync.

Retraction after queue full

  1. BlockDownloader in State::Blocks requesting headers for subchains:
  2. Ancient blocks queue gets full and the old blocks download is restarted
  3. BlockDownloader in State::ChainHead. In flight header response from abandoned round arrives, and resets subchain headers to consecutive headers (see the todo comment for a clue!). These headers may be orphaned from their parents which were discarded after the queue became full above.
  4. Then we get multiple sync rounds in quick succession which import 0 blocks because of UnknownParent after, causing the sync to quickly retract to 0. This was also being caused by stale body responses.

The fix is to expire all pending requests when resetting the sync.

Getting stuck on a non canonical block

  1. Some peers have bad blocks in the range #1..#57, probably propogated by the above bug. Also around the fork block #1920000 it was getting stuck on some non canon blocks.
  2. The hash gets stuck in the list of heads so it never gets empty and the round never completes.

The fix is to reset the sync if we receive consecutive header responses for a requesetd hash with no useful headers that advance the subchain.

Further improvements

The ancient blocks queue still fills up fairly frequently, resulting in many downloaded blocks being discarded. I will address this in a future PR, by pausing the download of old blocks similar to the normal sync. For now though this no longer triggers the retraction, and the ancient blocks sync now completes.

I also have some further improvements/refactorings for this code stashed away for future PRs. But trying to keep this PR to the minimum that will actually get the ancient block sync working again.

@ascjones ascjones added A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. M4-core ⛓ Core client code / Rust. labels Sep 11, 2018
@ascjones ascjones changed the title Aj/fix ancient blocks sync Fix ancient blocks sync Sep 11, 2018
@ascjones ascjones added this to the 2.1 milestone Sep 11, 2018
This was referenced Sep 12, 2018
@@ -515,6 +528,7 @@ impl BlockDownloader {
},
Err(BlockImportError(BlockImportErrorKind::Queue(QueueErrorKind::Full(limit)), _)) => {
debug!(target: "sync", "Block import queue full ({}), restarting sync", limit);
bad = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done to trigger resetting the sync right? Do you think we should return a different error type other than BlockDownloaderImportError::Invalid? (And also handle it in ChainSync::collect_blocks).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree and I had previously done that. However my idea with this initial PR was to do the minimum amount to fix this bug.

However looks like this fix doesn't work entirely so I will add that back while waiting for another mainnet sync...

@5chdn 5chdn modified the milestones: 2.1, 2.2 Sep 12, 2018
@ascjones
Copy link
Contributor Author

Unfortunately this does not appear to fix the issue when syncing mainnet. I'm currently investigating.

This is a problem on mainnet where multiple stale peer requests will
force many rounds to complete quickly, forcing the retraction.
@5chdn 5chdn added the B9-blocker 🚧 This pull request blocks the next release from happening. Use only in extreme cases. label Sep 13, 2018
@5chdn 5chdn removed the B9-blocker 🚧 This pull request blocks the next release from happening. Use only in extreme cases. label Sep 13, 2018
@ascjones
Copy link
Contributor Author

ascjones commented Oct 2, 2018

@jimpo have implemented your suggestions with the useless counter and the reset logic. Have run it 3 times against mainnet and it manages to get past the corrupt blocks. Also ran a couple of full syncs against POA network and it works okay.

Copy link
Contributor

@jimpo jimpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ascjones ascjones removed the A0-pleasereview 🤓 Pull request needs code review. label Oct 3, 2018
@ascjones ascjones added the A0-pleasereview 🤓 Pull request needs code review. label Oct 8, 2018
Copy link
Contributor

@andresilva andresilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andresilva
Copy link
Contributor

@ascjones Are the changes to the tests submodule expected? (maybe you updated the submodule unknowingly?)

@ascjones
Copy link
Contributor Author

ascjones commented Oct 9, 2018

Yes just noticed that too, I must've done that by accident. Fixing.

@ascjones ascjones added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Oct 9, 2018
Copy link
Contributor

@ngotchac ngotchac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor grumble for typos in a comment. Good job!

@@ -491,8 +513,9 @@ impl BlockDownloader {
}

/// Checks if there are blocks fully downloaded that can be imported into the blockchain and does the import.
pub fn collect_blocks(&mut self, io: &mut SyncIo, allow_out_of_order: bool) -> Result<(), BlockDownloaderImportError> {
let mut bad = false;
/// Returns DownloadAction::Reset if it is imported all the the blocks it can and all downloading peers should be reset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos in the sentence.

@5chdn 5chdn merged commit 4b6ebcb into master Oct 9, 2018
@5chdn 5chdn deleted the aj/fix-ancient-blocks-sync branch October 9, 2018 13:31
dvdplm added a commit that referenced this pull request Oct 9, 2018
…mon-deps

* origin/master:
  Schedule nightly builds (#9717)
  Fix ancient blocks sync (#9531)
  CI: Skip docs job for nightly (#9693)
@5chdn 5chdn added the B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. label Oct 10, 2018
andresilva pushed a commit that referenced this pull request Oct 10, 2018
* Log block set in block_sync for easier debugging

* logging macros

* Match no args in sync logging macros

* Add QueueFull error

* Only allow importing headers if the first matches requested

* WIP

* Test for chain head gaps and log

* Calc distance even with 2 heads

* Revert previous commits, preparing simple fix

This reverts commit 5f38aa8.

* Reject headers with no gaps when ChainHead

* Reset block sync download when queue full

* Simplify check for subchain heads

* Add comment to explain subchain heads filter

* Fix is_subchain_heads check and comment

* Prevent premature round completion after restart

This is a problem on mainnet where multiple stale peer requests will
force many rounds to complete quickly, forcing the retraction.

* Reset stale old blocks request after queue full

* Revert "Reject headers with no gaps when ChainHead"

This reverts commit 0eb8655.

* Add BlockSet to BlockDownloader logging

Currently it is difficult to debug this because there are two instances,
one for OldBlocks and one for NewBlocks. This adds the BlockSet to all
log messages for easy log filtering.

* Reset OldBlocks download from last enqueued

Previously when the ancient block queue was full it would restart the
download from the last imported block, so the ones still in the queue would be
redownloaded. Keeping the existing downloader instance and just
resetting it will start again from the last enqueued block.:wq

* Ignore expired Body and Receipt requests

* Log when ancient block download being restarted

* Only request old blocks from peers with >= difficulty

#9226 might be too
permissive and causing the behaviour of the retraction soon after the
fork block. With this change the peer difficulty has to be greater than
or euqal to our syncing difficulty, so should still fix
#9225

* Some logging and clear stalled blocks head

* Revert "Some logging and clear stalled blocks head"

This reverts commit 757641d.

* Reset stalled header if useless more than once

* Store useless headers in HashSet

* Add sync target to logging macro

* Don't disable useless peer and fix log macro

* Clear useless headers on reset and comments

* Use custom error for collecting blocks

Previously we resued BlockImportError, however only the Invalid case and
this made little sense with the QueueFull error.

* Remove blank line

* Test for reset sync after consecutive useless headers

* Don't reset after consecutive headers when chain head

* Delete commented out imports

* Return DownloadAction from collect_blocks instead of error

* Don't reset after round complete, was causing test hangs

* Add comment explaining reset after useless

* Replace HashSet with counter for useless headers

* Refactor sync reset on bad block/queue full

* Add missing target for log message

* Fix compiler errors and test after merge

* ethcore: revert ethereum tests submodule update
@andresilva andresilva mentioned this pull request Oct 10, 2018
7 tasks
andresilva pushed a commit that referenced this pull request Oct 10, 2018
* Log block set in block_sync for easier debugging

* logging macros

* Match no args in sync logging macros

* Add QueueFull error

* Only allow importing headers if the first matches requested

* WIP

* Test for chain head gaps and log

* Calc distance even with 2 heads

* Revert previous commits, preparing simple fix

This reverts commit 5f38aa8.

* Reject headers with no gaps when ChainHead

* Reset block sync download when queue full

* Simplify check for subchain heads

* Add comment to explain subchain heads filter

* Fix is_subchain_heads check and comment

* Prevent premature round completion after restart

This is a problem on mainnet where multiple stale peer requests will
force many rounds to complete quickly, forcing the retraction.

* Reset stale old blocks request after queue full

* Revert "Reject headers with no gaps when ChainHead"

This reverts commit 0eb8655.

* Add BlockSet to BlockDownloader logging

Currently it is difficult to debug this because there are two instances,
one for OldBlocks and one for NewBlocks. This adds the BlockSet to all
log messages for easy log filtering.

* Reset OldBlocks download from last enqueued

Previously when the ancient block queue was full it would restart the
download from the last imported block, so the ones still in the queue would be
redownloaded. Keeping the existing downloader instance and just
resetting it will start again from the last enqueued block.:wq

* Ignore expired Body and Receipt requests

* Log when ancient block download being restarted

* Only request old blocks from peers with >= difficulty

#9226 might be too
permissive and causing the behaviour of the retraction soon after the
fork block. With this change the peer difficulty has to be greater than
or euqal to our syncing difficulty, so should still fix
#9225

* Some logging and clear stalled blocks head

* Revert "Some logging and clear stalled blocks head"

This reverts commit 757641d.

* Reset stalled header if useless more than once

* Store useless headers in HashSet

* Add sync target to logging macro

* Don't disable useless peer and fix log macro

* Clear useless headers on reset and comments

* Use custom error for collecting blocks

Previously we resued BlockImportError, however only the Invalid case and
this made little sense with the QueueFull error.

* Remove blank line

* Test for reset sync after consecutive useless headers

* Don't reset after consecutive headers when chain head

* Delete commented out imports

* Return DownloadAction from collect_blocks instead of error

* Don't reset after round complete, was causing test hangs

* Add comment explaining reset after useless

* Replace HashSet with counter for useless headers

* Refactor sync reset on bad block/queue full

* Add missing target for log message

* Fix compiler errors and test after merge

* ethcore: revert ethereum tests submodule update
5chdn pushed a commit that referenced this pull request Oct 10, 2018
* produce portable binaries (#9725)

* HF in POA Core (2018-10-22) (#9724)

poanetwork/poa-chain-spec#87

* Use static call and apparent value transfer for block reward contract code (#9603)

* Verify block syncing responses against requests (#9670)

* sync: Validate received BlockHeaders packets against stored request.

* sync: Validate received BlockBodies and BlockReceipts.

* sync: Fix broken tests.

* sync: Unit tests for BlockDownloader::import_headers.

* sync: Unit tests for import_{bodies,receipts}.

* tests: Add missing method doc.

* Fix ancient blocks sync (#9531)

* Log block set in block_sync for easier debugging

* logging macros

* Match no args in sync logging macros

* Add QueueFull error

* Only allow importing headers if the first matches requested

* WIP

* Test for chain head gaps and log

* Calc distance even with 2 heads

* Revert previous commits, preparing simple fix

This reverts commit 5f38aa8.

* Reject headers with no gaps when ChainHead

* Reset block sync download when queue full

* Simplify check for subchain heads

* Add comment to explain subchain heads filter

* Fix is_subchain_heads check and comment

* Prevent premature round completion after restart

This is a problem on mainnet where multiple stale peer requests will
force many rounds to complete quickly, forcing the retraction.

* Reset stale old blocks request after queue full

* Revert "Reject headers with no gaps when ChainHead"

This reverts commit 0eb8655.

* Add BlockSet to BlockDownloader logging

Currently it is difficult to debug this because there are two instances,
one for OldBlocks and one for NewBlocks. This adds the BlockSet to all
log messages for easy log filtering.

* Reset OldBlocks download from last enqueued

Previously when the ancient block queue was full it would restart the
download from the last imported block, so the ones still in the queue would be
redownloaded. Keeping the existing downloader instance and just
resetting it will start again from the last enqueued block.:wq

* Ignore expired Body and Receipt requests

* Log when ancient block download being restarted

* Only request old blocks from peers with >= difficulty

#9226 might be too
permissive and causing the behaviour of the retraction soon after the
fork block. With this change the peer difficulty has to be greater than
or euqal to our syncing difficulty, so should still fix
#9225

* Some logging and clear stalled blocks head

* Revert "Some logging and clear stalled blocks head"

This reverts commit 757641d.

* Reset stalled header if useless more than once

* Store useless headers in HashSet

* Add sync target to logging macro

* Don't disable useless peer and fix log macro

* Clear useless headers on reset and comments

* Use custom error for collecting blocks

Previously we resued BlockImportError, however only the Invalid case and
this made little sense with the QueueFull error.

* Remove blank line

* Test for reset sync after consecutive useless headers

* Don't reset after consecutive headers when chain head

* Delete commented out imports

* Return DownloadAction from collect_blocks instead of error

* Don't reset after round complete, was causing test hangs

* Add comment explaining reset after useless

* Replace HashSet with counter for useless headers

* Refactor sync reset on bad block/queue full

* Add missing target for log message

* Fix compiler errors and test after merge

* ethcore: revert ethereum tests submodule update

* Add hardcoded headers (#9730)

* add foundation hardcoded header #6486017

* add ropsten hardcoded headers #4202497

* add kovan hardcoded headers #9023489

* gitlab ci: releasable_branches: change variables condition to schedule (#9729)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A8-looksgood 🦄 Pull request is reviewed well. B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. M4-core ⛓ Core client code / Rust.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid old blocks Background verifier sometimes goes back Ancient block sync stalls
6 participants