Fix ancient blocks sync #9531

ascjones · 2018-09-11T17:23:29Z

Fixes #9300, fixes #7008 and fixes #9306. Rel #9407.

Edit: there were several issues here

TLDR;

When the ancient blocks queue fills up, the download resets; multiple stale responses then cause download rounds that import 0 blocks, and the sync retracts to block #0.
After retracting to #0, it can download a non-canonical set of blocks.
It can get stuck on the head of a non-canonical chain, when it should retract to find a common block and resume the sync.

Retraction after queue full

BlockDownloader is in State::Blocks, requesting headers for subchains.
The ancient blocks queue gets full and the old blocks download is restarted.
BlockDownloader is now in State::ChainHead. An in-flight header response from the abandoned round arrives and resets the subchain headers to consecutive headers (see the todo comment for a clue!). These headers may be orphaned from their parents, which were discarded after the queue became full above.
We then get multiple sync rounds in quick succession that import 0 blocks because of UnknownParent, causing the sync to quickly retract to 0. Stale body responses were also causing this.

The fix is to expire all pending requests when resetting the sync.
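
As a rough illustration of expiring pending requests on reset, here is a minimal sketch. It is not the code from this PR: `Downloader`, `request_headers`, `on_headers` and the integer ids are hypothetical stand-ins, and block numbers stand in for headers.

```rust
use std::collections::HashMap;

// Hypothetical identifier types for this sketch (not Parity's real types).
type PeerId = usize;
type RequestId = u64;

/// Minimal stand-in for a downloader that tracks in-flight requests.
#[derive(Default)]
struct Downloader {
    /// Monotonic counter; deliberately NOT reset, so old ids never collide with new ones.
    next_request_id: RequestId,
    /// The request we still expect an answer for, per peer.
    pending: HashMap<PeerId, RequestId>,
    /// Block numbers accepted in the current round.
    accepted: Vec<u64>,
}

impl Downloader {
    /// Record a new header request to `peer` and return its id.
    fn request_headers(&mut self, peer: PeerId) -> RequestId {
        let id = self.next_request_id;
        self.next_request_id += 1;
        self.pending.insert(peer, id);
        id
    }

    /// Resetting the sync expires every pending request along with the round state.
    fn reset(&mut self) {
        self.pending.clear();
        self.accepted.clear();
    }

    /// Accept a response only if it answers a request that is still pending.
    /// Returns `false` for stale responses, which are ignored.
    fn on_headers(&mut self, peer: PeerId, id: RequestId, headers: &[u64]) -> bool {
        if self.pending.get(&peer) != Some(&id) {
            return false;
        }
        self.pending.remove(&peer);
        self.accepted.extend_from_slice(headers);
        true
    }
}
```

In this sketch, a response to a request issued before `reset()` is rejected by `on_headers`, so it cannot overwrite the state of the new round.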

Getting stuck on a non canonical block

Some peers have bad blocks in the range #1..#57, probably propagated by the above bug. Also, around the fork block #1920000 it was getting stuck on some non-canonical blocks.
The hash gets stuck in the list of heads, so the list never becomes empty and the round never completes.

The fix is to reset the sync if we receive consecutive header responses for a requested hash with no useful headers that advance the subchain.
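
The counter-based reset can be sketched as follows. This is a simplified model, not the PR's implementation: `SubchainTracker`, `on_header_response` and the threshold `MAX_USELESS_HEADERS = 3` are assumptions made for illustration (the description only says "consecutive"), and `DownloadAction` here only borrows the name mentioned in the commit list.

```rust
/// Outcome of processing one header response (shape assumed for this sketch).
#[derive(Debug, PartialEq)]
enum DownloadAction {
    /// Keep going with the current round.
    None,
    /// Abandon the round and restart, so a common block can be found again.
    Reset,
}

/// Assumed threshold for consecutive useless responses.
const MAX_USELESS_HEADERS: usize = 3;

#[derive(Default)]
struct SubchainTracker {
    /// Consecutive responses that did not advance the subchain.
    useless_count: usize,
}

impl SubchainTracker {
    /// `advanced` is true when the response contained at least one header
    /// that extends the requested subchain.
    fn on_header_response(&mut self, advanced: bool) -> DownloadAction {
        if advanced {
            // Progress was made, so the streak of useless responses ends.
            self.useless_count = 0;
            return DownloadAction::None;
        }
        self.useless_count += 1;
        if self.useless_count >= MAX_USELESS_HEADERS {
            // Start counting afresh for the next round.
            self.useless_count = 0;
            DownloadAction::Reset
        } else {
            DownloadAction::None
        }
    }
}
```

Because any useful response clears the counter, only an unbroken run of useless responses triggers `DownloadAction::Reset`; an occasional empty reply from a slow peer does not.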

Further improvements

The ancient blocks queue still fills up fairly frequently, resulting in many downloaded blocks being discarded. I will address this in a future PR, by pausing the download of old blocks similar to the normal sync. For now though this no longer triggers the retraction, and the ancient blocks sync now completes.

I also have some further improvements/refactorings for this code stashed away for future PRs, but I am trying to keep this PR to the minimum that will actually get the ancient block sync working again.

This reverts commit 5f38aa8.

andresilva · 2018-09-12T10:55:59Z

ethcore/sync/src/block_sync.rs

@@ -515,6 +528,7 @@ impl BlockDownloader {
 				},
 				Err(BlockImportError(BlockImportErrorKind::Queue(QueueErrorKind::Full(limit)), _)) => {
 					debug!(target: "sync", "Block import queue full ({}), restarting sync", limit);
+					bad = true;


This is done to trigger resetting the sync, right? Do you think we should return a different error type other than BlockDownloaderImportError::Invalid? (And also handle it in ChainSync::collect_blocks).

Yes, I agree, and I had previously done that. However, my idea with this initial PR was to do the minimum amount to fix this bug.

However, it looks like this fix doesn't entirely work, so I will add that back while waiting for another mainnet sync...
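
To make the suggestion above concrete, here is a hedged sketch of what a distinct error variant lets the caller do. The names `CollectError`, `Reaction` and `react` are hypothetical, and the policy of penalizing the peer only for invalid blocks is an assumption for illustration, not something stated in this thread.

```rust
/// Hypothetical error type separating "bad block" from "queue full".
#[derive(Debug, PartialEq)]
enum CollectError {
    /// A block failed validation.
    Invalid,
    /// The import queue hit its limit (the limit is carried for logging).
    QueueFull(usize),
}

/// What the caller decides to do after collecting blocks (assumed shape).
#[derive(Debug, PartialEq)]
struct Reaction {
    reset_sync: bool,
    penalize_peer: bool,
}

/// Map the result of a hypothetical `collect_blocks` call to a reaction.
/// `Ok(n)` means `n` blocks were imported.
fn react(result: Result<usize, CollectError>) -> Reaction {
    match result {
        Ok(_) => Reaction { reset_sync: false, penalize_peer: false },
        // A full queue is a local condition: restart the download, blame nobody.
        Err(CollectError::QueueFull(_)) => Reaction { reset_sync: true, penalize_peer: false },
        // An invalid block is attributable to the sender in this sketch.
        Err(CollectError::Invalid) => Reaction { reset_sync: true, penalize_peer: true },
    }
}
```

With a single `Invalid` variant, both cases would have to share one code path; separate variants let the caller reset in both cases while treating the peer differently.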

ascjones · 2018-09-12T12:15:56Z

Unfortunately this does not appear to fix the issue when syncing mainnet. I'm currently investigating.

This is a problem on mainnet where multiple stale peer requests will force many rounds to complete quickly, forcing the retraction.

ascjones · 2018-10-02T17:28:09Z

@jimpo I have implemented your suggestions with the useless counter and the reset logic. I have run it 3 times against mainnet and it manages to get past the corrupt blocks. I also ran a couple of full syncs against the POA network and it works okay.

jimpo

LGTM

andresilva

LGTM

andresilva · 2018-10-09T08:33:52Z

@ascjones Are the changes to the tests submodule expected? (maybe you updated the submodule unknowingly?)

ascjones · 2018-10-09T08:35:09Z

Yes just noticed that too, I must've done that by accident. Fixing.

ngotchac

LGTM, minor grumble for typos in a comment. Good job!

ngotchac · 2018-10-09T12:38:14Z

ethcore/sync/src/block_sync.rs

@@ -491,8 +513,9 @@ impl BlockDownloader {
 	}

 	/// Checks if there are blocks fully downloaded that can be imported into the blockchain and does the import.
-	pub fn collect_blocks(&mut self, io: &mut SyncIo, allow_out_of_order: bool) -> Result<(), BlockDownloaderImportError> {
-		let mut bad = false;
+	/// Returns DownloadAction::Reset if it is imported all the the blocks it can and all downloading peers should be reset


Typos in the sentence.

* Log block set in block_sync for easier debugging
* logging macros
* Match no args in sync logging macros
* Add QueueFull error
* Only allow importing headers if the first matches requested
* WIP
* Test for chain head gaps and log
* Calc distance even with 2 heads
* Revert previous commits, preparing simple fix
  This reverts commit 5f38aa8.
* Reject headers with no gaps when ChainHead
* Reset block sync download when queue full
* Simplify check for subchain heads
* Add comment to explain subchain heads filter
* Fix is_subchain_heads check and comment
* Prevent premature round completion after restart
  This is a problem on mainnet where multiple stale peer requests will force many rounds to complete quickly, forcing the retraction.
* Reset stale old blocks request after queue full
* Revert "Reject headers with no gaps when ChainHead"
  This reverts commit 0eb8655.
* Add BlockSet to BlockDownloader logging
  Currently it is difficult to debug this because there are two instances, one for OldBlocks and one for NewBlocks. This adds the BlockSet to all log messages for easy log filtering.
* Reset OldBlocks download from last enqueued
  Previously when the ancient block queue was full it would restart the download from the last imported block, so the ones still in the queue would be redownloaded. Keeping the existing downloader instance and just resetting it will start again from the last enqueued block.
* Ignore expired Body and Receipt requests
* Log when ancient block download being restarted
* Only request old blocks from peers with >= difficulty
  #9226 might be too permissive and causing the behaviour of the retraction soon after the fork block. With this change the peer difficulty has to be greater than or equal to our syncing difficulty, so should still fix #9225
* Some logging and clear stalled blocks head
* Revert "Some logging and clear stalled blocks head"
  This reverts commit 757641d.
* Reset stalled header if useless more than once
* Store useless headers in HashSet
* Add sync target to logging macro
* Don't disable useless peer and fix log macro
* Clear useless headers on reset and comments
* Use custom error for collecting blocks
  Previously we reused BlockImportError, however only the Invalid case and this made little sense with the QueueFull error.
* Remove blank line
* Test for reset sync after consecutive useless headers
* Don't reset after consecutive headers when chain head
* Delete commented out imports
* Return DownloadAction from collect_blocks instead of error
* Don't reset after round complete, was causing test hangs
* Add comment explaining reset after useless
* Replace HashSet with counter for useless headers
* Refactor sync reset on bad block/queue full
* Add missing target for log message
* Fix compiler errors and test after merge
* ethcore: revert ethereum tests submodule update

* produce portable binaries (#9725)
* HF in POA Core (2018-10-22) (#9724)
  poanetwork/poa-chain-spec#87
* Use static call and apparent value transfer for block reward contract code (#9603)
* Verify block syncing responses against requests (#9670)
  * sync: Validate received BlockHeaders packets against stored request.
  * sync: Validate received BlockBodies and BlockReceipts.
  * sync: Fix broken tests.
  * sync: Unit tests for BlockDownloader::import_headers.
  * sync: Unit tests for import_{bodies,receipts}.
  * tests: Add missing method doc.
* Fix ancient blocks sync (#9531)
  * Log block set in block_sync for easier debugging
  * logging macros
  * Match no args in sync logging macros
  * Add QueueFull error
  * Only allow importing headers if the first matches requested
  * WIP
  * Test for chain head gaps and log
  * Calc distance even with 2 heads
  * Revert previous commits, preparing simple fix
    This reverts commit 5f38aa8.
  * Reject headers with no gaps when ChainHead
  * Reset block sync download when queue full
  * Simplify check for subchain heads
  * Add comment to explain subchain heads filter
  * Fix is_subchain_heads check and comment
  * Prevent premature round completion after restart
    This is a problem on mainnet where multiple stale peer requests will force many rounds to complete quickly, forcing the retraction.
  * Reset stale old blocks request after queue full
  * Revert "Reject headers with no gaps when ChainHead"
    This reverts commit 0eb8655.
  * Add BlockSet to BlockDownloader logging
    Currently it is difficult to debug this because there are two instances, one for OldBlocks and one for NewBlocks. This adds the BlockSet to all log messages for easy log filtering.
  * Reset OldBlocks download from last enqueued
    Previously when the ancient block queue was full it would restart the download from the last imported block, so the ones still in the queue would be redownloaded. Keeping the existing downloader instance and just resetting it will start again from the last enqueued block.
  * Ignore expired Body and Receipt requests
  * Log when ancient block download being restarted
  * Only request old blocks from peers with >= difficulty
    #9226 might be too permissive and causing the behaviour of the retraction soon after the fork block. With this change the peer difficulty has to be greater than or equal to our syncing difficulty, so should still fix #9225
  * Some logging and clear stalled blocks head
  * Revert "Some logging and clear stalled blocks head"
    This reverts commit 757641d.
  * Reset stalled header if useless more than once
  * Store useless headers in HashSet
  * Add sync target to logging macro
  * Don't disable useless peer and fix log macro
  * Clear useless headers on reset and comments
  * Use custom error for collecting blocks
    Previously we reused BlockImportError, however only the Invalid case and this made little sense with the QueueFull error.
  * Remove blank line
  * Test for reset sync after consecutive useless headers
  * Don't reset after consecutive headers when chain head
  * Delete commented out imports
  * Return DownloadAction from collect_blocks instead of error
  * Don't reset after round complete, was causing test hangs
  * Add comment explaining reset after useless
  * Replace HashSet with counter for useless headers
  * Refactor sync reset on bad block/queue full
  * Add missing target for log message
  * Fix compiler errors and test after merge
  * ethcore: revert ethereum tests submodule update
* Add hardcoded headers (#9730)
  * add foundation hardcoded header #6486017
  * add ropsten hardcoded headers #4202497
  * add kovan hardcoded headers #9023489
* gitlab ci: releasable_branches: change variables condition to schedule (#9729)

ascjones added 14 commits September 7, 2018 17:44

Log block set in block_sync for easier debugging

5f38aa8

logging macros

5f219a4

Match no args in sync logging macros

8d7f742

Add QueueFull error

8f379ee

Only allow importing headers if the first matches requested

f5d244e

WIP

f77c503

Test for chain head gaps and log

0f3adb3

Calc distance even with 2 heads

ad7bb2e

Revert previous commits, preparing simple fix

56747ef

This reverts commit 5f38aa8.

Reject headers with no gaps when ChainHead

0eb8655

Reset block sync download when queue full

72783a2

Simplify check for subchain heads

6f7c3c2

Add comment to explain subchain heads filter

b4033f3

Fix is_subchain_heads check and comment

7d1bc76

ascjones added A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. M4-core ⛓ Core client code / Rust. labels Sep 11, 2018

ascjones changed the title ~~Aj/fix ancient blocks sync~~ Fix ancient blocks sync Sep 11, 2018

ascjones mentioned this pull request Sep 11, 2018

ethcore: fix ancient block sync #9407

Closed

ascjones added the B0-patchthis 🕷 label Sep 11, 2018

ascjones added this to the 2.1 milestone Sep 11, 2018

This was referenced Sep 12, 2018

Backports for 2.0.5 stable #9519

Merged

Backports for 2.1.0 beta #9518

Merged

andresilva reviewed Sep 12, 2018

5chdn modified the milestones: 2.1, 2.2 Sep 12, 2018

Prevent premature round completion after restart

9588e5a

This is a problem on mainnet where multiple stale peer requests will force many rounds to complete quickly, forcing the retraction.

5chdn added the B9-blocker 🚧 This pull request blocks the next release from happening. Use only in extreme cases. label Sep 13, 2018

Reset stale old blocks request after queue full

9b31755

5chdn removed the B9-blocker 🚧 This pull request blocks the next release from happening. Use only in extreme cases. label Sep 13, 2018

ascjones added 2 commits October 2, 2018 17:55

Refactor sync reset on bad block/queue full

b8418e4

Add missing target for log message

0ec672b

jimpo approved these changes Oct 2, 2018

Merge branch 'master' into aj/fix-ancient-blocks-sync

3bfc2ae

ascjones removed the A0-pleasereview 🤓 Pull request needs code review. label Oct 3, 2018

Fix compiler errors and test after merge

e72d150

ascjones added the A0-pleasereview 🤓 Pull request needs code review. label Oct 8, 2018

andresilva approved these changes Oct 9, 2018

ethcore: revert ethereum tests submodule update

d77ac44

ascjones added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Oct 9, 2018

ngotchac approved these changes Oct 9, 2018

5chdn merged commit 4b6ebcb into master Oct 9, 2018

5chdn deleted the aj/fix-ancient-blocks-sync branch October 9, 2018 13:31

dvdplm added a commit that referenced this pull request Oct 9, 2018

Merge remote-tracking branch 'origin/master' into dp/chore/update-com…

7d12a51

…mon-deps * origin/master: Schedule nightly builds (#9717) Fix ancient blocks sync (#9531) CI: Skip docs job for nightly (#9693)

5chdn added the B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. label Oct 10, 2018

andresilva mentioned this pull request Oct 10, 2018

[beta] More backports for 2.1.2 #9733

Merged

7 tasks

This was referenced Oct 15, 2018

sync: retry different peer after empty subchain heads response #9753

Merged

sync: prevent ancient block import queue becoming full #9754

Closed

phahulin mentioned this pull request Nov 1, 2018

After upgrade to 1.11.8 in PoA-network validators miss blocks #9323

Closed

ascjones mentioned this pull request Nov 2, 2018

Recover from stuck ancient blocks without syncing from scratch #9859

Closed

grbIzl mentioned this pull request Jul 8, 2019

Treat blocks, that already downloaded, as synced #10864

Closed

phahulin mentioned this pull request Oct 9, 2019

Resyncing ropsten always ends up on a wrong chain #11147

Closed

grbIzl mentioned this pull request Nov 15, 2019

Treat blocks, that already downloaded, as synced #11264

Merged
