Add a download set. #1041
Conversation
Force-pushed from f072afc to 1ee16c0
Rebased onto main to pick up the state fixes and the improved message handling work.
This is a good refactor - it should make the sync process easier to understand and maintain.
I'm looking forward to seeing the test results. Let me know when you're ready for a review or testing.
Force-pushed from 1ee16c0 to 923b31c
Rebased onto main again and redid the cancellation handling in the
Approved once the clippy error is fixed
Could this PR fix the sync hang in #1181?
Force-pushed from 34c16c4 to ef82204
The code looks good, but I haven't really had a chance to run it.
This PR doesn't contain any tests; what's the plan for testing it?
PR #1193 adds a limited sync test for the first large checkpoint, but it won't find later stalls or errors.
Looks good to me, but I'll wait on merging per others' feedback.
Force-pushed from ef82204 to 8276c5f
Force-pushed from 8276c5f to acc6985
This includes the changes from #1089, in an effort to get the simplified sync logic done in one change. The sync algorithm as designed had some invariants that the different parts were supposed to maintain together, and I think we ran into problems changing some of the parts one at a time. So, this PR tries to clean up the sync algorithm in a coherent way.

I tested it on mainnet and testnet. On testnet it makes progress but is very slow, on account of the bad condition of the testnet.

There are some ancillary changes made along the way while testing, mainly to the span and tracing handling. There's a new debug line added before queuing downloads; it reveals that a big contributor to sync latency while checkpointing on mainnet is the state service (checking that new hashes are not already part of the state). I think that check could be safely omitted from the `extend_tips` step, but not from `obtain_tips`. (That was part of the original motivation for having those be two separate steps.)
I'm seeing a lot of errors and state issues after this PR, but the main branch and #1198 work fine.
Oct 23 16:42:13.201 INFO sync:obtain_tips:download_and_verify{hash=block::Hash("00000060a9b03c2f920c3e6e0b92c5b8a253d035222dd7e6a379a126a9682135")}:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=4000 current_range=(Excluded(Height(0)), Included(Height(4000)))
Oct 23 16:42:13.601 WARN sync: zebrad::components::sync: e=Checkpoint(Duplicate { height: Height(215) })
Oct 23 16:42:13.601 INFO sync: zebrad::components::sync: waiting to restart sync timeout=45s
Oct 23 16:42:58.603 INFO sync: zebrad::components::sync: starting sync, obtaining new tips
Oct 23 16:42:58.603 INFO sync:obtain_tips: zebra_state::util: created block locator tip_height=Height(2) min_locator_height=0 locators=[Height(2), Height(1), Height(0)]
Note that 4000 blocks were verified, but the block locator starts at height 2.
This problem is in the state, not in the sync code in this PR.
We handle request cancellation in two places: before we transition into the AwaitingResponse state, and while we are in AwaitingResponse. We need both, because otherwise, if we had already started processing a request, we wouldn't notice the cancellation until the timeout elapsed. The first is a check that the oneshot is not already canceled. For the second, we wait on a cancellation, either from a timeout or from the tx channel closing.
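A minimal sketch of that two-stage check, using plain tokio primitives rather than Zebra's actual connection state machine; `handle_request` and the `String` response type are made up for illustration:

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

// Hypothetical handler: `tx` is the per-request oneshot; the caller
// cancels the request by dropping the receiving end.
async fn handle_request(mut tx: oneshot::Sender<String>) {
    // 1. Before transitioning into AwaitingResponse: bail out if the
    //    oneshot is already canceled.
    if tx.is_closed() {
        return;
    }

    // Stand-in for actually exchanging messages with a peer.
    let network_response = async {
        tokio::time::sleep(Duration::from_millis(50)).await;
        "block data".to_string()
    };

    // 2. While in AwaitingResponse: wait for the response, a timeout, or
    //    a cancellation (the tx channel closing), whichever comes first.
    let outcome = tokio::select! {
        _ = tx.closed() => None,
        result = timeout(Duration::from_secs(20), network_response) => result.ok(),
    };

    if let Some(rsp) = outcome {
        let _ = tx.send(rsp);
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    tokio::spawn(handle_request(tx));
    println!("{:?}", rx.await);
}
```

The first check is cheap because `oneshot::Sender::is_closed` doesn't await; the second relies on `tx.closed()` resolving as soon as the caller drops the receiver.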
These messages might be unsolicited, or they might be a response to a request we already canceled. So don't fail the whole connection, just drop the message and move on.
This makes two changes relative to the existing download code:
1. It uses a oneshot to attempt to cancel the download task after it has started.
2. It encapsulates the download creation and cancellation logic into a Downloads struct.
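Roughly, the shape is something like the following sketch; the `Downloads` name comes from this PR, but the fields and methods here are simplified stand-ins, not the actual implementation:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

// Hypothetical stand-in for a block hash.
type BlockHash = [u8; 32];

/// Sketch of a `Downloads` struct that owns the cancel handles for every
/// in-flight download task.
#[derive(Default)]
struct Downloads {
    cancel_handles: HashMap<BlockHash, oneshot::Sender<()>>,
}

impl Downloads {
    /// Spawn a download task that can still be canceled after it starts.
    fn download_and_verify(&mut self, hash: BlockHash) {
        let (cancel_tx, cancel_rx) = oneshot::channel::<()>();

        tokio::spawn(async move {
            tokio::select! {
                // Fires when the handle is signalled or dropped.
                _ = cancel_rx => { /* canceled: abandon the download */ }
                _ = fetch_and_verify(hash) => { /* completed normally */ }
            }
        });

        self.cancel_handles.insert(hash, cancel_tx);
    }

    /// Cancel every in-flight download, e.g. when the sync restarts.
    fn cancel_all(&mut self) {
        // Dropping the senders closes each oneshot, waking the matching
        // task's cancellation branch.
        self.cancel_handles.clear();
    }
}

async fn fetch_and_verify(_hash: BlockHash) {
    // Stand-in for the real download-then-verify future.
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}

#[tokio::main]
async fn main() {
    let mut downloads = Downloads::default();
    downloads.download_and_verify([0; 32]);
    downloads.cancel_all();
}
```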
Try to use the better cancellation logic to revert to the previous sync algorithm. As designed, the sync algorithm is supposed to proceed by downloading state prospectively and handle errors by flushing the pipeline and starting over. This hasn't worked well, because we didn't previously cancel tasks properly. Now that we can, try to use something in the spirit of the original sync algorithm.
The hedge middleware implements hedged requests, as described in _The Tail At Scale_. The idea is that we auto-tune our retry logic according to the actual network conditions, pre-emptively retrying requests that exceed some latency percentile. This would hopefully solve the problem that no single fixed timeout is well tuned for both mainnet and testnet.
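As a rough illustration of the underlying idea (not the middleware itself), a hedged request can be hand-rolled like this; `fetch` and the hard-coded percentile stand in for real peer requests and the auto-tuned latency estimate:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Stand-in for a network request with variable latency.
async fn fetch(id: u32) -> String {
    sleep(Duration::from_millis(if id % 2 == 0 { 30 } else { 300 })).await;
    format!("response for {id}")
}

/// Issue the request; if it is still pending once it exceeds the observed
/// p95 latency, issue a second copy and take whichever finishes first.
async fn hedged_fetch(id: u32, p95: Duration) -> String {
    let primary = fetch(id);
    tokio::pin!(primary);

    tokio::select! {
        rsp = &mut primary => rsp,
        _ = sleep(p95) => {
            // The primary request is slower than the 95th percentile:
            // pre-emptively retry and race the two copies.
            tokio::select! {
                rsp = &mut primary => rsp,
                rsp = fetch(id) => rsp,
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // In the real middleware, the percentile is continuously re-estimated
    // from recent request latencies rather than hard-coded.
    let rsp = hedged_fetch(1, Duration::from_millis(100)).await;
    println!("{rsp}");
}
```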
This lets us keep the main loop simple and just write `continue 'sync;` to keep going.
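For reference, the loop shape this enables looks roughly like the following; `obtain_tips`, `extend_tips`, and `has_tips` are illustrative stand-ins rather than the real signatures:

```rust
// Illustrative stand-ins for the real sync steps.
fn obtain_tips() -> Result<(), ()> { Ok(()) }
fn extend_tips() -> Result<(), ()> { Err(()) }
fn has_tips() -> bool { true }

/// Shape of the main sync loop: any error just restarts the whole attempt.
pub fn sync() {
    'sync: loop {
        if obtain_tips().is_err() {
            continue 'sync;
        }
        while has_tips() {
            if extend_tips().is_err() {
                // Flush the pipeline and start over from a fresh set of tips.
                continue 'sync;
            }
        }
    }
}
```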
Using the cancel_handles, we can deduplicate requests. This is important to do, because otherwise when we insert the second cancel handle, we'd drop the first one, cancelling an existing task for no reason.
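In sketch form, the dedup guard is just a lookup against the same hypothetical `cancel_handles` map before inserting:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

pub type BlockHash = [u8; 32];

/// Only insert a new cancel handle if the hash isn't already in flight;
/// clobbering an existing entry would drop (and therefore cancel) the
/// handle for a task that's still running.
pub fn queue_download(
    cancel_handles: &mut HashMap<BlockHash, oneshot::Sender<()>>,
    hash: BlockHash,
) -> bool {
    if cancel_handles.contains_key(&hash) {
        // Duplicate request: skip it rather than cancel the existing task.
        return false;
    }
    let (cancel_tx, _cancel_rx) = oneshot::channel();
    cancel_handles.insert(hash, cancel_tx);
    // ...spawn the download task, giving it `_cancel_rx`...
    true
}
```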
The timeout behavior in zebra-network is an implementation detail, not a feature of the public API. So it shouldn't be mentioned in the doc comments -- if we want timeout behavior, we have to layer it ourselves.
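For example, a caller that wants a timeout can layer one on with tower's own middleware; this sketch assumes the `timeout` and `util` features of the tower crate, and the duration is illustrative, not a value taken from zebra-network:

```rust
use std::time::Duration;
use tower::{service_fn, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // An inner "peer" service that just echoes its request after a delay.
    let inner = service_fn(|req: String| async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        Ok::<_, tower::BoxError>(req)
    });

    // The caller layers on whatever timeout it wants; the inner service
    // makes no timeout promises of its own.
    let svc = ServiceBuilder::new()
        .timeout(Duration::from_secs(20))
        .service(inner);

    let rsp = svc.oneshot("ping".to_string()).await;
    println!("{rsp:?}");
}
```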
The previous debug output printed a message that the chain verifier had received a block. But this provides no additional information compared to printing no message in chain::Verifier and a message in whichever verifier the block was sent to, since the resulting spans indicate where the block was dispatched. This commit also removes the "unexpected high block" detection; this was an artefact of the original sync algorithm failing to handle block advertisements, but we don't have that problem any more, so we can simplify the code by eliminating that logic.
This reveals that there may be contention in access to the state, since this check takes a long time.
There's no reason to return a pre-Buffer'd service (there's no need for internal access to the state service, as there is in zebra-network), and wrapping it internally removes control of the buffer size from the caller.
The original sync algorithm split the sync process into two phases: one that obtained prospective chain tips, and another that attempted to extend those chain tips as far as possible until encountering an error (at which point the prospective state is discarded and the process restarts).

Because a previous implementation of this algorithm didn't properly enforce linkage between segments of the chain while extending tips, sometimes it would get confused and fail to discard responses that did not extend a tip. To mitigate this, a check against the state was added. However, this check can cause stalls while checkpointing, because when a checkpoint is reached we may suddenly need to commit thousands of blocks to the state.

Because the sync algorithm now has a `CheckedTip` structure that ensures that a new segment of hashes actually extends an existing one, we don't need to check against the state while extending a tip, because we no longer get confused while interpreting responses. This change results in significantly smoother progress on mainnet.
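A simplified sketch of that linkage check; the `CheckedTip` name is from this PR, but the fields and the exact rule shown here are illustrative:

```rust
// Hypothetical stand-in for a block hash.
pub type BlockHash = [u8; 32];

/// The idea behind `CheckedTip`: remember both the current tip and the
/// hash we expect the next segment to start from, so linkage can be
/// checked without querying the state.
#[derive(Clone, Copy, Debug)]
pub struct CheckedTip {
    pub tip: BlockHash,
    pub expected_next: BlockHash,
}

/// Accept a new segment of hashes only if it actually extends `tip`;
/// anything else is discarded instead of being misinterpreted.
pub fn extend_tip(tip: CheckedTip, segment: &[BlockHash]) -> Option<CheckedTip> {
    match segment {
        [first, .., second_last, last] if *first == tip.expected_next => Some(CheckedTip {
            tip: *second_last,
            expected_next: *last,
        }),
        _ => None,
    }
}
```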
We already use an actor model for the state service, so we get an ordered sequence of state queries by message-passing. Instead of performing reads in the futures we return, this commit performs them synchronously. This means that all sled access is done from the same task, which (1) might reduce contention and (2) allows us to avoid using sled transactions when writing to the state.
Co-authored-by: Jane Lusby <jane@zfnd.org>
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
Co-authored-by: teor <teor@riseup.net>
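A toy version of that pattern, with an in-memory map standing in for sled and hypothetical request types:

```rust
use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

// Hypothetical request types standing in for the state API.
enum Request {
    AddBlock { height: u32, block: String },
    GetBlock { height: u32, rsp: oneshot::Sender<Option<String>> },
}

// The actor task: because every read and write happens here, on one task,
// operations are naturally ordered and no transactions are needed.
async fn state_service(mut requests: mpsc::Receiver<Request>) {
    let mut blocks: HashMap<u32, String> = HashMap::new(); // stand-in for sled
    while let Some(request) = requests.recv().await {
        match request {
            Request::AddBlock { height, block } => {
                blocks.insert(height, block);
            }
            Request::GetBlock { height, rsp } => {
                // The read happens synchronously inside the actor, not in a
                // future returned to the caller.
                let _ = rsp.send(blocks.get(&height).cloned());
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(32);
    tokio::spawn(state_service(rx));

    let _ = tx
        .send(Request::AddBlock { height: 1, block: "block one".into() })
        .await;

    let (rsp_tx, rsp_rx) = oneshot::channel();
    let _ = tx.send(Request::GetBlock { height: 1, rsp: rsp_tx }).await;
    println!("{:?}", rsp_rx.await);
}
```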
We should error if we notice that we're attempting to download the same blocks multiple times, because that indicates that peers reported bad information to us, or we got confused trying to interpret their responses.
Force-pushed from 7b8f8f7 to a689832
Rebased again because Dependabot caused a merge conflict. The clippy errors are unrelated to this PR.
WIP, cannot be tested because state changes broke the sync component. Based on top of #1040.