Add a download set. #1041

Merged
merged 18 commits into main from download-set on Oct 26, 2020

Conversation

hdevalence
Contributor

WIP, cannot be tested because state changes broke the sync component. Based on top of #1040.

@hdevalence
Contributor Author

Rebased onto main to pick up the state fixes and the improved message handling work.

Contributor

@teor2345 teor2345 left a comment

This is a good refactor - it should make the sync process easier to understand and maintain.

I'm looking forward to seeing the test results. Let me know when you're ready for a review or testing.

@hdevalence hdevalence marked this pull request as ready for review September 22, 2020 18:41
@hdevalence
Contributor Author

Rebased onto main again and redid the cancellation handling in the Connection state machine to cancel pending requests (i.e., handle request cancellations that happen after the request arrives at the connection).

@teor2345 teor2345 mentioned this pull request Sep 24, 2020
Contributor

@dconnolly dconnolly left a comment

Approved once the clippy error is fixed

@teor2345 teor2345 added this to the Sync and Network milestone Sep 29, 2020
@dconnolly dconnolly mentioned this pull request Oct 1, 2020
@teor2345
Contributor

Could this PR fix the sync hang in #1181?

Contributor

@teor2345 teor2345 left a comment

The code looks good, but I haven't really had a chance to run it.

This PR doesn't contain any tests; what's the plan for testing it?

@teor2345
Contributor

PR #1193 adds a limited sync test for the first large checkpoint, but it won't find later stalls or errors.

dconnolly
dconnolly previously approved these changes Oct 22, 2020
Contributor

@dconnolly dconnolly left a comment

Looks good to me but I'll wait on merging per others' feedback

@hdevalence
Contributor Author

This includes the changes from #1089, in an effort to get the simplified sync logic done in one change. The sync algorithm as designed had some invariants that the different parts were supposed to maintain together, and I think we ran into problems changing some of the parts one at a time. So, this PR tries to clean up the sync algorithm in a coherent way. I tested it on mainnet and testnet. On testnet it makes progress but is very slow, on account of the bad condition of the testnet.

There are some ancillary changes made along the way while testing, mainly to the span and tracing handling. A new debug line added before queuing downloads reveals that a big contributor to sync latency while checkpointing on mainnet is the state service (checking that new hashes are not already part of the state). I think that check could be safely omitted from the extend_tips step, but not from obtain_tips. (That was part of the original motivation for having those be two separate steps.)

Contributor

@teor2345 teor2345 left a comment

I'm seeing a lot of errors and state issues after this PR. But the main branch and #1198 work fine.

Oct 23 16:42:13.201  INFO sync:obtain_tips:download_and_verify{hash=block::Hash("00000060a9b03c2f920c3e6e0b92c5b8a253d035222dd7e6a379a126a9682135")}:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=4000 current_range=(Excluded(Height(0)), Included(Height(4000)))
Oct 23 16:42:13.601  WARN sync: zebrad::components::sync: e=Checkpoint(Duplicate { height: Height(215) })
Oct 23 16:42:13.601  INFO sync: zebrad::components::sync: waiting to restart sync timeout=45s
Oct 23 16:42:58.603  INFO sync: zebrad::components::sync: starting sync, obtaining new tips
Oct 23 16:42:58.603  INFO sync:obtain_tips: zebra_state::util: created block locator tip_height=Height(2) min_locator_height=0 locators=[Height(2), Height(1), Height(0)]

Note that 4000 blocks were verified, but the block locator starts at height 2.

dconnolly
dconnolly previously approved these changes Oct 25, 2020
@hdevalence hdevalence dismissed teor2345’s stale review October 26, 2020 18:53

this problem is in the state, not the sync code in this PR

hdevalence and others added 18 commits October 26, 2020 11:54
We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse.  We
need both places, or else if we started processing a request, we
wouldn't process the cancellation until the timeout elapsed.

The first is a check that the oneshot is not already canceled.

For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
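
A minimal sketch of those two checks, assuming a tokio oneshot carries the response back to the caller; the function and types below are illustrative stand-ins, not the actual zebra-network `Connection` internals:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

/// Simplified stand-in for driving one request on a peer connection.
async fn drive_request<F>(
    mut response_tx: oneshot::Sender<String>,
    peer_response: F,
    request_timeout: Duration,
) where
    F: Future<Output = String>,
{
    // Check 1: before transitioning into AwaitingResponse, check that the
    // caller hasn't already dropped its end of the oneshot (i.e. canceled).
    if response_tx.is_closed() {
        return;
    }

    // Check 2: while in AwaitingResponse, race the peer's answer against a
    // timeout and against the caller dropping the response channel.
    tokio::select! {
        _ = response_tx.closed() => {
            // The request was canceled after it reached the connection;
            // drop it instead of waiting for the full timeout to elapse.
        }
        result = timeout(request_timeout, peer_response) => {
            if let Ok(response) = result {
                // The send can still fail in a race with cancellation.
                let _ = response_tx.send(response);
            }
        }
    }
}
```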
These messages might be unsolicited, or they might be a response to a
request we already canceled.  So don't fail the whole connection, just
drop the message and move on.
This makes two changes relative to the existing download code:

1.  It uses a oneshot to attempt to cancel the download task after it
    has started;

2.  It encapsulates the download creation and cancellation logic into a
    Downloads struct.
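
A hedged sketch of that encapsulation; the field and method names here are guesses for illustration, not the exact zebrad `Downloads` API:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

/// Placeholder standing in for zebra_chain::block::Hash, to keep the sketch
/// self-contained.
type BlockHash = [u8; 32];

/// Illustrative download set; the real zebrad type wraps network and
/// verifier services rather than a bare fetch function.
#[derive(Default)]
struct Downloads {
    /// Cancellation handles for in-flight downloads, keyed by block hash.
    cancel_handles: HashMap<BlockHash, oneshot::Sender<()>>,
    /// Handles to the spawned download-and-verify tasks.
    pending: Vec<JoinHandle<()>>,
}

impl Downloads {
    /// Queue a download that can still be canceled after it has started.
    fn download_and_verify(&mut self, hash: BlockHash) {
        let (cancel_tx, mut cancel_rx) = oneshot::channel::<()>();
        self.cancel_handles.insert(hash, cancel_tx);

        let task = tokio::spawn(async move {
            tokio::select! {
                _ = &mut cancel_rx => {
                    // The download set was flushed; stop early.
                }
                _ = fetch_and_verify(hash) => {
                    // Download and verification ran to completion.
                    // A fuller version would also remove the handle here.
                }
            }
        });
        self.pending.push(task);
    }

    /// Cancel everything in flight, e.g. when the sync restarts.
    fn cancel_all(&mut self) {
        // Dropping the senders closes each channel, which wakes the
        // corresponding select! above and lets the task exit early.
        self.cancel_handles.clear();
    }
}

/// Stand-in for the real download + verification future.
async fn fetch_and_verify(_hash: BlockHash) {
    // ... issue a block request and feed the result to the verifier ...
}
```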
Try to use the better cancellation logic to revert to the previous sync
algorithm.  As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handling errors by flushing the
pipeline and starting over.  This hasn't worked well, because we didn't
previously cancel tasks properly.  Now that we can, try to use something
in the spirit of the original sync algorithm.
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too short on testnet.
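
The hedge middleware itself comes from the tower ecosystem; as a rough illustration of the underlying idea rather than the middleware's actual API, a hand-rolled hedge over recorded latencies might look like the sketch below, where all names are hypothetical:

```rust
use std::future::Future;
use std::time::{Duration, Instant};

/// Toy latency tracker: keeps recent request latencies and estimates a percentile.
struct LatencyHistogram {
    samples: Vec<Duration>,
}

impl LatencyHistogram {
    fn new() -> Self {
        Self { samples: Vec::new() }
    }

    fn record(&mut self, latency: Duration) {
        self.samples.push(latency);
    }

    /// Rough p-th percentile of the recorded samples.
    fn percentile(&self, p: f64) -> Option<Duration> {
        if self.samples.is_empty() {
            return None;
        }
        let mut sorted = self.samples.clone();
        sorted.sort();
        let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
        Some(sorted[idx])
    }
}

/// Issue `request`, and if it is still pending after the observed p95
/// latency, race a second copy against it and take whichever finishes first.
async fn hedged<F, Fut, T>(histogram: &mut LatencyHistogram, request: F) -> T
where
    F: Fn() -> Fut,
    Fut: Future<Output = T>,
{
    // Fall back to a generous fixed delay until we have enough data.
    let hedge_after = histogram
        .percentile(0.95)
        .unwrap_or(Duration::from_secs(20));

    let start = Instant::now();
    let first = request();
    tokio::pin!(first);

    let result = tokio::select! {
        r = &mut first => r,
        _ = tokio::time::sleep(hedge_after) => {
            // The first attempt is slower than 95% of recent requests:
            // pre-emptively send a second copy and take the earliest answer.
            tokio::select! {
                r = &mut first => r,
                r = request() => r,
            }
        }
    };

    histogram.record(start.elapsed());
    result
}
```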
This lets us keep the main loop simple and just write `continue 'sync;`
to keep going.
Using the cancel_handles, we can deduplicate requests.  This is
important to do, because otherwise when we insert the second cancel
handle, we'd drop the first one, cancelling an existing task for no
reason.
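
Continuing the illustrative `Downloads` sketch above, deduplication then reduces to a key check before inserting a new cancel handle:

```rust
impl Downloads {
    /// Queue a download only if it isn't already in flight; otherwise the
    /// new cancel handle would replace (and therefore cancel) the old one.
    fn queue_if_new(&mut self, hash: BlockHash) -> bool {
        if self.cancel_handles.contains_key(&hash) {
            return false;
        }
        self.download_and_verify(hash);
        true
    }
}
```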
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API.  So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
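
Layering the timeout at the call site could look roughly like this, using tower's timeout wrapper; the helper name and the `peers` service are placeholders, not real zebrad items:

```rust
use std::time::Duration;
use tower::timeout::Timeout; // needs tower's "timeout" feature

/// Illustrative helper: wrap whatever peer service the caller obtained from
/// zebra-network in a caller-chosen timeout.
fn with_timeout<S>(peers: S, limit: Duration) -> Timeout<S> {
    Timeout::new(peers, limit)
}
```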
The previous debug output printed a message that the chain verifier had
received a block.  But this provides no additional information compared
to printing no message in chain::Verifier and a message in whichever
verifier the block was sent to, since the resulting spans indicate where
the block was dispatched.

This commit also removes the "unexpected high block" detection; this was
an artefact of the original sync algorithm failing to handle block
advertisements, but we don't have that problem any more, so we can
simplify the code by eliminating that logic.
This reveals that there may be contention in access to the state, as
this takes a long time.
There's no reason to return a pre-Buffer'd service (there's no need for
internal access to the state service, as in zebra-network), but wrapping
it internally removes control of the buffer size from the caller.
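
A toy example of the resulting call site, assuming the state constructor now returns an unbuffered `tower::Service`; the service itself is faked here so the snippet is self-contained:

```rust
use tower::{buffer::Buffer, service_fn, Service, ServiceExt};

#[tokio::main]
async fn main() {
    // Stand-in for the unbuffered state service the constructor now returns.
    let state_service = service_fn(|req: String| async move {
        Ok::<String, std::convert::Infallible>(format!("response to {req}"))
    });

    // The caller, not the state crate, now decides how deep the queue is.
    let mut buffered_state = Buffer::new(state_service, 20);

    // Drive one request through the buffered service.
    let response = buffered_state
        .ready()
        .await
        .expect("buffer worker is running")
        .call("state query".to_string())
        .await;
    println!("{response:?}");
}
```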
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).

Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip.  To mitigate this, a check against the state was
added.  However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state.  Because the sync algorithm now has a
`CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.

This change results in significantly smoother progress on mainnet.
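
A rough sketch of the linkage check, with a placeholder hash type; the real `CheckedTip` fields and the surrounding response handling are more involved:

```rust
/// Placeholder hash type, standing in for zebra_chain::block::Hash.
type BlockHash = [u8; 32];

/// A chain tip we already queued for download, plus the hash we expect the
/// next response segment to start from.
#[derive(Clone, Copy, Debug)]
struct CheckedTip {
    tip: BlockHash,
    expected_next: BlockHash,
}

/// Accept a new segment of hashes only if it actually extends the tip:
/// the first hash of the response must be the one we expected next.
fn extend_tip(tip: CheckedTip, segment: &[BlockHash]) -> Option<CheckedTip> {
    match segment {
        [first, .., last] if *first == tip.expected_next => Some(CheckedTip {
            tip: *last,
            // In the real protocol the next expected hash comes from the
            // peer's response; reusing the last hash is just a stand-in.
            expected_next: *last,
        }),
        // Unrelated or out-of-order response: discard it without touching
        // the state.
        _ => None,
    }
}
```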
We already use an actor model for the state service, so we get an
ordered sequence of state queries by message-passing.  Instead of
performing reads in the futures we return, this commit performs them
synchronously.  This means that all sled access is done from the same
task, which

(1) might reduce contention
(2) allows us to avoid using sled transactions when writing to the
state.

Co-authored-by: Jane Lusby <jane@zfnd.org>


Co-authored-by: Jane Lusby <jane@zfnd.org>
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
Co-authored-by: teor <teor@riseup.net>
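
As a sketch of that pattern, with made-up request and response types: the actor does the read while handling the message, then hands back an already-resolved future, so every access to the backing store stays on the actor's task:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;

/// Placeholder request and response types, not the real state API.
enum Request {
    Block { hash: [u8; 32] },
}

#[derive(Debug)]
enum Response {
    Block(Option<Vec<u8>>),
}

type BoxFuture = Pin<Box<dyn Future<Output = Response> + Send>>;

struct StateService {
    // In the real service this is a sled database handle.
    blocks: HashMap<[u8; 32], Vec<u8>>,
}

impl StateService {
    /// Handle one request from the actor's message queue.
    fn call(&mut self, req: Request) -> BoxFuture {
        match req {
            Request::Block { hash } => {
                // Do the read *now*, on the actor task, so all storage
                // access happens from a single task...
                let block = self.blocks.get(&hash).cloned();
                // ...and return a future that is already complete.
                Box::pin(async move { Response::Block(block) })
            }
        }
    }
}
```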
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
@hdevalence
Contributor Author

Rebased again because Dependabot caused a merge conflict. The clippy errors are unrelated to this PR.

@hdevalence hdevalence merged commit 4c960c4 into main Oct 26, 2020
@hdevalence hdevalence deleted the download-set branch October 26, 2020 19:05