Full rework of the BlockFetch logic for bulk sync mode #1179
Conversation
LGTM: please remove any pending REVIEW comments.
Co-authored-by: Nicolas Bacquey <nicolas.bacquey@tweag.io>
The releasing branch has been rebased on top of this one and there is now a test failure there. Is this expected? @Niols https://github.com/IntersectMBO/ouroboros-consensus/pull/1314/checks?check_run_id=34400813212
Looking at the failure case:

The test looks reasonable, except … The genesis window is of size 8, the honest chain has 2 blocks in there while … So either the test is flaky and we didn't catch that before and we're unlucky, or it's actually showing a subtle problem somewhere that we are not aware of, or maybe some things changed in the release branch in a way that impacts this. Will have a closer look.
Checked out this repo on branch … in the development environment and got: …

I'm not super up-to-date on how to run this test suite; is there anything I'm missing here?
Since the previous messages, I have been in contact with @jasagredo on the side. We have tracked down this test failure and concluded the following two things: …

For posterity, I will add some details in the following messages, but that was the gist of it. In the coming days, I will push a small fix for this peer-simulator bug, probably just by re-enabling timeouts for the …
Now, on … this yields: …

...and here are the full logs, obtained by enabling tracing.

The story is the following: @jasagredo and I tracked it down, and it turns out that the ChainSyncClient for the adversary is stuck in the …

So far so good. The test fails because nothing unblocks the situation, so we stay stuck at this intersection forever. In fact, the ChainSync mini-protocol of the adversary stays in …

Now at this point I am not entirely sure whether there was a deep reason to disable those timeouts or if it was a bit of an oversight. Here is the corresponding test case:

```haskell
-- | Test that the leashing attacks do not delay the immutable tip after. The
-- immutable tip needs to be advanced enough when the honest peer has offered
-- all of its ticks.
--
-- See Note [Leashing attacks]
prop_leashingAttackTimeLimited :: Property
prop_leashingAttackTimeLimited =
  forAllGenesisTest
    ( disableCanAwaitTimeout . disableBoringTimeouts
        <$> genChains (QC.choose (1, 4)) `enrichedWith` genTimeLimitedSchedule
    )
    defaultSchedulerConfig
      { scTrace = True
      , scEnableLoE = True
      , scEnableLoP = True
      , scEnableCSJ = True
      , scEnableBlockFetchTimeouts = False
      }
    shrinkPeerSchedules
    theProperty
```

Note in particular the definition of `disableBoringTimeouts`:

```haskell
disableBoringTimeouts :: GenesisTest blk schedule -> GenesisTest blk schedule
disableBoringTimeouts gt =
  gt
    { gtChainSyncTimeouts =
        (gtChainSyncTimeouts gt)
          { mustReplyTimeout = Nothing
          , idleTimeout = Nothing
          }
    }
```

The conclusion is twofold: …
Come to think of it, I think the mix-up is that “can await” sounds much nicer and less urgent than “must reply”, at least in my head, even though those terms don't per se carry any time meaning. In practice, in …
As for the probability of the test failure: overnight I measured 22 test failures over 2,000,000 test cases. This means the failure is very unlikely to happen, and that if my fix passes the same number of test cases over the next night, we can be pretty confident that it was a good fix.
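As a back-of-the-envelope check on that reasoning (my own arithmetic, not part of the test suite): if the per-test failure probability really is about 22 in 2,000,000, a still-broken build passing another 2,000,000 tests would be an extraordinary coincidence.

```haskell
-- With per-test failure probability p ≈ 22 / 2e6, the probability of
-- seeing zero failures in n = 2e6 independent runs is (1 - p)^n ≈ e^(-22),
-- i.e. on the order of 1e-10.
probAllPass :: Double
probAllPass = (1 - p) ^ n
  where
    p = 22 / 2000000 :: Double
    n = 2000000 :: Int

main :: IO ()
main = print probAllPass
```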
- Always disable `mustReplyTimeout`; explain why
- Always disable `idleTimeout`; explain why
- Keep the others by default in all the tests

This should fix the bug discussed in #1179
Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
Integrates a new implementation of the BulkSync mode, where blocks are downloaded from alternative peers as soon as the node has no more blocks to validate while there are longstanding requests in flight.
This PR depends on the new implementation of the BulkSync mode (IntersectMBO/ouroboros-network#4919). `cabal.project` is made to point to a back-port of the BulkSync implementation on `ouroboros-network-0.16.1.1`.

CSJ Changes
CSJ is involved because the new BulkSync mode requires changing the dynamo if it is also serving blocks but is not sending them promptly enough. The dynamo choice influences which blocks BlockFetch chooses to download.
To this end, b93c379 gives the ability to order the ChainSync clients, so the dynamo role can be rotated among them whenever BlockFetch requests it.
b1c0bf8 provides the implementation of the rotation operation.
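The rotation idea can be pictured as follows. This is a toy sketch over a plain `Seq` of peers, not the actual handle collection from b1c0bf8: the client at the front plays the dynamo role, and rotating demotes it to the back so the next client in order takes over.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Toy model: the first client in the sequence holds the dynamo role;
-- rotating moves it to the back so the next client in order takes over.
rotateDynamo :: Seq peer -> Seq peer
rotateDynamo clients = case Seq.viewl clients of
  Seq.EmptyL    -> clients
  p Seq.:< rest -> rest |> p

main :: IO ()
main = print (toList (rotateDynamo (Seq.fromList ["dynamo", "peer2", "peer3"])))
-- prints ["peer2","peer3","dynamo"]
```

Keeping the clients in a single ordered structure is what makes "rotate" a cheap, well-defined operation that BlockFetch can request without knowing anything else about ChainSync internals.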
BlockFetch tests
c4bfa37 allows tests to specify the order in which peers are started, which affects which peer is chosen as the initial dynamo.
c594c09 in turn adds a new BlockFetch test to show that syncing isn't slowed down by peers that don't send blocks.
Integration of BlockFetch changes
The collection of ChainSync client handles now needs to be passed between BlockFetch and ChainSync so dynamo rotations can be requested by BlockFetch.
The parameter `bfcMaxConcurrencyBulkSync` has been removed, since block downloads are no longer coordinated to happen concurrently. These changes are in 6926278.
ChainSel changes
Now BlockFetch requires the ability to detect when ChainSel has run out of blocks to validate. This motivates 73187ba, which implements a mechanism to detect when ChainSel is waiting for more blocks (i.e. is starving) and to measure for how long.
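The mechanism can be pictured roughly like this. All names here are invented for illustration, not the actual ChainDB code: record when the validation queue last ran dry, and report the starvation's duration when the next block arrives.

```haskell
import Control.Concurrent.STM
import Data.Time (NominalDiffTime, UTCTime (..), diffUTCTime, fromGregorian)

-- Sketch with invented names: 'Nothing' means ChainSel currently has
-- blocks to validate; 'Just t' means it has been starving since time t.
newtype StarvationVar = StarvationVar (TVar (Maybe UTCTime))

-- Called by ChainSel when it finds its queue empty.
noteStarvationStart :: StarvationVar -> UTCTime -> STM ()
noteStarvationStart (StarvationVar v) now = do
  st <- readTVar v
  case st of
    Just _  -> pure ()                -- already starving; keep the start time
    Nothing -> writeTVar v (Just now)

-- Called when a block arrives; reports how long the starvation lasted.
noteBlockArrival :: StarvationVar -> UTCTime -> STM (Maybe NominalDiffTime)
noteBlockArrival (StarvationVar v) now = do
  st <- readTVar v
  writeTVar v Nothing
  pure (diffUTCTime now <$> st)
```

Putting the state in a `TVar` lets BlockFetch observe "ChainSel is starving, and has been for this long" in the same STM transactions it already uses for its decision logic.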
The above change is not sufficient to measure starvation. The queue used to send blocks for validation previously allowed only one block to sit in it at a time. This would interfere with measuring starvation: BlockFetch would block waiting for the queue to become empty, and the queue would become empty again as soon as ChainSel took that single block. For download delays to be amortized, a larger queue capacity was needed. This is why a fix similar to IntersectMBO/ouroboros-network#2721 is part of 0d3fc28.
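The queue-depth point can be sketched with a plain `TBQueue`; the actual ChainDB uses its own queue type, and the capacity of 16 below is illustrative, not the value chosen in 0d3fc28.

```haskell
import Control.Concurrent.STM

-- With a capacity of 1, BlockFetch blocks until ChainSel drains the
-- queue, and the queue runs empty the moment ChainSel takes its single
-- block, so ChainSel looks starved between every pair of downloads.
-- A deeper queue lets BlockFetch stay ahead of validation.
mkBlockQueue :: IO (TBQueue blk)
mkBlockQueue = newTBQueueIO 16  -- illustrative capacity
```

The producer side would do `atomically (writeTBQueue q blk)` and the consumer `atomically (readTBQueue q)`; with a deeper queue, a momentary download delay no longer immediately shows up as a ChainSel starvation.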
Miscellaneous fixes
CSJ jump size adjustment
When syncing from mainnet, we discovered that CSJ wouldn't sync the blocks from the Byron era. This was because the jump size was set to the length of the genesis window of the Shelley era, which is much larger than Byron's. When the jump size is larger than the genesis window, the dynamo will block on the forecast horizon before offering a jump that allows the chain selection to advance. In this case, CSJ and chain selection will deadlock.
For this reason we set the default jump size to the size of Byron's genesis window in 028883a. This didn't show an impact on syncing time in our measures. Future work (as part of deploying Genesis) might involve allowing the jump size to vary between different eras.
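The invariant can be stated as a simple check. The numbers below are placeholders, not the real protocol parameters: what matters is only that Byron's genesis window is much smaller than Shelley's.

```haskell
-- Sketch of the invariant: a jump size larger than some era's genesis
-- window means the dynamo reaches the forecast horizon before it can
-- offer a jump, and CSJ deadlocks with chain selection.
jumpSizeIsSafe :: Word -> [Word] -> Bool
jumpSizeIsSafe jumpSize genesisWindows = all (jumpSize <=) genesisWindows

main :: IO ()
main = do
  let byronWindow   = 4320  -- placeholder figure
      shelleyWindow = 8640  -- placeholder figure
  print (jumpSizeIsSafe shelleyWindow [byronWindow, shelleyWindow])  -- False
  print (jumpSizeIsSafe byronWindow   [byronWindow, shelleyWindow])  -- True
```

Picking the smallest genesis window across eras (Byron's) as the default jump size is the conservative choice that keeps the check true everywhere, at the cost of somewhat more frequent jumps.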
GDD rate limit
GDD evaluation showed an overhead of 10% if run after every header arrives via ChainSync. Therefore, in b7fa122 we limited how often it could run, so multiple header arrivals could be handled by a single GDD evaluation.
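The rate-limiting idea can be sketched as a coalescing loop. This is an invented helper, not the code in b7fa122: bursts of header arrivals merely set a flag, one evaluation services the whole burst, and a minimum pause separates evaluations.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.STM

-- One background loop: wait until at least one header has arrived,
-- clear the flag, run a single GDD evaluation for the whole burst,
-- then sleep for the minimum interval before looking again.
rateLimited :: Int -> TVar Bool -> IO () -> IO ()
rateLimited intervalMicros trigger evaluate = loop
  where
    loop = do
      atomically $ do
        pending <- readTVar trigger
        check pending             -- retry until a header has arrived
        writeTVar trigger False
      evaluate
      threadDelay intervalMicros
      loop
```

A caller would run this with `forkIO` and have ChainSync set the flag with `atomically (writeTVar trigger True)` on each header; however many headers arrive during one interval, at most one extra evaluation runs afterwards.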
Candidate fragment comparison in the ChainSync client
We stumbled upon a test case where the candidate fragments of the dynamo and an objector were no longer than the current selection (both peers were adversarial). This was problematic because BlockFetch would refuse to download blocks from these candidates, and ChainSync in turn would wait for the selection to advance in order to download more headers.
The fix in e27a73c is to have the ChainSync client disconnect a peer which is about to block on the forecast horizon if its candidate isn't better than the selection.
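The decision can be modelled very roughly as follows; the real client compares anchored candidate fragments against the selection using the chain order, not bare block numbers as in this sketch.

```haskell
data Decision = Disconnect | KeepWaiting
  deriving (Eq, Show)

-- Applied only when the peer is about to block on the forecast horizon:
-- a candidate that is not strictly better than the selection can never
-- advance the node, so keeping the peer around would just stall BlockFetch.
decideAtForecastHorizon :: Ord blockNo => blockNo -> blockNo -> Decision
decideAtForecastHorizon candidateTip selectionTip
  | candidateTip > selectionTip = KeepWaiting
  | otherwise                   = Disconnect
```

The strict inequality is the point: a candidate merely as good as the selection gives BlockFetch nothing to download, which is exactly the stuck state described above.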
Candidate fragment truncations
At the moment, it is possible for a candidate fragment to be truncated by CSJ when a jumper jumps to a point that is not younger than the tip of its current candidate fragment. We encountered tests where the jump point could be so old that it would fall behind the immutable tip, and GDD would ignore the peer when computing the Limit on Eagerness. This in turn would cause the selection to advance into potentially adversarial chains.
The fix in dc5f6f7 is to have GDD never drop candidates. When the candidate does not intersect the current selection, the LoE is not advanced. This is a situation guaranteed to be unblocked by the ChainSync client since it will either disconnect the peer or bring the candidate to intersect with the current selection.
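The LoE computation change can be pictured with a toy model over comparable points instead of fragments (invented names, simplified semantics): a candidate with no intersection pins the LoE at its current anchor instead of being dropped from the computation.

```haskell
import Data.Maybe (isNothing)

-- Toy model: each candidate is represented by its intersection with the
-- current selection; 'Nothing' means it does not intersect at all. Such a
-- candidate freezes the LoE at the current anchor rather than being
-- ignored, which is the fix sketched here. Otherwise the LoE advances to
-- the earliest intersection point, never regressing below the anchor.
loeTarget :: Ord pt => pt -> [Maybe pt] -> pt
loeTarget anchor intersections
  | any isNothing intersections = anchor  -- non-intersecting candidate: do not advance
  | null points                 = anchor  -- simplification: no candidates, keep anchor
  | otherwise                   = max anchor (minimum points)
  where
    points = [p | Just p <- intersections]
```

Freezing rather than dropping is safe precisely because of the guarantee stated above: the ChainSync client will eventually either disconnect the peer or bring its candidate back into intersection with the selection, unblocking the LoE.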