Fix cft election #1641
fix_cft_election@13369 aka 20200930.8 vs master ewma over 50 builds from 12767 to 13349
As Lines 59 to 63 in 3ca2946
The solution in this PR seems simpler than what is described in #589 (comment). Since this is an important change, I think we should record the "why" of the changes in this PR (either in the original ticket or in this PR) as well as the "how".
@eddyashton pointed out that this probably isn't right, and that the election should strictly be based on committable_idx/term(committable_idx). I don't have time just now, but I'll update the PR and describe that case later.
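For readers following the thread, here is a minimal sketch of what "strictly based on committable_idx/term(committable_idx)" could look like. It assumes the standard Raft "at least as up to date" comparison, applied to the last committable position instead of the last log entry. The names `LogPosition` and `candidate_is_up_to_date` are hypothetical and are not taken from the CCF code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPosition:
    """Hypothetical pair: (term(committable_idx), committable_idx)."""
    term: int
    index: int


def candidate_is_up_to_date(candidate: LogPosition, voter: LogPosition) -> bool:
    """Return True if the voter may grant its vote to the candidate.

    Assumption: same rule as the usual Raft log comparison, but evaluated
    on the last committable position rather than the last log entry.
    """
    if candidate.term != voter.term:
        # The position with the later term wins outright.
        return candidate.term > voter.term
    # Same term: the candidate must not be behind the voter.
    return candidate.index >= voter.index
```

Under this rule, a node holding extra entries that are not yet committable gains no advantage in the election, because only the committable position is compared.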
PR is up to date, please review. There are a few more tests I want to add before it's ready.
Co-authored-by: Eddy Ashton <ashton.eddy@gmail.com>
Looking at the test failure, the election change invalidates our current view history reconstruction logic. In particular, we can no longer assume that the commit index is in the term being replayed, and we cannot issue signature transactions until we have "caught up" with previous terms. I am making the necessary changes to address this and will change the PR status again once it is ready.
raft_test is still failing because we replicate transactions that are marked as committable but do not deserialise as signatures, so committable indices aren't correct on followers. Now that our term history depends on committable rather than committed indices, it is inaccurate as well. Will fix tomorrow.
DOCTEST_CHECK(r2.get_last_idx() == 3);
}
}
}
Axed in favour of the end to end committable.py for now. The existing raft test scaffolding doesn't have a way to accurately replicate signatures at the moment.
# Suspend three of the backups to prevent commit
backups[1].suspend()
backups[2].suspend()
backups[3].stop()
What's the difference between suspend() and stop() here? (We suspend 2 and stop 1?)
suspend() allows resuming, whereas stop() kills the node. We do not need backup 3 anymore, since we want only three nodes to force a unanimous election, and it's unnecessary to operate after that. We only ever start backup 4 because we want f=2.
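The arithmetic behind this answer can be sketched as follows. It assumes a five-node network (one primary plus four backups, which is what f=2 implies under the usual crash-fault-tolerance sizing n = 2f + 1) and a simple-majority quorum. The helper names are hypothetical and only illustrate the reasoning.

```python
def fault_tolerance(n_nodes: int) -> int:
    """Number of crashed nodes a CFT network of n_nodes can tolerate."""
    return (n_nodes - 1) // 2


def majority(n_nodes: int) -> int:
    """Smallest number of nodes that forms a simple majority."""
    return n_nodes // 2 + 1


def can_make_progress(n_nodes: int, n_live: int) -> bool:
    """True if the live nodes are enough to commit or win an election."""
    return n_live >= majority(n_nodes)


def election_must_be_unanimous(n_nodes: int, n_live: int) -> bool:
    """True if every live node has to vote for the same candidate."""
    return n_live == majority(n_nodes)


N_NODES = 5  # assumption: primary + 4 backups

# Two backups suspended and one stopped leaves 2 live nodes: no commit.
commit_blocked = not can_make_progress(N_NODES, n_live=2)

# With exactly three live nodes, a candidate needs all three votes.
unanimous = election_must_be_unanimous(N_NODES, n_live=3)
```

With these assumptions, `fault_tolerance(5)` is 2, two live nodes cannot commit, and three live nodes can only elect a primary if all of them vote for it.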
uc.post("/app/log/private", {"id": 100 + i, "msg": "Hello world"})
)

# Wait for a signature to ensure those transactions are committable
This kind of sleep is something we do in several places. It now potentially doesn't do what was intended: if we're in the fancy election state and Raft is not returning a signable index, we simply skip signatures rather than producing them at regular intervals.
This is an argument in favour of continuing to produce signatures (making a max-length block of transactions committable) without any global commit claim in them, but that's uglier for ledger parsing. I'm happy to keep the current behaviour (signatures are sometimes skipped) for this PR and discuss alternative approaches offline or in future.
Yes, that's true, although that will not cause the test to silently pass.
The issue here is that we do not have a way to observe what's committable. The real condition would be that the last of those transactions is committable on backups[1]. Even if we shortened sig-tx-interval, we wouldn't know that a signature had been produced and replicated to backups[1].
…nto fix_cft_election
Co-authored-by: Eddy Ashton <ashton.eddy@gmail.com>
Reconfiguration suite failure in nightly
Fixes #589
In a nutshell, the change amounts to: