Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft: use a larger initial heartbeat/election timeout #15042

Merged
merged 16 commits into from
Apr 29, 2022

Conversation

ncabatoff
Copy link
Collaborator

Here we leave initialTimeoutMultiplier=1 and don't yet have the newer raft lib allowing us to reload heartbeat/election timeout config.

…lier=1 and don't yet have the newer raft lib allowing us to reload heartbeat/election timeout config.
…t multiplier (3x), and making use of the new raft lib's ability to reload heartbeat/election timeout config after coming up.
@vercel vercel bot temporarily deployed to Preview – vault-storybook April 14, 2022 16:37 Inactive
@vercel vercel bot temporarily deployed to Preview – vault April 14, 2022 16:37 Inactive
@ncabatoff ncabatoff changed the title Demonstrate the issue with a test. Raft: use a larger initial heartbeat/election timeout Apr 21, 2022
@ncabatoff ncabatoff marked this pull request as ready for review April 27, 2022 15:07
@ncabatoff ncabatoff requested a review from HridoyRoy April 27, 2022 15:50
physical/raft/fsm.go Outdated Show resolved Hide resolved
Copy link
Contributor

@HridoyRoy HridoyRoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!


leader := state.Leader
deadline := time.Now().Add(2 * time.Minute)
for time.Now().Before(deadline) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how this test circles back to the addition and use of the reloadable config. Could you clarify a little on this please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was failing originally (see first commit). So it's basically just validating that restarting (or sealing/unsealing) a node doesn't provoke new elections, by virtue of the change to the initial timeout. We're using a perf multiplier of 5, and the default timeout is 1s, so by sleeping for 10s between seal/unseal we're hoping to trigger (or not, now that we have the fix) an election if the bug is still present.

physical/raft/raft.go Outdated Show resolved Hide resolved
…onfig can trigger elections, and if all nodes trigger an election at once that can delay things, causing test failures due to timeouts.
…o as high as 48s if you include the random factor, timeouts elsewhere must be increased.
physical/raft/fsm.go Outdated Show resolved Hide resolved
physical/raft/raft.go Outdated Show resolved Hide resolved
physical/raft/raft.go Show resolved Hide resolved
@ncabatoff ncabatoff merged commit dc91661 into main Apr 29, 2022
@ncabatoff ncabatoff deleted the vault-5690-raft-larger-initial-timeout branch April 29, 2022 12:32
@ncabatoff ncabatoff added this to the 1.10.3 milestone May 3, 2022
ncabatoff added a commit that referenced this pull request May 6, 2022
ncabatoff added a commit that referenced this pull request May 6, 2022
ncabatoff added a commit that referenced this pull request May 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants