[Bifrost] Design improvements for find_tail #2593

Draft: wants to merge 1 commit into base: main


@AhmedSoliman (Contributor) commented Jan 30, 2025

This PR introduces a few changes that significantly improve failover time and the performance of common operations like `find_tail()`.
The result is a failover time in the hundreds of milliseconds in the happy path and on the order of a couple of seconds in the unhappy path. `find_tail()` is also now significantly cheaper to run if the sequencer is running, which enables parallelizing `find_tail()` runs in the logs controller (and running them more frequently). The latter will be reflected in a separate PR.

github-actions bot commented Jan 30, 2025

Test Results

  7 files  ±0    7 suites  ±0   3m 32s ⏱️ -55s
 45 tests  - 2   44 ✅  - 2  1 💤 ±0  0 ❌ ±0 
174 runs   - 8  171 ✅  - 8  3 💤 ±0  0 ❌ ±0 

Results for commit 638c13b. ± Comparison against base commit 3b7fd88.

This pull request removes 2 tests.
dev.restate.sdktesting.tests.AwaitTimeout ‑ timeout(Client)
dev.restate.sdktesting.tests.RawHandler ‑ rawHandler(Client)

♻️ This comment has been updated with latest results.

@AhmedSoliman AhmedSoliman changed the title More logging changes [Bifrost] Design improvements for find_tail Jan 31, 2025
@pcholakov (Contributor)

A five-minute test with random partitions looks good! I will re-run it a few more times to see how it behaves.

https://github.com/restatedev/jepsen/actions/runs/13080247655/job/36501963775

[Image: latency-raw]

The gaps during the partitions (grey bars) suggest that no processing is happening, and sometimes we don't recover for a couple of on/off cycles, so 15-25s after the first partition event. I added some client-side timeouts to the test driver to ride out some of the short-term unavailability, but I want to double-check that I'm not starving Jepsen's worker threads with these. I was under the impression that there is dedicated concurrency per Restate worker node.

Some more analysis is required, but at first glance it looks like a big improvement over before, with no long-term lockups.
