[Bifrost] Design improvements for find_tail #2593

AhmedSoliman · 2025-01-30T18:25:59Z

This PR introduces a few changes that significantly improve failover time and the performance of common operations like `find_tail()`.
The result is a failover time in the hundreds of milliseconds on the happy path and on the order of a couple of seconds on the unhappy path. `find_tail()` is also now significantly cheaper to run if the sequencer is running, which enables running `find_tail()` in parallel (and more frequently) in the logs controller. The latter change will be reflected in a separate PR.
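
To illustrate the kind of parallelization this makes practical, here is a minimal sketch. It is not the code in this PR: `LogId`, `Lsn`, and `find_tails_in_parallel` are hypothetical names, and the `find_tail` closure is a stand-in for whatever call actually resolves a log's tail. It only shows the shape of fanning out one cheap lookup per log and collecting the results.

```rust
use std::collections::BTreeMap;
use std::thread;

// Hypothetical types, for illustration only (not Bifrost's real types).
type LogId = u32;
type Lsn = u64;

/// Runs `find_tail` for every log on its own scoped thread and collects
/// the resulting tails keyed by log id.
///
/// `find_tail` is a stand-in closure; it is an assumption of this sketch
/// that each lookup is independent and cheap enough to run concurrently.
fn find_tails_in_parallel<F>(logs: &[LogId], find_tail: F) -> BTreeMap<LogId, Lsn>
where
    F: Fn(LogId) -> Lsn + Sync,
{
    // Share the closure by reference so every worker can call it.
    let find_tail = &find_tail;
    thread::scope(|scope| {
        // Spawn one worker per log; all workers are joined before the scope ends.
        let handles: Vec<_> = logs
            .iter()
            .map(|&log_id| scope.spawn(move || (log_id, find_tail(log_id))))
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().expect("find_tail worker panicked"))
            .collect::<BTreeMap<_, _>>()
    })
}
```

A caller would pass the list of log ids and a closure, for example `find_tails_in_parallel(&[1, 2, 3], |id| lookup(id))`, where `lookup` is again a placeholder. An async runtime would likely be a better fit in practice; plain threads are used here only to keep the sketch self-contained.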

github-actions · 2025-01-30T18:54:30Z

Test Results

7 files ±0, 7 suites ±0, 3m 32s ⏱️ -55s

| | Total | ✅ | 💤 | ❌ |
|---|---|---|---|---|
| tests | 45 (-2) | 44 (-2) | 1 (±0) | 0 (±0) |
| runs | 174 (-8) | 171 (-8) | 3 (±0) | 0 (±0) |

Results for commit 638c13b. ± Comparison against base commit 3b7fd88.

This pull request removes 2 tests.

dev.restate.sdktesting.tests.AwaitTimeout ‑ timeout(Client)
dev.restate.sdktesting.tests.RawHandler ‑ rawHandler(Client)

♻️ This comment has been updated with latest results.

pcholakov · 2025-01-31T21:00:35Z

A five-minute test with random partitions looks good! I will re-run it a few more times to see how it behaves.

https://github.com/restatedev/jepsen/actions/runs/13080247655/job/36501963775

The gaps during the partitions (grey bars) indicate that no processing seems to be happening, and sometimes we don't recover for a couple of on/off cycles, so 15-25s after the first partition event. I added some client-side timeouts to the test driver to ride out some of the short-term unavailability, but I want to double-check that I'm not starving Jepsen's worker threads with these. I was under the impression that there is dedicated concurrency per Restate worker node.

Some more analysis is required, but at first glance it looks like a big improvement over before, with no long-term lockups.

AhmedSoliman changed the title ~~More logging changes~~ [Bifrost] Design improvements for find_tail Jan 31, 2025

AhmedSoliman force-pushed the pr2593 branch from ae780f2 to b3b0704 Compare January 31, 2025 18:07

AhmedSoliman force-pushed the pr2593 branch from b3b0704 to 638c13b Compare January 31, 2025 18:18
