refactor(meta): reorganize global barrier manager field #18920

wenym1 · 2024-10-15T08:54:25Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR includes some general refactors on the global barrier manager.

First, currently we have GlobalBarrierManager and GlobalBarrierManagerContext. GlobalBarrierManager is actually a owned worker that works in a loop to handle different async barrier event, and GlobalBarrierManagerContext holds some shared structs and is used by both the GlobalBarrierManager worker and other external usages (e.g. DdlController and etc). In this PR, the GlobalBarrierManager worker is renamed to GlobalBarrierWorker, and GlobalBarrierManagerContext is renamed to GlobalBarrierWorkerContext, and will be used by only GlobalBarrierWorker. The external usages will only use a new GlobalBarrierManager, which contains only the fields used by the external usages. With this refactor, only the GlobalBarrierManager has the pub visibility, and the GlobalBarrierWorker and the GlobalBarrierWorkerContext will only be visible within the crate.

Second, in the code of barrier manager, the tuple (prev_epoch, curr_epoch, barrier_kind) appears together in many codes. In this PR, we extract them to a new struct BarrierInfo that holds the 3 fields, and some related code can be less verbose.

Third, the BarrierWorkerState will be moved from GlobalBarrierWorker to CheckpointControl, and the logic of handle_new_barrier in GlobalBarrierWorker is narrowed down to only within CheckpointControl, so that it becomes a method of CheckpointControl only. Besides, previously in GlobalBarrierWorker, we have a separate field pending_non_checkpoint_barriers, which tracks the inflight non-checkpoint barrier before the next checkpoint barrier. In this PR, the field is moved to BarrierWorkerState, and we introduce a method next_barrier_info to generate a BarrierInfo of the next barrier in a single method call. In handle_new_barrier, the BarrierWorkerState becomes immutable after apply_command.

Fourth, the CompletingTask is moved out of CheckpointControl to GlobalBarrierWorker, which decouples the logical CheckpointControl from the physical join handle that calls commit_epoch. In the future PR that supports database isolation, we will have a CheckpointControl per database, but still a single global CompletingTask.

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

gitguardian · 2024-10-15T08:54:32Z

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.

^{_{🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.}}

hzxa21

Generally LGTM. Thanks for the refactoring!

hzxa21 · 2024-10-25T04:15:13Z

src/meta/src/barrier/mod.rs

@@ -762,17 +817,33 @@ impl GlobalBarrierManager {
                        _ => {}
                    }
                }
+                complete_result = self
+                    .completing_task
+                    .next_completed_barrier(


Any specific reason why we change to prefer next_completed_barrier over next_complete_barrier_response in this PR.

Actually both the two methods are still there. The name of next_complete_barrier_response may be confusing, because it actually means the collect barrier response. The reason for the naming previously is because the name of the protobuf message of the response type is CompleteBarrierResponse. I have changed the name to next_collect_barrier_response.

Sorry for the confusion. What I was asking here is not about the naming, but about the priority of the select arm. Prior to this PR, we poll next_completed_barrier first and then next_complete_barrier_response but after this PR we switch the order.

In this PR, the next_completed_barrier takes the mutable reference of control_stream_manager to generate the future (though the mutable reference is used, the generated future won't capture the mutable reference).

In the original poll order, the mutable reference of control_stream_manager will have been captured by the future generated from next_complete_barrier_response , and cannot be used when generating the future of next_completed_barrier .

hzxa21 · 2024-10-25T13:57:17Z

src/meta/src/barrier/mod.rs

+            if !finished_jobs.is_empty()
+                && let Some((_, control_stream_manager)) = &mut context
+            {
+                control_stream_manager.remove_partial_graph(


Are we planning to support resumable MV creation after meta failover? We move remove_partial_graph from after commit_epoch to before commit_epoch, which means that we won't be able to re-send the "finished" epoch to creating MV after recovery. Will this affect failover implementation in the future?

The remove_partial_graph is just a message to clean up the in memory state of the finished partial graph, and will not affect any committed state. After failover, the partial graph can be resumed whenever there are some recovered creating streaming jobs.

src/meta/src/barrier/mod.rs

hzxa21 · 2024-10-25T14:14:38Z

src/meta/src/barrier/mod.rs

    hummock_version_stats: HummockVersionStats,

    create_mview_tracker: CreateMviewProgressTracker,

-    context: GlobalBarrierManagerContext,
+    context: GlobalBarrierWorkerContext,
 }

 impl CheckpointControl {


I think mod.rs is becoming a bit too large, which also affects reability. impl CheckpointControl, impl GlobalBarrierWorker and impl GlobalBarrierManager are somehow mixed together. I can imagine for people who are not familiar with logics here can easily get lost. Now we have clearer cut of responsibility for CheckpointControl and GlobalBarrierWorker, I am thinking maybe we should have separate files for them and maybe for other structs as well. We can do it in a separate PR or along with the future refactoring PR.

Will reorganize the code after implementing the database isolation feature, because there are several pending PR, and if we reorganize the code in this PR, the pending PRs will have many conflicts.

src/meta/src/barrier/state.rs

kwannoel · 2024-10-25T15:09:30Z

The external usages will only use a new GlobalBarrierManager, which contains only the fields used by the external usages

IIUC GlobalBarrierManager will be the external interface used by other services, e.g. DdlController, ScaleServiceImpl? Then GlobalBarrierWorker takes care of the event loop, handling barrier events. If so this change LGTM.

src/meta/src/barrier/mod.rs

github-actions bot added type/refactor ci/run-e2e-single-node-tests labels Oct 15, 2024

wenym1 changed the base branch from yiming/commit-multi-graph-together to yiming/snapshot-backfill-executor-backpressure October 15, 2024 10:41

Base automatically changed from yiming/snapshot-backfill-executor-backpressure to main October 18, 2024 06:23

refactor(meta): reorganize global barrier manager field

6bb2755

wenym1 force-pushed the yiming/global-barrier-manager-refactor branch from 96609a6 to 6bb2755 Compare October 18, 2024 06:41

wenym1 added 3 commits October 20, 2024 02:42

Merge branch 'main' into yiming/global-barrier-manager-refactor

ccf4e02

refine

e4bc464

decouple checkpoint control and completing task

d564eaf

wenym1 added the ci/run-backfill-tests label Oct 21, 2024

wenym1 requested review from hzxa21, yezizp2012, kwannoel and zwang28 October 21, 2024 10:02

wenym1 added 2 commits October 21, 2024 18:24

ack complete in recovery

7b32313

ack in recovery

e8b9526

wenym1 mentioned this pull request Oct 22, 2024

Tracking: support partial checkpoint #14041

Open

25 tasks

hzxa21 approved these changes Oct 25, 2024

View reviewed changes

kwannoel reviewed Oct 25, 2024

View reviewed changes

src/meta/src/barrier/state.rs Outdated Show resolved Hide resolved

kwannoel reviewed Oct 25, 2024

View reviewed changes

src/meta/src/barrier/mod.rs Show resolved Hide resolved

address comment

b382dd4

wenym1 added this pull request to the merge queue Oct 28, 2024

Merged via the queue into main with commit 6263ea6 Oct 28, 2024
31 of 32 checks passed

wenym1 deleted the yiming/global-barrier-manager-refactor branch October 28, 2024 05:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(meta): reorganize global barrier manager field #18920

refactor(meta): reorganize global barrier manager field #18920

wenym1 commented Oct 15, 2024 •

edited

Loading

gitguardian bot commented Oct 15, 2024 •

edited

Loading

hzxa21 left a comment

hzxa21 Oct 25, 2024

wenym1 Oct 28, 2024

hzxa21 Oct 28, 2024

wenym1 Oct 28, 2024

hzxa21 Oct 25, 2024

wenym1 Oct 28, 2024

hzxa21 Oct 25, 2024

wenym1 Oct 28, 2024

kwannoel commented Oct 25, 2024

refactor(meta): reorganize global barrier manager field #18920

refactor(meta): reorganize global barrier manager field #18920

Conversation

wenym1 commented Oct 15, 2024 • edited Loading

What's changed and what's your intention?

Checklist

Documentation

Release note

gitguardian bot commented Oct 15, 2024 • edited Loading

️✅ There are no secrets present in this pull request anymore.

hzxa21 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwannoel commented Oct 25, 2024

wenym1 commented Oct 15, 2024 •

edited

Loading

gitguardian bot commented Oct 15, 2024 •

edited

Loading