feat(meta): iterative streaming scheduler (part 2) #7659

Merged
30 commits merged into main from bz/new-scheduler-part-2 on Feb 7, 2023

Conversation

@BugenZhao (Member) commented on Feb 2, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

This PR is big, but it should be very readable. Please follow the inline comments while reviewing. Thanks! 🥰🥵

What's changed and what's your intention?

This PR fully utilizes the new scheduler in the process of creating streaming jobs, simplifying and generalizing a lot of Chain-related logic.

  • Extends the scheduler to also emit which parallel unit a singleton fragment should be placed on. Previously the scheduler only told us that a fragment should be a singleton, but did not decide whether it must be scheduled to a specific parallel unit. This is insufficient, as we need to distinguish No-Shuffle singletons (like an MV on a singleton MV) from normal singletons that can be scheduled anywhere (see the sketch after this list).

  • Uses the CompleteFragmentGraph for building actors. The complete graph contains both the fragment graph of the current job and the upstream fragments connected to it (that is, the upstream Materialize fragments). So when we visit the edge between an upstream Materialize and the current Chain, we can naturally fill in the Chain's distribution info and build the new dispatchers for the Materialize as usual. By doing this, we can completely remove patches like resolve_chain_node and make the whole process much clearer.

  • Removes the old Scheduler and related fields like same_worker_node and colocated_actor. Since we have full knowledge of each fragment's distribution (including Chain!) when building the actor graph, the ActorGraph can now be complete and immutable once built. At the same time, we can make the Context immutable as well, which improves readability a lot.
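
As a rough illustration of the first point, here's a minimal sketch (with simplified stand-in types, not the exact definitions in this PR) of the kind of decision the scheduler now emits per fragment:

// Stand-in types for illustration only; the actual definitions in the PR differ.
type ParallelUnitId = u32;
type HashMapping = Vec<ParallelUnitId>;

// What the scheduler decides for each fragment of the streaming job.
enum ScheduledDistribution {
    // Pinned to a specific parallel unit, e.g. forced through a No-Shuffle
    // edge from an upstream singleton Materialize.
    Singleton(ParallelUnitId),
    // A singleton with no placement requirement; the caller picks a default
    // parallel unit for it.
    DefaultSingleton,
    // Hash-distributed with a mapping required by the upstream fragment.
    Hash(HashMapping),
    // Hash-distributed with no requirement; the caller applies the default
    // hash mapping.
    DefaultHash,
}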

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

@@ -356,8 +359,6 @@ message ChainNode {
// large. However, in some cases, e.g., shared state, the barrier cannot be rearranged in ChainNode.
// ChainType is used to decide which implementation for the ChainNode.
ChainType chain_type = 4;
// Whether to place this chain on the same worker node as upstream actors.
bool same_worker_node = 5;
@BugenZhao (author) commented:

We remove this as it's actually always true, due to NoShuffle exchange.

@@ -577,8 +576,6 @@ message StreamActor {
// It is painstaking to traverse through the node tree and get upstream actor id from the root StreamNode.
// We duplicate the information here to ease the parsing logic in stream manager.
repeated uint32 upstream_actor_id = 6;
// Placement rule for actor, need to stay on the same node as a specified upstream actor.
ColocatedActorId colocated_upstream_actor_id = 7;
@BugenZhao (author) commented:

This field exists because the StreamActor message used to serve as the protocol between the old graph builder and the old scheduler. Since we remove the old scheduler and schedule fragments ahead of time, we can remove it.

// Dispatch strategy for the fragment.
DispatchStrategy dispatch_strategy = 1;
// Whether the two linked nodes should be placed on the same worker node
bool same_worker_node = 2;
@BugenZhao (author) commented:

This should be derived from the NoShuffle strategy.
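
For illustration, a minimal sketch (with simplified stand-in types, not the generated protobuf types) of deriving the colocation requirement from the dispatcher type instead of storing a separate flag:

// Stand-in types for illustration only.
#[derive(PartialEq)]
enum DispatcherType { Hash, Broadcast, Simple, NoShuffle }

struct DispatchStrategy { dispatcher_type: DispatcherType }

// Whether the two linked fragments must be placed on the same worker node is
// implied by the strategy itself: only NoShuffle requires colocation.
fn requires_same_worker_node(strategy: &DispatchStrategy) -> bool {
    strategy.dispatcher_type == DispatcherType::NoShuffle
}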

) -> Self {
let actor_status = actor_locations
@BugenZhao (author) commented:

We directly fill the Inactive state on initialization.
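
A minimal sketch (simplified stand-in types) of the idea: each actor's status is recorded with its location and the Inactive state right away, presumably switched to Running later once the streaming job is fully created:

use std::collections::BTreeMap;

// Stand-in types for illustration only.
type ActorId = u32;
type ParallelUnitId = u32;

enum ActorState { Inactive, Running }

struct ActorStatus {
    parallel_unit: ParallelUnitId,
    state: ActorState,
}

// Fill the status map directly from the known locations, marking every actor
// as Inactive on initialization.
fn initial_actor_status(
    actor_locations: &BTreeMap<ActorId, ParallelUnitId>,
) -> BTreeMap<ActorId, ActorStatus> {
    actor_locations
        .iter()
        .map(|(&actor_id, &parallel_unit)| {
            (actor_id, ActorStatus { parallel_unit, state: ActorState::Inactive })
        })
        .collect()
}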

@@ -264,25 +284,6 @@ impl TableFragments {
.collect()
}

/// Returns fragments that contains Chain node.
pub fn chain_fragment_ids(&self) -> HashSet<FragmentId> {
@BugenZhao (author) commented:

A lot of code in this file can be removed since resolve_chain is no longer needed.

Comment on lines +775 to +777
let (distribution, actor_ids) = match current_fragment {
// For building fragments, we need to generate the actor builders.
EitherFragment::Building(current_fragment) => {
@BugenZhao (author) commented:

It's possible that we're visiting an upstream Materialize (existing) fragment here, since the topological order is computed on the complete graph. So we need to distinguish between building and existing fragments.
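
A minimal sketch (simplified stand-in types) of the distinction being made here:

// Stand-in types for illustration only.
struct BuildingFragment;
struct ExistingFragment;

// A fragment visited on the complete graph is either part of the job being
// built or an already-running upstream fragment.
enum EitherFragment {
    Building(BuildingFragment),
    Existing(ExistingFragment),
}

fn visit(fragment: &EitherFragment) {
    match fragment {
        // Fragment of the new job: generate actor builders for it, using the
        // distribution decided by the scheduler.
        EitherFragment::Building(_building) => { /* create actor builders */ }
        // Existing upstream fragment (e.g. the upstream Materialize): reuse
        // its known distribution and actor ids; only new dispatchers will be
        // added to it.
        EitherFragment::Existing(_existing) => { /* look up existing actors */ }
    }
}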

// TODO: remove this after scheduler refactoring.
pub fn into_inner(self) -> StreamFragmentGraph {
self.graph
/// Generate topological order of **all** fragments in this graph, including the ones that are
@BugenZhao (author) commented:

The interfaces below used to belong to the FragmentGraph. They're now modified to operate on the CompleteGraph, where the existing fragments are also taken into account.

/// This fragment is singleton, and should be scheduled to the default parallel unit.
DefaultSingleton,
/// This fragment is hash-distributed, and should be scheduled by the default hash mapping.
DefaultHash,
@BugenZhao (author) commented:

Instead of feeding the default distribution into the datalog, we make the result more explicit and let the caller apply the default.
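
A minimal sketch (simplified stand-in names) of how the caller might apply the defaults that the rules deliberately leave open; the real conversion shows up in the snippet on lines +314 to +316 further down:

// Stand-in types for illustration only.
type ParallelUnitId = u32;
type HashMapping = Vec<ParallelUnitId>;

enum Distribution {
    Singleton(ParallelUnitId),
    Hash(HashMapping),
}

// What the scheduling rules emit: either an explicit requirement or an
// explicit "use the default" marker.
enum ScheduleResult {
    Required(Distribution),
    DefaultSingleton,
    DefaultHash,
}

fn apply_default(
    result: ScheduleResult,
    default_singleton: ParallelUnitId,
    default_mapping: &HashMapping,
) -> Distribution {
    match result {
        ScheduleResult::Required(dist) => dist,
        ScheduleResult::DefaultSingleton => Distribution::Singleton(default_singleton),
        ScheduleResult::DefaultHash => Distribution::Hash(default_mapping.clone()),
    }
}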

pub building_locations: Locations,

/// The locations of the existing actors, essentially the upstream mview actors to update.
pub existing_locations: Locations,
@BugenZhao (author) commented:

This is derived from the ActorGraphBuilder, and can be directly used for RPCs like building hanging channels or broadcasting actor info.
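
A minimal sketch (simplified stand-in types; the real Locations stores richer info) of how such locations can feed an actor-info broadcast directly:

use std::collections::{BTreeMap, HashMap};

// Stand-in types for illustration only.
type ActorId = u32;
type WorkerId = u32;

struct WorkerNode { host: String }
struct ActorInfo { actor_id: ActorId, host: String }

struct Locations {
    // actor id -> id of the worker node hosting it
    actor_locations: BTreeMap<ActorId, WorkerId>,
    // worker id -> worker node metadata
    worker_locations: HashMap<WorkerId, WorkerNode>,
}

impl Locations {
    // The payload of a "broadcast actor info" style RPC can be derived
    // directly from the recorded locations.
    fn actor_infos(&self) -> Vec<ActorInfo> {
        self.actor_locations
            .iter()
            .map(|(&actor_id, worker_id)| ActorInfo {
                actor_id,
                host: self.worker_locations[worker_id].host.clone(),
            })
            .collect()
    }
}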

@@ -14,32 +14,37 @@

@BugenZhao (author) commented:

It'll be better to review this file from ~L970 (stream graph), then ~L600 (actor graph), then from the beginning (actor builder). 🥺

I'm going to split it into multiple files in follow-up PRs.

@BugenZhao marked this pull request as ready for review on February 6, 2023, 05:56
@BugenZhao (author) commented:

I'm still fixing the unit tests and some minor issues, but the e2e tests run pretty well. Feel free to review the main part of this huge PR. 🥵


codecov bot commented Feb 7, 2023

Codecov Report

Merging #7659 (96df668) into main (adae2d2) will increase coverage by 0.07%.
The diff coverage is 75.15%.

@@            Coverage Diff             @@
##             main    #7659      +/-   ##
==========================================
+ Coverage   71.62%   71.69%   +0.07%     
==========================================
  Files        1098     1098              
  Lines      175119   174524     -595     
==========================================
- Hits       125421   125126     -295     
+ Misses      49698    49398     -300     
Flag Coverage Δ
rust 71.69% <75.15%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/common/src/hash/consistent_hash/mapping.rs 84.68% <0.00%> (-3.82%) ⬇️
...ntend/src/optimizer/plan_node/stream_index_scan.rs 94.48% <ø> (-0.04%) ⬇️
...ntend/src/optimizer/plan_node/stream_table_scan.rs 97.02% <ø> (-0.02%) ⬇️
...tend/src/stream_fragmenter/graph/fragment_graph.rs 96.87% <ø> (-0.05%) ⬇️
src/frontend/src/stream_fragmenter/mod.rs 73.56% <ø> (-0.16%) ⬇️
...ontend/src/stream_fragmenter/rewrite/delta_join.rs 97.89% <ø> (-0.06%) ⬇️
src/meta/src/manager/catalog/fragment.rs 30.36% <ø> (-0.90%) ⬇️
src/meta/src/manager/catalog/user.rs 93.82% <ø> (ø)
src/meta/src/manager/id.rs 92.85% <ø> (ø)
src/meta/src/rpc/service/ddl_service.rs 0.00% <0.00%> (ø)
... and 22 more


Comment on lines +314 to +316
Result::DefaultSingleton => {
Distribution::Singleton(self.default_singleton_parallel_unit)
}
@chenzl25 (Contributor) commented on Feb 7, 2023:

Should we use different parallel units for a streaming job with multiple singleton fragments, instead of a fixed one? For example, this SQL has 3 singleton fragments: create materialized view v as select count(*) from t1 union all select count(*) from t2 union all select count(*) from t3;

@BugenZhao (author) replied:

Good catch. I've also thought of this, but I'm still trying to find a way to represent it in crepe: the current rules only propagate the requirement through no-shuffle edges and don't go back to check whether the final result is still correct. If we assign different parallel units to them, it's possible that some no-shuffle constraints are violated... Maybe we need a SameDist relation?

@chenzl25 (Contributor) replied:

It seems like a tough task to assign different parallel units to different singleton fragments under the NoShuffle restrictions. When we assign a parallel unit to a singleton fragment, we need to propagate it to the other fragments connected by NoShuffle edges before we can assign a different parallel unit to another singleton fragment, but crepe might not be able to guarantee this assignment ordering. It could assign two different parallel units before propagation and then violate the NoShuffle restrictions.

@chenzl25 (Contributor) replied:

Or we could run the crepe scheduling in a loop (instead of a single crepe run): assign a parallel unit to one Result::DefaultSingleton, propagate it through the NoShuffle edges, and repeat until every fragment gets its specific parallel unit. But that would be more complicated.

@BugenZhao (author) replied:

Agree. For now, assuming that the singleton executors won't process too much data and there won't be too many singletons in a graph, the current implementation just works.

As long as we extract the scheduling of the whole graph into a separate step like this, we can benefit from the global knowledge of scheduling and make the following steps easier to implement. The algorithm can be optimized or replaced in the future if necessary.
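
To make the discussion above concrete, here is a tiny, hypothetical crepe sketch (not the actual rules in this PR; the fragment ids and parallel unit are made up) showing how a pinned parallel unit propagates along no-shuffle edges, which is exactly why assigning distinct units to several default singletons up front could violate a constraint that only shows up after propagation:

use crepe::crepe;

crepe! {
    // An edge that requires identical distribution on both sides.
    @input
    struct NoShuffle(u32, u32); // (upstream fragment, downstream fragment)

    // A hard placement requirement, e.g. from an existing singleton upstream.
    @input
    struct Requirement(u32, u32); // (fragment, parallel unit)

    // The parallel unit each fragment ends up pinned to.
    @output
    struct Pinned(u32, u32);

    Pinned(f, p) <- Requirement(f, p);
    // Propagate the pin along no-shuffle edges, in both directions.
    Pinned(down, p) <- Pinned(up, p), NoShuffle(up, down);
    Pinned(up, p) <- Pinned(down, p), NoShuffle(up, down);
}

fn main() {
    let mut runtime = Crepe::new();
    runtime.extend(&[NoShuffle(1, 2), NoShuffle(2, 3)]);
    runtime.extend(&[Requirement(1, 42)]);

    let (pinned,) = runtime.run();
    // Fragments 1, 2 and 3 all end up pinned to parallel unit 42.
    for Pinned(fragment, unit) in pinned {
        println!("fragment {} -> parallel unit {}", fragment, unit);
    }
}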

@chenzl25 (Contributor) left a review:

Looks so great to me!!! The current scheduler is much clearer and easier to maintain than the previous one. Your work is excellent, thank you.

@yezizp2012 (Member) left a review:

Looks great to me!!! What excellent work!!!

mergify bot merged commit 9068a4e into main on Feb 7, 2023
mergify bot deleted the bz/new-scheduler-part-2 branch on February 7, 2023, 10:13
mergify bot pushed a commit that referenced this pull request on Feb 16, 2023:
This PR extends the new actor graph builder introduced in #7659 to support replacing a table plan, and adds the procedure for preparing the `TableFragments`. Note that the RPC with the compute nodes is not implemented yet, so this is not utilized by the frontend.

Building an actor graph for schema change is similar and symmetric to MV on MV:
- For MV on MV, we have existing upstream `Materialize` fragments that impose requirements on the distribution of the current `Chain`. We generate new dispatchers for the upstream `Materialize` in the result.
- For replacing a table plan (schema change), we may have existing downstream `Chain` fragments that impose requirements on the distribution of the current `Materialize`. We generate merger updates for the downstream `Chain` in the result.

Related:
- #7908

Approved-By: chenzl25
Approved-By: yezizp2012