Moving PipelineFixer above all rules to use ExecutionPlan APIs #5880

Merged: 15 commits merged into apache:main from feature/repartition-for-stream on Apr 6, 2023
Conversation

@metesynnada (Contributor) commented Apr 5, 2023

Which issue does this PR close?

Closes #5878.

Rationale for this change

Since some optimizer rules depend strongly on each other, rule ordering matters. PipelineFixer (and possibly more rules in the future) can replace an operator at some level of the ExecutionPlan, and the new operator can report a different set of properties (ordering, distribution, and so on).

I suggest running the executor-changing rules before the rules that enforce sorting, distribution, etc.
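
To make the ordering issue concrete, here is a minimal, self-contained sketch. It does not use DataFusion's actual PhysicalOptimizerRule API; all names and types are illustrative. The point is that each rule sees the plan produced by the previous one, so an operator-swapping rule must run before the rules that read the swapped-in operator's properties.

// Illustrative only: a toy rule pipeline, not DataFusion's optimizer API.
// Each "rule" transforms a plan; the order of application determines which
// operator the later rules observe.

type Plan = String; // stand-in for Arc<dyn ExecutionPlan>

struct Rule {
    name: &'static str,
    apply: fn(Plan) -> Plan,
}

fn optimize(mut plan: Plan, rules: &[Rule]) -> Plan {
    for rule in rules {
        plan = (rule.apply)(plan);
        println!("after {}: {}", rule.name, plan);
    }
    plan
}

fn main() {
    // Running the executor-changing rule ("PipelineFixer") first means the
    // repartitioning/sorting rules see the final operator, not the original.
    let rules = [
        Rule {
            name: "PipelineFixer",
            apply: |p| p.replace("HashJoinExec", "SymmetricHashJoinExec"),
        },
        Rule {
            name: "Repartition",
            // A real rule would consult the plan's properties here.
            apply: |p| p,
        },
    ];
    optimize("HashJoinExec".to_string(), &rules);
}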

If the sources are also sorted, we want to preserve the ordering information without adding an additional SortExec. However, the current planner produces:

[
    "SymmetricHashJoinExec: join_type=Full, on=[(Column { name: \"a2\", index: 1 }, Column { name: \"a2\", index: 1 })], filter=BinaryExpr { left: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Gt, right: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 1 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Plus, right: Literal { value: Int64(3) } } }, op: And, right: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Lt, right: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 1 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Plus, right: Literal { value: Int64(10) } } } }",
    "  CoalesceBatchesExec: target_batch_size=8192",
    "    RepartitionExec: partitioning=Hash([Column { name: \"a2\", index: 1 }], 8), input_partitions=8",
    "      RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1",
    "        CsvExec: files={1 group: [[private/var/folders/rf/dhj0s83d57l2_m51k2dmd_ch0000gn/T/.tmpcRbDJD/left.csv]]}, has_header=false, limit=None, projection=[a1, a2]",
    "  CoalesceBatchesExec: target_batch_size=8192",
    "    RepartitionExec: partitioning=Hash([Column { name: \"a2\", index: 1 }], 8), input_partitions=8",
    "      RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1",
    "        CsvExec: files={1 group: [[private/var/folders/rf/dhj0s83d57l2_m51k2dmd_ch0000gn/T/.tmpcRbDJD/right.csv]]}, has_header=false, limit=None, projection=[a1, a2]",
]

and, unfortunately, the two consecutive RepartitionExec operators destroy the ordering information. To prevent this, SymmetricHashJoinExec (SHJ) specifically sets benefits_from_input_partitioning to false; however, this is ineffective because the RoundRobin RepartitionExec is added before the HashJoin -> SymmetricHashJoin replacement takes place.

If we move the Repartition rule below PipelineFixer, we can use SymmetricHashJoinExec's benefits_from_input_partitioning API effectively:

[
    "SymmetricHashJoinExec: join_type=Full, on=[(Column { name: \"a2\", index: 1 }, Column { name: \"a2\", index: 1 })], filter=BinaryExpr { left: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Gt, right: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 1 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Plus, right: Literal { value: Int64(3) } } }, op: And, right: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Lt, right: BinaryExpr { left: CastExpr { expr: Column { name: \"a1\", index: 1 }, cast_type: Int64, cast_options: CastOptions { safe: false } }, op: Plus, right: Literal { value: Int64(10) } } } }",
    "  CoalesceBatchesExec: target_batch_size=8192",
    "    RepartitionExec: partitioning=Hash([Column { name: \"a2\", index: 1 }], 8), input_partitions=1",
    "      CsvExec: files={1 group: [[private/var/folders/rf/dhj0s83d57l2_m51k2dmd_ch0000gn/T/.tmpdTwdrk/left.csv]]}, has_header=false, limit=None, projection=[a1, a2]",
    "  CoalesceBatchesExec: target_batch_size=8192",
    "    RepartitionExec: partitioning=Hash([Column { name: \"a2\", index: 1 }], 8), input_partitions=1",
    "      CsvExec: files={1 group: [[private/var/folders/rf/dhj0s83d57l2_m51k2dmd_ch0000gn/T/.tmpdTwdrk/right.csv]]}, has_header=false, limit=None, projection=[a1, a2]",
]
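
The interaction can be summarized with a small sketch. Again, this is illustrative: a toy trait stands in for ExecutionPlan, and the two functions stand in for the PipelineFixer and Repartition rules. The Repartition step only adds a RoundRobin repartition when the operator claims to benefit from it, so the outcome depends on whether the HashJoin -> SymmetricHashJoin swap has already happened.

// Illustrative only: toy types standing in for DataFusion operators.

trait Operator {
    // Mirrors the idea behind ExecutionPlan::benefits_from_input_partitioning:
    // whether adding RoundRobin repartitioning above this operator helps.
    fn benefits_from_input_partitioning(&self) -> bool;
}

struct HashJoin;
struct SymmetricHashJoin;

impl Operator for HashJoin {
    fn benefits_from_input_partitioning(&self) -> bool {
        true
    }
}

impl Operator for SymmetricHashJoin {
    // The symmetric join wants to preserve its inputs' ordering, so it opts out.
    fn benefits_from_input_partitioning(&self) -> bool {
        false
    }
}

// Stand-in for PipelineFixer: swaps the join implementation.
fn pipeline_fixer(_op: Box<dyn Operator>) -> Box<dyn Operator> {
    Box::new(SymmetricHashJoin)
}

// Stand-in for the Repartition rule: reports whether RoundRobin would be added.
fn repartition_adds_round_robin(op: &dyn Operator) -> bool {
    op.benefits_from_input_partitioning()
}

fn main() {
    // Old order: Repartition runs first and sees HashJoin, so RoundRobin is
    // added; the later swap to SymmetricHashJoin cannot undo it.
    assert!(repartition_adds_round_robin(&HashJoin));

    // New order: PipelineFixer runs first, so Repartition consults the
    // symmetric join and skips the RoundRobin repartition, preserving order.
    let plan = pipeline_fixer(Box::new(HashJoin));
    assert!(!repartition_adds_round_robin(plan.as_ref()));
}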

What changes are included in this PR?

The optimizer rules are reordered so that the benefits_from_input_partitioning API can take effect. If we run PipelineFixer before (almost) everything else, the optimizer can leverage the APIs of the replaced executors.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions bot added the core (Core DataFusion crate) label on Apr 5, 2023
@alamb (Contributor) left a comment

makes sense to me -- thank you @metesynnada

@alamb merged commit 4f40070 into apache:main on Apr 6, 2023
@metesynnada deleted the feature/repartition-for-stream branch on April 7, 2023