Feature/sort enforcement refactor #5228

mustafasrepo · 2023-02-09T16:57:38Z

Which issue does this PR close?

Closes #.

Rationale for this change

As suggested by @mingmwang in the discussion, comparing orderings between schemas when there are multiple executors in between them may produce incorrect results. Although this situation may not arise in practice with the current ordering of the optimization rules, we have changed the implementation in case it does come up.

What changes are included in this PR?

With this change, the parallelize sort rule compares orderings only between immediate executors when converting a "Coalesce + Executors that do not require SingleDistribution + Sort" cascade into an "Executors that do not require SingleDistribution + Sort + SortPreservingMerge" cascade.
We also added a check that the final Sort in the cascade is actually a global sort.
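
To make the shape of the rewrite concrete, here is a minimal sketch on a toy plan type. Everything in it (`Plan`, `remove_coalesce`, `parallelize_sort`, the `preserve_partitioning` flag name) is a hypothetical stand-in chosen for illustration; it is not the code in `sort_enforcement.rs` and does not use DataFusion's `ExecutionPlan` API. `Filter` stands in for any executor that does not require a single partition.

```rust
// Toy model only: hypothetical types, not DataFusion's ExecutionPlan API.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Source { partitions: usize },
    // Stand-in for any executor that does not require a single partition.
    Filter(Box<Plan>),
    CoalescePartitions(Box<Plan>),
    // `preserve_partitioning == false` models a global sort.
    Sort { input: Box<Plan>, preserve_partitioning: bool },
    SortPreservingMerge(Box<Plan>),
}

/// Walks down through partition-agnostic executors and drops the first
/// CoalescePartitions it meets. Returns None if the chain contains no coalesce.
fn remove_coalesce(plan: &Plan) -> Option<Plan> {
    match plan {
        Plan::CoalescePartitions(input) => Some((**input).clone()),
        Plan::Filter(input) => remove_coalesce(input).map(|p| Plan::Filter(Box::new(p))),
        _ => None,
    }
}

/// Rewrites `Coalesce -> ... -> global Sort` (bottom to top) into
/// `... -> per-partition Sort -> SortPreservingMerge`.
/// Only fires when the top Sort is a global sort.
fn parallelize_sort(plan: Plan) -> Plan {
    match plan {
        Plan::Sort { input, preserve_partitioning: false } => match remove_coalesce(&input) {
            Some(new_input) => Plan::SortPreservingMerge(Box::new(Plan::Sort {
                input: Box::new(new_input),
                preserve_partitioning: true,
            })),
            // No coalesce below the sort: leave the plan unchanged.
            None => Plan::Sort { input, preserve_partitioning: false },
        },
        // Not a global sort: the rule does not apply.
        other => other,
    }
}
```

In this toy model, a global `Sort` over `Filter` over `CoalescePartitions` becomes `SortPreservingMerge` over a per-partition `Sort` over `Filter`, while a `Sort` that already preserves partitioning is returned unchanged.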

Are these changes tested?

We added a unit test to verify that the rule only ends when the Sort is a global sort. Also, the existing tests check that the sort comparison is done between immediate executors.

Are there any user-facing changes?

ozankabak · 2023-02-09T17:39:14Z

For context, this is the first follow-on PR to address one of the minor issues that were discussed in the original PR (#5171). @mingmwang, PTAL. Thanks!

alamb

I am not an expert in this code, so I would really like @mingmwang to review as well prior to merge.

However, I went over the plan changes and the new test, and they all make sense to me. That, together with the fact that the existing tests all pass, means I am 👍 for this PR.

Thank you @mustafasrepo

cc @crepererum who may also be interested in this PR as we are reworking some of our plan construction in IOx as well

alamb · 2023-02-11T13:17:14Z

datafusion/core/src/physical_optimizer/sort_enforcement.rs

+            "    RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=10",
+            "      RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=0",
+            "        MemoryExec: partitions=0, partition_sizes=[]",


This plan definitely looks better than the input.

alamb · 2023-02-11T13:18:06Z

datafusion/core/src/physical_optimizer/sort_enforcement.rs

-            "  FilterExec: NOT non_nullable_col@1",
-            "    SortExec: [nullable_col@0 ASC]",
+            "  SortExec: [nullable_col@0 ASC]",
+            "    FilterExec: NOT non_nullable_col@1",


This plan change looks better to me as well (do filtering before sort)
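
As a rough illustration of why this ordering is preferable (in the indented plan, `FilterExec` is the child of `SortExec`, so it runs first), the sketch below uses plain vectors rather than DataFusion code. The `Row` alias and both function names are hypothetical, and the nullable column is modeled as a plain `i32` with no nulls. Both orders return the same rows, but filtering first means the sort only handles the rows that survive the filter.

```rust
// Toy illustration with plain vectors; not DataFusion code.
// Each row is (nullable_col, non_nullable_col), mirroring the plan's column names.
type Row = (i32, bool);

/// Sort first, then filter: the sort touches every input row.
/// Returns the output rows and how many rows were sorted.
fn sort_then_filter(mut rows: Vec<Row>) -> (Vec<Row>, usize) {
    let sorted_rows = rows.len();
    rows.sort_by_key(|r| r.0);
    rows.retain(|r| !r.1); // NOT non_nullable_col
    (rows, sorted_rows)
}

/// Filter first, then sort: the sort only touches rows that pass the filter.
/// Returns the output rows and how many rows were sorted.
fn filter_then_sort(mut rows: Vec<Row>) -> (Vec<Row>, usize) {
    rows.retain(|r| !r.1); // NOT non_nullable_col
    let sorted_rows = rows.len();
    rows.sort_by_key(|r| r.0);
    (rows, sorted_rows)
}
```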

mingmwang · 2023-02-13T02:36:41Z

I will review this PR carefully today.

mingmwang · 2023-02-13T14:40:21Z

The change LGTM.

alamb · 2023-02-13T19:03:06Z

Thanks @mustafasrepo @mingmwang and @ozankabak 🚀

ursabot · 2023-02-13T19:14:25Z

Benchmark runs are scheduled for baseline = 3da7902 and contender = 9565887. 9565887 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

* Remove multilayer chain Ordering comparison for sort parallelize rule
* Update tree code
* Simplify if condition
* Update test
* Simplify sort insertion utility to avoid clones

Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

mustafasrepo added 2 commits February 9, 2023 17:44

Remove multilayer chain Ordering comparison for sort parallelize rule

157e204

Update tree code

e06b6c1

github-actions bot added the core Core DataFusion crate label Feb 9, 2023

mustafasrepo and others added 3 commits February 9, 2023 20:55

Simplify if condition

e0e86c6

Update test

61b679e

Simplify sort insertion utility to avoid clones

4ea09ce

alamb approved these changes Feb 11, 2023

alamb merged commit 9565887 into apache:master Feb 13, 2023

mustafasrepo deleted the feature/sort_enforcement_refactor branch March 2, 2023 12:18
