Update `ExecutionPlan` to know about sortedness and repartitioning optimizer pass respect the invariants #1776

alamb · 2022-02-07T20:55:55Z

~~Draft until I have completed testing downstream in IOx~~

Which issue does this PR close?

Closes #424 ( Design how to respect output stream ordering )

Along with #1732, fixes #423 (the last part).

Rationale for this change

Repartitioning the input to an operator that relies on its input being sorted is incorrect, as the repartitioning will intermix the partitions and effectively "unsort" the input stream.

We found this in IOx here https://github.com/influxdata/influxdb_iox/pull/3633#issuecomment-1030126757

Here is a picture showing the problem:

    ┌─────────────────────────────────┐
    │                                 │
    │     SortPreservingMergeExec     │
    │                                 │
    └─────────────────────────────────┘
              ▲      ▲       ▲            Input may not
              │      │       │             be sorted!
      ┌───────┘      │       └───────┐
      │              │               │
      │              │               │
┌───────────┐  ┌───────────┐   ┌───────────┐
│           │  │           │   │           │
│ batch A1  │  │ batch A3  │   │ batch B3  │
│           │  │           │   │           │
├───────────┤  ├───────────┤   ├───────────┤
│           │  │           │   │           │
│ batch B2  │  │ batch B1  │   │ batch A2  │
│           │  │           │   │           │
└───────────┘  └───────────┘   └───────────┘
      ▲              ▲               ▲
      │              │               │
      └─────────┐    │    ┌──────────┘
                │    │    │                  Outputs
                │    │    │                batches are
    ┌─────────────────────────────────┐   repartitioned
    │       RepartitionExec(3)        │    and may not
    │           RoundRobin            │   remain sorted
    │                                 │
    └─────────────────────────────────┘
                ▲         ▲
                │         │                Inputs are
          ┌─────┘         └─────┐            sorted
          │                     │
          │                     │
          │                     │
    ┌───────────┐         ┌───────────┐
    │           │         │           │
    │ batch A1  │         │ batch B1  │
    │           │         │           │
    ├───────────┤         ├───────────┤
    │           │         │           │
    │ batch A2  │         │ batch B2  │
    │           │         │           │
    ├───────────┤         ├───────────┤
    │           │         │           │
    │ batch A3  │         │ batch B3  │
    │           │         │           │
    └───────────┘         └───────────┘


     Sorted Input          Sorted Input
           A                     B

The streams need to remain the way they were:

┌─────────────────────────────────┐
│                                 │
│     SortPreservingMergeExec     │
│                                 │
└─────────────────────────────────┘
            ▲         ▲
            │         │         Inputs are
      ┌─────┘         └─────┐   sorted, as
      │                     │    required
      │                     │
      │                     │
┌───────────┐         ┌───────────┐
│           │         │           │
│ batch A1  │         │ batch B1  │
│           │         │           │
├───────────┤         ├───────────┤
│           │         │           │
│ batch A2  │         │ batch B2  │
│           │         │           │
├───────────┤         ├───────────┤
│           │         │           │
│ batch A3  │         │ batch B3  │
│           │         │           │
└───────────┘         └───────────┘


 Sorted Input          Sorted Input
       A                     B
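To make the first picture concrete, here is a minimal toy sketch (not DataFusion code) of round-robin distribution. The `round_robin` and `is_sorted` helpers are hypothetical names, and each integer stands in for a batch; the only assumption is that batches are handed to output partitions in arrival order.

```rust
/// Toy stand-in for round-robin repartitioning: batches are handed to the
/// output partitions one after another, in arrival order.
fn round_robin(arrivals: &[i32], partitions: usize) -> Vec<Vec<i32>> {
    let mut out = vec![Vec::new(); partitions];
    for (i, &batch) in arrivals.iter().enumerate() {
        out[i % partitions].push(batch);
    }
    out
}

/// True if the values are in non-decreasing order.
fn is_sorted(values: &[i32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```

If sorted input A (`1, 5, 9`) arrives before sorted input B (`2, 3, 4`), `round_robin(&[1, 5, 9, 2, 3, 4], 3)` yields `[[1, 2], [5, 3], [9, 4]]`: two of the three output partitions are no longer sorted, even though both inputs were.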

What changes are included in this PR?

1. Add several "metadata" functions to ExecutionPlan that describe its sortedness and the invariants required for its input
2. Teach the repartitioning optimizer pass to respect the invariants

Are there any user-facing changes?

Yes: All ExecutionPlans are now required to implement output_ordering as described by @andygrove here #424 (comment)

The rationale for not providing a default implementation (None) was to force anyone who implements ExecutionPlan to think about sort orders. If they do not, (very!) subtle bugs are possible as DataFusion starts to rely more on sortedness for optimizations.
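As a rough illustration of what "no default implementation" means for implementors, here is a simplified sketch. `Plan`, `SortExpr`, `UnorderedScan` and `Sort` are hypothetical stand-ins for `ExecutionPlan`, `PhysicalSortExpr` and real operators; only the shape of `output_ordering` follows the signature shown in this PR.

```rust
/// Hypothetical stand-in for DataFusion's `PhysicalSortExpr`.
#[derive(Debug, Clone, PartialEq)]
struct SortExpr {
    column: String,
    descending: bool,
}

/// Simplified stand-in for `ExecutionPlan`: `output_ordering` has no default
/// body, so every implementor must state its ordering explicitly.
trait Plan {
    fn output_ordering(&self) -> Option<&[SortExpr]>;
}

/// An operator that makes no promise about the order of its output.
struct UnorderedScan;

impl Plan for UnorderedScan {
    fn output_ordering(&self) -> Option<&[SortExpr]> {
        None
    }
}

/// An operator that sorts its input and therefore reports that ordering.
struct Sort {
    exprs: Vec<SortExpr>,
}

impl Plan for Sort {
    fn output_ordering(&self) -> Option<&[SortExpr]> {
        Some(&self.exprs)
    }
}
```

Leaving `output_ordering` out of an `impl Plan for ...` block is a compile error in this sketch, which is the forcing function described above.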

cc @tustvold @Dandandan


alamb · 2022-02-07T21:00:03Z

datafusion/src/physical_optimizer/repartition.rs

 fn optimize_partitions(
    target_partitions: usize,
    plan: Arc<dyn ExecutionPlan>,
-    should_repartition: bool,
+    can_reorder: bool,


Here is the change to the repartition logic to not repartition if it would produce incorrect answers

alamb · 2022-02-07T21:00:30Z

datafusion/src/physical_optimizer/repartition.rs

    #[test]
    fn added_repartition_to_single_partition() -> Result<()> {
-        let optimizer = Repartition {};
+        let plan = hash_aggregate(parquet_exec());


I cleaned up the tests here to reduce the ceremony of invoking the optimizer. The plans are all the same

alamb · 2022-02-07T21:00:59Z

datafusion/src/physical_optimizer/repartition.rs

        Ok(())
    }

    #[test]
-    fn repartition_ignores_limit() -> Result<()> {
-        let optimizer = Repartition {};
+    fn repartition_unsorted_limit() -> Result<()> {


new plans showing that data isn't repartitioned below limits if sorts are present

These tests are 👌

alamb · 2022-02-07T21:01:24Z

datafusion/src/physical_optimizer/repartition.rs

+            "GlobalLimitExec: limit=100",
+            "LocalLimitExec: limit=100",
+            "FilterExec: c1@0",
+            // data is sorted so can't repartition here even though


However, once you put a sort here then repartitioning can't happen without potentially getting wrong results

alamb · 2022-02-07T21:02:05Z

datafusion/src/physical_plan/analyze.rs

@@ -82,6 +83,10 @@ impl ExecutionPlan for AnalyzeExec {
        Partitioning::UnknownPartitioning(1)
    }

+    fn output_ordering(&self) -> Option<&[PhysicalSortExpr]> {


Having to sprinkle output_ordering around was annoying -- but I think it may be worth it to try and avoid some nasty bugs.

Agree, makes sense to be explicit

alamb · 2022-02-07T21:02:43Z

datafusion/src/physical_plan/limit.rs

@@ -300,11 +335,6 @@ impl ExecutionPlan for LocalLimitExec {
            _ => Statistics::default(),
        }
    }
-
-    fn should_repartition_children(&self) -> bool {


this is effectively renamed to benefits_from_input_partitioning

alamb · 2022-02-07T21:03:14Z

datafusion/src/physical_plan/mod.rs

@@ -147,24 +147,59 @@ pub trait ExecutionPlan: Debug + Send + Sync {
        Distribution::UnspecifiedDistribution
    }

-    /// Returns `true` if the direct children of this `ExecutionPlan` should be repartitioned
-    /// to introduce greater concurrency to the plan
+    /// Returns `true` if this operator relies on its inputs being


Here is the new API for ExecutionPlan that signals how / when repartitioning occurs

tustvold · 2022-02-08T13:51:30Z

datafusion/src/physical_plan/windows/window_agg_exec.rs

@@ -114,6 +115,14 @@ impl ExecutionPlan for WindowAggExec {
        self.input.output_partitioning()
    }

+    fn maintains_input_order(&self) -> bool {


Shouldn't this also have relies_on_input_order?

tustvold

I've spent a depressingly long time staring at this, and I think it is correct - nice work 👍.

However, I am a little bit uncertain about output_ordering. My understanding is it is present to allow repartitioning of branches with order-sensitive operators, such as limit, but no explicit order.

I worry that this will lead to two classes of hard to track down bugs:

1. ExecutionPlans that incorrectly report None for output_ordering
2. Plans that make assumptions about ordering without encoding this into DataFusion

An example of 2. might be a plan that scans a sorted file, without the file itself exposing to DataFusion that it is sorted.

I guess I just wonder if this is really worth the potential headaches 😅


tustvold · 2022-02-08T14:02:59Z

datafusion/src/physical_plan/limit.rs

@@ -232,6 +249,24 @@ impl ExecutionPlan for LocalLimitExec {
        self.input.output_partitioning()
    }

+    fn relies_on_input_order(&self) -> bool {
+        self.input.output_ordering().is_some()


This feels like an optimization that really belongs in the Repartition optimizer, namely that if the children of a plan don't have a sort order, you can freely repartition them even if the parent relies_on_input_order.

tustvold · 2022-02-08T14:05:46Z

datafusion/src/physical_optimizer/repartition.rs

+                (false, false) => {
+                    // `plan` may reorder the input itself, so no need
+                    // to preserve the order of any children
+                    true


I think this has lost the requires_single_partition case, that being said I'm not sure why this matters? A CoalesceBatches will just be inserted? Perhaps would_benefit should be set to false if this requires a single partition, as this won't propagate beyond the direct children? 🤔

Interestingly, when I add the requires_single_partition case here, I get test failures such as:

    [
        "SortPreservingMergeExec: [c1@0 ASC]",
        "SortExec: [c1@0 ASC]",
        "ProjectionExec: expr=[c1@0 as c1]",
        "RepartitionExec: partitioning=RoundRobinBatch(10)",
        "ParquetExec: limit=None, partitions=[x]",
    ]
    actual: [
        "SortPreservingMergeExec: [c1@0 ASC]",
        "SortExec: [c1@0 ASC]",
        "ProjectionExec: expr=[c1@0 as c1]",
        "ParquetExec: limit=None, partitions=[x]",
    ]

In other words, repartitioning doesn't happen two levels down.

So rather than intermix the "should we bother repartitioning" with the "would it produce wrong answers" I simply removed the check for the required input partitioning and it is now included in the default "benefits from repartitioning check"

tustvold · 2022-02-08T14:17:03Z

datafusion/src/physical_optimizer/repartition.rs

+/// can not be reordered as as something upstream is relying on that order
+///
+/// If 'would_benefit` is false, the upstream operator doesn't
+///  benefit from additional reordering


Suggested change

-///  benefit from additional reordering
+///  benefit from additional partitioning

tustvold · 2022-02-08T14:32:05Z

datafusion/src/physical_plan/mod.rs

+    /// The default implementation returns `true`
+    fn benefits_from_input_partitioning(&self) -> bool {
+        // give me MOAR CPUs
+        true


Why did you remove the required_child_distribution?

I was trying to separate the notions of correctness from possible optimizations; However when I type out the rationale it doesn't really hold up; I will put it back.
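For readers following along, one way a default `benefits_from_input_partitioning` could take the required child distribution into account is sketched below. This is an assumption about the shape of such a default, not a copy of the code in this PR; `Plan`, `Projection` and `GlobalLimit` are simplified stand-ins, and the `Distribution` enum is reduced to two variants.

```rust
/// Simplified stand-in for DataFusion's `Distribution`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Distribution {
    Unspecified,
    SinglePartition,
}

/// Simplified stand-in for `ExecutionPlan`.
trait Plan {
    fn required_child_distribution(&self) -> Distribution {
        Distribution::Unspecified
    }

    /// Assumed default: more input partitions help, unless the operator
    /// needs its input collapsed into a single partition anyway.
    fn benefits_from_input_partitioning(&self) -> bool {
        !matches!(
            self.required_child_distribution(),
            Distribution::SinglePartition
        )
    }
}

/// Keeps both defaults.
struct Projection;
impl Plan for Projection {}

/// Requires a single input partition, so the default reports no benefit.
struct GlobalLimit;
impl Plan for GlobalLimit {
    fn required_child_distribution(&self) -> Distribution {
        Distribution::SinglePartition
    }
}
```

With this shape, an operator only has to override `required_child_distribution`; the "would it help" answer follows from it unless the operator overrides that as well.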

tustvold · 2022-02-08T14:36:03Z

datafusion/src/physical_optimizer/repartition.rs

@@ -36,33 +128,70 @@ impl Repartition {
    }
 }

+/// Recursively visits all `plan`s puts and then optionally adds a


For my own understanding I'm going to write out what this does.

It does a depth first scan of the tree, and repartitions any plan that:

Has fewer than the desired number of partitions

Has a direct parent that benefits_from_input_partitioning

Does not have a parent that relies_on_input_order unless there is an intervening node whose maintains_input_order is false

Has a direct parent that benefits_from_input_partitioning

I think this is `Has any parent that benefits_from_input_partitioning` unless there is an intervening node

otherwise yes. I will add this summary as a comment. Thank you
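The summary above can be sketched on a toy plan tree. This is a simplified model under stated assumptions, not the code in repartition.rs: `Node`, `make_node` and `optimize` are hypothetical names, the three boolean flags stand in for the trait methods, `would_benefit` is taken from the direct parent, and the root is assumed to be visited with `can_reorder = true` and `would_benefit = false`.

```rust
/// Toy plan node; the three flags mirror the trait methods discussed in this PR.
#[derive(Debug, Clone)]
struct Node {
    name: &'static str,
    /// `Some(n)` if this node fixes its output partition count (leaves must
    /// use this), `None` if it passes through the count of its first child.
    fixed_partitions: Option<usize>,
    relies_on_input_order: bool,
    maintains_input_order: bool,
    benefits_from_input_partitioning: bool,
    children: Vec<Node>,
}

impl Node {
    fn partitions(&self) -> usize {
        match self.fixed_partitions {
            Some(n) => n,
            None => self.children[0].partitions(),
        }
    }

    /// Operator names in top-down order, for easy comparison.
    fn names(&self) -> Vec<&'static str> {
        let mut out = vec![self.name];
        for child in &self.children {
            out.extend(child.names());
        }
        out
    }
}

/// Hypothetical helper; `flags` is (relies, maintains, benefits).
fn make_node(
    name: &'static str,
    fixed_partitions: Option<usize>,
    flags: (bool, bool, bool),
    children: Vec<Node>,
) -> Node {
    Node {
        name,
        fixed_partitions,
        relies_on_input_order: flags.0,
        maintains_input_order: flags.1,
        benefits_from_input_partitioning: flags.2,
        children,
    }
}

fn optimize(mut node: Node, target: usize, can_reorder: bool, would_benefit: bool) -> Node {
    let can_reorder_children = match (node.relies_on_input_order, node.maintains_input_order) {
        (true, _) => false,           // node needs its input order: hands off
        (false, false) => true,       // node reorders anyway: children are free
        (false, true) => can_reorder, // node passes order through: ask upstream
    };
    let benefits = node.benefits_from_input_partitioning;
    // Visit children first so repartitioning happens as low as possible.
    node.children = std::mem::take(&mut node.children)
        .into_iter()
        .map(|child| optimize(child, target, can_reorder_children, benefits))
        .collect();

    if would_benefit && can_reorder && node.partitions() < target {
        make_node("Repartition", Some(target), (false, false, false), vec![node])
    } else {
        node
    }
}
```

In this model an aggregate over a single-partition scan gets a `Repartition` inserted above the scan, while a `SortPreservingMerge` over a filter over the same scan is left untouched, because the filter passes the "do not reorder" requirement down to the scan.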

tustvold · 2022-02-08T14:37:27Z

datafusion/src/physical_optimizer/repartition.rs

        Ok(())
    }

    #[test]
-    fn repartition_ignores_limit() -> Result<()> {
-        let optimizer = Repartition {};
+    fn repartition_unsorted_limit() -> Result<()> {


These tests are 👌

tustvold · 2022-02-08T14:50:07Z

datafusion/src/physical_plan/mod.rs

+    /// such as automatically repartitioning correctly.
+    ///
+    /// The default implementation returns `false`
+    fn maintains_input_order(&self) -> bool {


I spent a long time trying to understand why there is both this and output_ordering and it is because this indicates if the operator preserves the order, not if that order is actually sorted 😅

Yes. I will make this clearer in the comments
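A small sketch of the distinction, using simplified stand-ins rather than the real trait: `Plan`, `Scan` and `Filter` are hypothetical, and sort keys are plain column names instead of `PhysicalSortExpr`.

```rust
/// Simplified stand-in for `ExecutionPlan`; sort keys are plain column names.
trait Plan {
    /// The sort order of the output, if one is known.
    fn output_ordering(&self) -> Option<&[String]>;

    /// Whether rows leave in the same relative order they arrived in.
    fn maintains_input_order(&self) -> bool {
        false
    }
}

/// A leaf that may or may not know that its data is sorted.
struct Scan {
    sorted_on: Option<Vec<String>>,
}

impl Plan for Scan {
    fn output_ordering(&self) -> Option<&[String]> {
        self.sorted_on.as_deref()
    }
}

/// A filter drops rows but never reorders them.
struct Filter {
    input: Box<dyn Plan>,
}

impl Plan for Filter {
    fn output_ordering(&self) -> Option<&[String]> {
        // Whatever ordering the input has (possibly none) survives filtering.
        self.input.output_ordering()
    }

    fn maintains_input_order(&self) -> bool {
        true
    }
}
```

A `Filter` over an unsorted `Scan` still reports `maintains_input_order() == true` while `output_ordering()` is `None`: the first describes the operator's behavior, the second describes the data.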

xudong963

I have carefully looked at this PR and its related issues. I think some related older issues can be solved by this PR. BTW, the tests are solid! 👍 @alamb

xudong963 · 2022-02-08T15:00:01Z

datafusion/src/physical_optimizer/repartition.rs

 fn optimize_partitions(
    target_partitions: usize,
    plan: Arc<dyn ExecutionPlan>,
-    should_repartition: bool,
+    can_reorder: bool,
+    would_benefit: bool,


I don't quite understand this variable. If `would_benefit` is false, the upstream operator doesn't benefit from additional reordering, but repartitioning wouldn't produce wrong results? So is it OK to repartition to benefit from higher parallelism? If so, I think the variable is unnecessary.

I noticed the doc comment on the benefits_from_input_partitioning function 👍; the variable makes sense to me now.


xudong963 · 2022-02-08T15:24:29Z

datafusion/src/physical_plan/mod.rs

+    /// parallelism may outweigh any benefits
+    ///
+    /// The default implementation returns `true`
+    fn benefits_from_input_partitioning(&self) -> bool {


I noticed that sort, limit, and union return false, so how is the result decided? In other words, how do we decide that the overhead of extra parallelism may outweigh any benefits? Is this an empirical estimate?

I think it defaults to true, actually.

Perhaps you mean: why do sort, limit, and union override the default maintains_input_order? If so, the reason is that I know how they are implemented. The code on master makes the same assumption, FWIW, but after this PR the assumption is explicit:

    fn maintains_input_order(&self) -> bool {
        // tell optimizer this operator doesn't reorder its input
        true
    }


alamb · 2022-02-08T17:07:37Z

However, I am a little bit uncertain about output_ordering. My understanding is it is present to allow repartitioning of branches with order-sensitive operators, such as limit, but no explicit order.

I think that is correct. The specific case where output_ordering is currently required to get correct results is distinguishing between

Limit
Filter
Scan

And

Limit
Sort
Scan
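A rough sketch of how those two plans differ from the limit's point of view, following the `relies_on_input_order` body shown in the limit.rs hunk of this PR. Everything else is a simplified stand-in: `Plan`, `Scan`, `Filter`, `Sort` and `Limit` are hypothetical types and sort keys are plain column names.

```rust
/// Simplified stand-in for `ExecutionPlan`; sort keys are plain column names.
trait Plan {
    fn output_ordering(&self) -> Option<&[String]>;

    /// Conservative default: assume the input order matters.
    fn relies_on_input_order(&self) -> bool {
        true
    }
}

struct Scan;
impl Plan for Scan {
    fn output_ordering(&self) -> Option<&[String]> {
        None
    }
    fn relies_on_input_order(&self) -> bool {
        false
    }
}

struct Filter {
    input: Box<dyn Plan>,
}
impl Plan for Filter {
    fn output_ordering(&self) -> Option<&[String]> {
        self.input.output_ordering()
    }
    fn relies_on_input_order(&self) -> bool {
        false
    }
}

struct Sort {
    // Kept only to show the plan shape; this sketch never reads it.
    #[allow(dead_code)]
    input: Box<dyn Plan>,
    keys: Vec<String>,
}
impl Plan for Sort {
    fn output_ordering(&self) -> Option<&[String]> {
        Some(&self.keys)
    }
    fn relies_on_input_order(&self) -> bool {
        false
    }
}

struct Limit {
    input: Box<dyn Plan>,
}
impl Plan for Limit {
    fn output_ordering(&self) -> Option<&[String]> {
        self.input.output_ordering()
    }
    /// "First N rows" is only order-sensitive if the input has an order.
    fn relies_on_input_order(&self) -> bool {
        self.input.output_ordering().is_some()
    }
}
```

For Limit / Filter / Scan the limit reports `false`, so the scan below it may be repartitioned; for Limit / Sort / Scan it reports `true`, so its direct input must be left alone.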

I worry that this will lead to two classes of hard to track down bugs:

1. ExecutionPlans that incorrectly report None for output_ordering
2. Plans that make assumptions about ordering without encoding this into DataFusion

yes, I think these are indeed two classes of hard to track down bugs that can/will occur if DataFusion starts optimizing based on sort orders. (cc @NGA-TRAN). One might argue that we already have one example of such a bug in #423 😆 . I will add some more comments to try and make it harder to forget.

I guess I just wonder if this is really worth the potential headaches 😅

Well the real question is what is the alternative 🤔 Some thoughts are:

Be conservative for operators like Limit and simply don't repartition / do anything to their inputs
I could also special case Limit (for example look for a SortExec or SortPreservingMerge anywhere below it)

alamb · 2022-02-08T17:09:36Z

I also see output_ordering as the foundation for more sophisticated transformations, such as avoiding sorts when the input data is already sorted (for example, because the parquet file was already sorted, or because IOx sorted the data to deduplicate it).
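As a sketch of the kind of check such a transformation might start from (an assumption on my part, nothing like this is in the PR): a sort can be skipped when the required sort keys are a prefix of the ordering the input already reports. `ordering_satisfied` is a hypothetical helper and sort keys are reduced to column names.

```rust
/// Returns true when data ordered by `existing` already satisfies `required`,
/// i.e. `required` is a prefix of `existing`.
fn ordering_satisfied(existing: Option<&[&str]>, required: &[&str]) -> bool {
    match existing {
        Some(keys) => keys.starts_with(required),
        // With no known ordering, only an empty requirement is satisfied.
        None => required.is_empty(),
    }
}
```

Data reported as sorted on `(a, b)` satisfies a required sort on `a`, but not one on `b` alone, and an input that reports `None` always needs the sort.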


alamb · 2022-02-08T19:47:51Z

@tustvold and @xudong963 I think I have addressed all of your comments.

Dandandan · 2022-02-08T20:12:47Z

datafusion/src/physical_plan/hash_join.rs

@@ -278,6 +279,14 @@ impl ExecutionPlan for HashJoinExec {
        self.right.output_partitioning()
    }

+    fn output_ordering(&self) -> Option<&[PhysicalSortExpr]> {
+        None


Probably doesn't make sense to address this now, but order of the right side might be (fully or partially) maintained for hash joins.

To give a concrete example, if the right side of the join is sorted on field x and we use an inner join, the output is sorted on x too as rows are not reordered.

👍 makes sense.

I think that only applies for inner joins though (as some types of outer joins may stick nulls into inconvenient places 😆 )

Yes 😂 I think right join is the other one that maintains sortedness too.
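A toy illustration of why the probe side's order can survive an inner hash join. This is a sketch under the assumption that the hash table is built from the left side and the right side is streamed through it in order; `inner_hash_join` is a hypothetical helper, not the HashJoinExec implementation.

```rust
use std::collections::HashMap;

/// Toy inner hash join on an integer key: build a hash table from the left
/// side, then stream the right side through it in its original order.
fn inner_hash_join(left: &[(i32, &str)], right: &[(i32, &str)]) -> Vec<(i32, String, String)> {
    let mut table: HashMap<i32, Vec<&str>> = HashMap::new();
    for &(key, value) in left {
        table.entry(key).or_default().push(value);
    }

    let mut out = Vec::new();
    for &(key, right_value) in right {
        // Rows are emitted as each right row is probed, so right-side order
        // carries over to the output.
        if let Some(matches) = table.get(&key) {
            for left_value in matches {
                out.push((key, left_value.to_string(), right_value.to_string()));
            }
        }
    }
    out
}
```

Joining an unsorted left side with a right side sorted on the key produces output whose keys are still in the right side's order. An outer join that appends unmatched left rows at the end would break this, which matches the caveat above.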

alamb · 2022-02-08T20:39:55Z

Something else I have been musing about is how to handle knowledge that the data is sorted only after a partition is executed.

For example, let's say in some future world, that when GroupByHash spills to disk it will produce the output in sorted group key order. If this is then fed into a Sort then at runtime if the GroupByHash spills the sort could simply merge its input partitions rather than having to actually sort them.

🤔
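To illustrate the "merge rather than sort" idea in the musing above, here is a minimal sketch of a k-way merge over partitions that are each already sorted. `merge_sorted` is a hypothetical helper operating on integers, not an operator from DataFusion.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted partitions into one sorted output without re-sorting.
fn merge_sorted(partitions: &[Vec<i32>]) -> Vec<i32> {
    // Heap entries are (value, partition index, position within partition);
    // `Reverse` turns the max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (p, part) in partitions.iter().enumerate() {
        if let Some(&first) = part.first() {
            heap.push(Reverse((first, p, 0usize)));
        }
    }

    let mut out = Vec::new();
    while let Some(Reverse((value, p, i))) = heap.pop() {
        out.push(value);
        // Refill the heap from the partition that just produced a value.
        if let Some(&next) = partitions[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    out
}
```

The merge only ever looks at the head of each partition, so it is valid exactly when every partition is sorted; the open question in the comment is how an operator would learn that at runtime.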

alamb · 2022-02-09T14:36:02Z

Thanks to everyone who took a look at this 👍

alamb added 2 commits February 7, 2022 15:48

Do not repartition sorted inputs SortPreservingMerge

2858e34

Add notion of sortedness to ExecutionPlan, use to avoid repartition…

463c048

…ing when that would result in incorrect behavior

github-actions bot added the datafusion Changes in the datafusion crate label Feb 7, 2022

alamb commented Feb 7, 2022


fix: fix ballitsa

9dd4524

github-actions bot added the ballista label Feb 7, 2022

alamb changed the title ~~Add `output_~~ ExecutionPlan reports on sortedness and repartitioning optimizer pass respect the invariants Feb 7, 2022

alamb added the api change Changes the API exposed to users of the crate label Feb 7, 2022

alamb changed the title ~~ExecutionPlan reports on sortedness and repartitioning optimizer pass respect the invariants~~ Update ExecutionPlan reports on sortedness and repartitioning optimizer pass respect the invariants Feb 7, 2022

alamb changed the title ~~Update ExecutionPlan reports on sortedness and repartitioning optimizer pass respect the invariants~~ Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants Feb 7, 2022

alamb mentioned this pull request Feb 7, 2022

Design how to respect output stream ordering #424

Closed

alamb marked this pull request as ready for review February 7, 2022 22:01

alamb mentioned this pull request Feb 8, 2022

Do not repartition sorted inputs SortPreservingMerge #1748

Closed

alamb requested review from jimexist, Dandandan and houqp February 8, 2022 12:08

tustvold reviewed Feb 8, 2022


xudong963 reviewed Feb 8, 2022


alamb and others added 2 commits February 8, 2022 11:56

Update datafusion/src/physical_optimizer/repartition.rs

7c974b3

Co-authored-by: xudong.w <wxd963996380@gmail.com>

Update datafusion/src/physical_optimizer/repartition.rs

09aa5ee

Co-authored-by: xudong.w <wxd963996380@gmail.com>

alamb added 7 commits February 8, 2022 12:13

Add more comments

64ffb84

Merge remote-tracking branch 'apache/master' into alamb/less_repartit…

8d1bdb2

…ioing2

Remove special EmptyExec case

785e0e0

restore default for benefits_from_input_partitioning

6583b7e

avoid unecessary check

55aef2c

default relies_on_input_order to true

a809bbd

fix test

b6b662e

Dandandan reviewed Feb 8, 2022


alamb merged commit 071f14a into apache:master Feb 9, 2022

alamb mentioned this pull request Feb 9, 2022

Fix logical conflict #1801

Merged

tustvold mentioned this pull request Mar 9, 2022

Encoding RecordBatch Sort Order apache/arrow-rs#284

Closed

alamb mentioned this pull request Jan 12, 2023

Repartition is being added incorrectly in some cases #4883

Closed

drin mentioned this pull request Mar 6, 2023

[C++][Python] A metadata standard for sorted datasets. apache/arrow#34451

Open

alamb deleted the alamb/less_repartitioing2 branch March 7, 2023 16:36
