feat: eliminate the duplicated sort keys in Order By clause #5462

jackwener · 2023-03-03T04:15:22Z

Which issue does this PR close?

Closes #5296.

Rationale for this change

What changes are included in this PR?

Adds a new optimizer rule that eliminates duplicated sort keys in the ORDER BY clause.

Are these changes tested?

Yes, a unit test is added.

Are there any user-facing changes?

jackwener · 2023-03-03T04:17:18Z

cc @mingmwang

mingmwang · 2023-03-03T05:44:03Z

@jackwener
We should apply the same dedup logic to the Sort expr in Window.

mingmwang · 2023-03-03T05:50:51Z

Also, I think we cannot use a HashSet to do the dedup.

A HashSet might change the ordering of the sort exprs.
We should also ignore the sort options (ASC/DESC) when comparing keys.
For example: select * from t1 order by id desc, id, name, id asc;
The real sort key should be: Sort Key: id DESC, name
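
The requirement above (keep the first occurrence of each expression, ignore ASC/DESC when comparing, preserve the written order) can be sketched as follows. This is only an illustration: `SortKey`, `key`, and `dedup_sort_keys` are hypothetical names standing in for DataFusion's sort expressions, not the code in this PR.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a sort expression: a column name plus a direction.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    expr: String,
    asc: bool,
}

/// Hypothetical helper to build a key.
fn key(expr: &str, asc: bool) -> SortKey {
    SortKey { expr: expr.to_string(), asc }
}

/// Keep the first occurrence of each expression and drop later repeats.
/// The HashSet only records which expressions were already seen; the output
/// order comes from walking the input slice, so it is never reordered.
fn dedup_sort_keys(keys: &[SortKey]) -> Vec<SortKey> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for k in keys {
        // Compare on the expression only, ignoring the sort direction.
        if seen.insert(k.expr.as_str()) {
            out.push(k.clone());
        }
    }
    out
}

fn main() {
    // ORDER BY id DESC, id, name, id ASC
    let keys = vec![
        key("id", false),
        key("id", true),
        key("name", true),
        key("id", true),
    ];
    for k in dedup_sort_keys(&keys) {
        println!("{} {}", k.expr, if k.asc { "ASC" } else { "DESC" });
    }
}
```

Using the set purely as a "seen" marker, rather than collecting the keys into it, is what keeps the original ordering intact.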

jackwener · 2023-03-03T07:06:29Z

@mingmwang thank you!
I overlooked that we need to keep the order.

jackwener · 2023-03-03T07:20:44Z

> We should apply the same dedup logic to the Sort expr in Window.

Yes. We could also apply it to join conditions, filter exprs, and so on.

alamb

Makes sense to me -- thank you @jackwener

I didn't see this feature hooked up into the standard list of optimizer passes (so it is not clear to me if this new rule is ever executed).

Would it be possible to write a SQL-level test too (maybe sqllogictest) with an EXPLAIN showing that a query that had ORDER BY a, b, a indeed only sorted on ORDER BY a, b?

jackwener · 2023-03-06T06:55:30Z

> I didn't see this feature hooked up into the standard list of optimizer passes (so it is not clear to me if this new rule is ever executed)

I forgot it 😂.

I added it to the optimizer and added a test in sqllogictest.

alamb · 2023-03-06T11:11:49Z

datafusion/core/tests/sqllogictests/test_files/order.slt

@@ -155,6 +154,54 @@ SELECT c1, c2 FROM test ORDER BY c1 DESC, c2 ASC
 0 9
 0 10

+# eliminate duplicated sorted expr


👍 it would also be awesome to add an EXPLAIN test to show that the sort was only on c1, c2

ursabot · 2023-03-06T15:02:16Z

Benchmark runs are scheduled for baseline = 21e33a3 and contender = d0bd28e. d0bd28e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

mingmwang · 2023-03-07T02:15:43Z

@jackwener

Have you considered ignoring the sort options?

For example select * from t1 order by id desc, id, name, id asc;
The real sort key should be: Sort Key: id DESC, name

You can test this in PostgreSQL.

mingmwang · 2023-03-07T02:20:03Z

You need to normalize the sort exprs to the same sort options before doing the dedup.
I remember there is similar dedup logic for the sort exprs in window functions:

pub fn generate_sort_key(
    partition_by: &[Expr],
    order_by: &[Expr],
) -> Result<WindowSortKey> {
    let normalized_order_by_keys = order_by
        .iter()
        .map(|e| match e {
            Expr::Sort(Sort { expr, .. }) => {
                Ok(Expr::Sort(Sort::new(expr.clone(), true, false)))
            }
            _ => Err(DataFusionError::Plan(
                "Order by only accepts sort expressions".to_string(),
            )),
        })
        .collect::<Result<Vec<_>>>()?;

    let mut final_sort_keys = vec![];
    let mut is_partition_flag = vec![];
    partition_by.iter().for_each(|e| {
        // By default, create sort key with ASC is true and NULLS LAST to be consistent with
        // PostgreSQL's rule: https://www.postgresql.org/docs/current/queries-order.html
        let e = e.clone().sort(true, false);
        if let Some(pos) = normalized_order_by_keys.iter().position(|key| key.eq(&e)) {
            let order_by_key = &order_by[pos];
            if !final_sort_keys.contains(order_by_key) {
                final_sort_keys.push(order_by_key.clone());
                is_partition_flag.push(true);
            }
        } else if !final_sort_keys.contains(&e) {
            final_sort_keys.push(e);
            is_partition_flag.push(true);
        }
    });

    order_by.iter().for_each(|e| {
        if !final_sort_keys.contains(e) {
            final_sort_keys.push(e.clone());
            is_partition_flag.push(false);
        }
    });
    let res = final_sort_keys
        .into_iter()
        .zip(is_partition_flag)
        .map(|(lhs, rhs)| (lhs, rhs))
        .collect::<Vec<_>>();
    Ok(res)
}
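
To make the behaviour of the function above easier to follow, here is a simplified standalone model of it. It is a sketch under assumptions: `SortExpr`, `sort`, and `window_sort_key` are hypothetical names, columns are plain strings, and NULLS ordering is left out. It is not the DataFusion API.

```rust
/// Hypothetical simplified sort expression: a column name plus a direction.
#[derive(Debug, Clone, PartialEq)]
struct SortExpr {
    column: String,
    asc: bool,
}

/// Hypothetical helper to build a sort expression.
fn sort(column: &str, asc: bool) -> SortExpr {
    SortExpr { column: column.to_string(), asc }
}

/// Returns (sort key, is_partition_key) pairs, mirroring the two passes above.
fn window_sort_key(partition_by: &[&str], order_by: &[SortExpr]) -> Vec<(SortExpr, bool)> {
    let mut keys: Vec<(SortExpr, bool)> = Vec::new();

    // Pass 1: partition columns come first. If a column also appears in
    // ORDER BY (compared ignoring direction), reuse that ORDER BY entry so its
    // direction is kept; otherwise default to ASC.
    for col in partition_by {
        let candidate = order_by
            .iter()
            .find(|o| o.column == *col)
            .cloned()
            .unwrap_or_else(|| sort(col, true));
        if !keys.iter().any(|(k, _)| *k == candidate) {
            keys.push((candidate, true));
        }
    }

    // Pass 2: append the remaining ORDER BY entries. Note that this pass
    // compares whole entries, so only exact repeats (same column and
    // direction) are dropped here.
    for o in order_by {
        if !keys.iter().any(|(k, _)| k == o) {
            keys.push((o.clone(), false));
        }
    }
    keys
}

fn main() {
    // PARTITION BY a, b ORDER BY a DESC, c
    let keys = window_sort_key(&["a", "b"], &[sort("a", false), sort("c", true)]);
    for (k, is_partition) in &keys {
        println!("{} asc={} partition={}", k.column, k.asc, is_partition);
    }
}
```

In this model only the partition columns are matched against normalized ORDER BY entries; the ORDER BY entries themselves are deduplicated on exact equality.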

jackwener · 2023-03-07T04:35:04Z

> @jackwener
>
> Have you considered ignoring the sort options?
>
> For example select * from t1 order by id desc, id, name, id asc; The real sort key should be: Sort Key: id DESC, name
>
> You can test this in PostgreSQL.

@mingmwang Makes great sense to me. Could you open a new PR to enhance it?

mingmwang · 2023-03-07T06:31:46Z

Sure, I will work on the follow-up PR.

github-actions bot added the optimizer Optimizer rules label Mar 3, 2023

feat: eliminate the duplicated sort keys in Order By clause

4c5b315

jackwener force-pushed the eliminate_sort branch from 9d537a2 to 4c5b315 Compare March 3, 2023 04:15

must keep sort order.

c1d38b5

alamb reviewed Mar 4, 2023

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 6, 2023

add it into optimizer

4c484a5

jackwener force-pushed the eliminate_sort branch from 105a33e to 4c484a5 Compare March 6, 2023 07:14

alamb approved these changes Mar 6, 2023

add plantree test

aa06288

jackwener force-pushed the eliminate_sort branch from 551f257 to aa06288 Compare March 6, 2023 14:07

jackwener merged commit d0bd28e into apache:main Mar 6, 2023

jackwener deleted the eliminate_sort branch March 7, 2023 04:33

andygrove added the enhancement New feature or request label Mar 12, 2023

mingmwang mentioned this pull request Mar 15, 2023

[FOLLOWUP] eliminate the duplicated sort keys in Order By clause #5607

Merged
