feat: eliminate the duplicated sort keys in Order By clause #5462

Merged
merged 4 commits into from
Mar 6, 2023

jackwener
Member

Which issue does this PR close?

Closes #5296.

Rationale for this change

What changes are included in this PR?

Adds a new optimizer rule that eliminates duplicated sort keys in the ORDER BY clause.

Are these changes tested?

Yes, unit tests are added.

Are there any user-facing changes?

@github-actions github-actions bot added the optimizer Optimizer rules label Mar 3, 2023
@jackwener
Member Author

cc @mingmwang

@mingmwang
Contributor

@jackwener
We should apply the same dedup logic to the Sort expr in Window.

@mingmwang
Contributor

Also, I don't think we can use a HashSet on its own for the dedup:

  1. A HashSet might change the ordering of the sort exprs.
  2. We should ignore the sort options (ASC/DESC) when comparing keys.
    For example: select * from t1 order by id desc, id, name, id asc;
    The real sort key should be: Sort Key: id DESC, name
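
The two points above can be illustrated with a minimal, self-contained sketch. It is not the rule from this PR: `SortKey` and `dedup_sort_keys` are hypothetical names, and the struct is a simplified stand-in for DataFusion's sort expression. The idea is to keep results in a `Vec` so the first-occurrence order survives, and to use a `HashSet` only as a "seen" set keyed on the expression, so that ASC/DESC and NULLS options do not take part in the comparison.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a sort expression.
/// Hypothetical type for illustration, not DataFusion's `Expr::Sort`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub expr: String,
    pub asc: bool,
    pub nulls_first: bool,
}

impl SortKey {
    pub fn new(expr: &str, asc: bool, nulls_first: bool) -> Self {
        Self {
            expr: expr.to_string(),
            asc,
            nulls_first,
        }
    }
}

/// Keeps the first occurrence of each sort expression and drops later ones,
/// comparing on the expression only (sort options are ignored).
/// The output preserves the input order.
pub fn dedup_sort_keys(keys: &[SortKey]) -> Vec<SortKey> {
    // The set is only used for membership checks, never for iteration,
    // so its unordered nature cannot affect the result order.
    let mut seen: HashSet<&str> = HashSet::new();
    let mut out = Vec::with_capacity(keys.len());
    for key in keys {
        // `insert` returns false when the expression was already present.
        if seen.insert(key.expr.as_str()) {
            out.push(key.clone());
        }
    }
    out
}
```

Applied to the example query, the keys `id DESC, id, name, id ASC` would reduce to `id DESC, name`, since the first `id` wins and keeps its own options.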

@jackwener
Member Author

@mingmwang thank you!
I overlooked that we need to preserve the order.

@jackwener
Member Author

We should apply the same dedup logic to the Sort expr in Window.

Yes. We could also apply it to Join conditions, Filter exprs, and so on.

Contributor

@alamb alamb left a comment

Makes sense to me -- thank you @jackwener

I didn't see this feature hooked up into the standard list of optimizer passes (so it is not clear to me if this new rule is ever executed)

Would it be possible to write a sql level test too (maybe sqllogictest) with an explain showing that a query that had ORDER BY a, b, a indeed only sorted on ORDER BY a, b?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 6, 2023
@jackwener
Member Author

jackwener commented Mar 6, 2023

I didn't see this feature hooked up into the standard list of optimizer passes (so it is not clear to me if this new rule is ever executed)

I forgot it 😂.

I've added the rule to the optimizer and added a test in sqllogictest.

@@ -155,6 +154,54 @@ SELECT c1, c2 FROM test ORDER BY c1 DESC, c2 ASC
0 9
0 10

# eliminate duplicated sorted expr
Contributor

👍 it would also be awesome to add an EXPLAIN test to show that the sort was only on c1, c2

@jackwener jackwener merged commit d0bd28e into apache:main Mar 6, 2023
@ursabot

ursabot commented Mar 6, 2023

Benchmark runs are scheduled for baseline = 21e33a3 and contender = d0bd28e. d0bd28e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@mingmwang
Contributor

@jackwener

Have you considered ignoring the sort options?

For example select * from t1 order by id desc, id, name, id asc;
The real sort key should be: Sort Key: id DESC, name

You can test this in PostgreSQL.

@mingmwang
Contributor

You need to normalize the sort exprs to the same sort options before doing the dedup.
I remember there is similar dedup logic for the sort exprs in window functions:

pub fn generate_sort_key(
    partition_by: &[Expr],
    order_by: &[Expr],
) -> Result<WindowSortKey> {
    let normalized_order_by_keys = order_by
        .iter()
        .map(|e| match e {
            Expr::Sort(Sort { expr, .. }) => {
                Ok(Expr::Sort(Sort::new(expr.clone(), true, false)))
            }
            _ => Err(DataFusionError::Plan(
                "Order by only accepts sort expressions".to_string(),
            )),
        })
        .collect::<Result<Vec<_>>>()?;

    let mut final_sort_keys = vec![];
    let mut is_partition_flag = vec![];
    partition_by.iter().for_each(|e| {
        // By default, create sort key with ASC is true and NULLS LAST to be consistent with
        // PostgreSQL's rule: https://www.postgresql.org/docs/current/queries-order.html
        let e = e.clone().sort(true, false);
        if let Some(pos) = normalized_order_by_keys.iter().position(|key| key.eq(&e)) {
            let order_by_key = &order_by[pos];
            if !final_sort_keys.contains(order_by_key) {
                final_sort_keys.push(order_by_key.clone());
                is_partition_flag.push(true);
            }
        } else if !final_sort_keys.contains(&e) {
            final_sort_keys.push(e);
            is_partition_flag.push(true);
        }
    });

    order_by.iter().for_each(|e| {
        if !final_sort_keys.contains(e) {
            final_sort_keys.push(e.clone());
            is_partition_flag.push(false);
        }
    });
    let res = final_sort_keys
        .into_iter()
        .zip(is_partition_flag)
        .map(|(lhs, rhs)| (lhs, rhs))
        .collect::<Vec<_>>();
    Ok(res)
}
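
For readers without the DataFusion types at hand, the same idea can be sketched in standalone Rust. This is a simplified model under stated assumptions, not the library's code: `Key` and `window_sort_key` are hypothetical names, and expressions are reduced to column names. Unlike the function quoted above, whose second loop compares full sort expressions including options, this sketch compares on the column only in both loops, which is the behavior suggested in this thread.

```rust
/// Hypothetical simplified sort key: a column name plus its sort options.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Key {
    pub col: String,
    pub asc: bool,
    pub nulls_first: bool,
}

/// Builds a window sort key: PARTITION BY columns first, then the remaining
/// ORDER BY keys. The bool in each pair marks whether the key came from
/// PARTITION BY. Duplicates are detected by column name only.
pub fn window_sort_key(partition_by: &[&str], order_by: &[Key]) -> Vec<(Key, bool)> {
    let mut out: Vec<(Key, bool)> = Vec::new();

    for &col in partition_by {
        if out.iter().any(|(k, _)| k.col == col) {
            continue; // partition column already emitted
        }
        // Reuse the options of the first matching ORDER BY key if there is
        // one; otherwise default to ASC NULLS LAST.
        let key = order_by
            .iter()
            .find(|k| k.col == col)
            .cloned()
            .unwrap_or(Key {
                col: col.to_string(),
                asc: true,
                nulls_first: false,
            });
        out.push((key, true));
    }

    for k in order_by {
        // Skip ORDER BY keys whose column is already part of the sort key,
        // regardless of their ASC/DESC or NULLS options.
        if !out.iter().any(|(seen, _)| seen.col == k.col) {
            out.push((k.clone(), false));
        }
    }
    out
}
```

With `PARTITION BY a, c` and `ORDER BY b DESC, a, b`, this sketch would yield `a`, `c` (both flagged as partition keys) followed by `b DESC`.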

@jackwener jackwener deleted the eliminate_sort branch March 7, 2023 04:33
@jackwener
Member Author

@jackwener

Have you considered ignoring the sort options?

For example select * from t1 order by id desc, id, name, id asc; The real sort key should be: Sort Key: id DESC, name

You can test this in PostgreSQL.

@mingmwang That makes great sense to me. Could you open a new PR to enhance it?

@mingmwang
Contributor

Sure, I will work on a follow-up PR.

Labels
core Core DataFusion crate enhancement New feature or request optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trim the duplicated sort keys in Order By clause
5 participants