-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: eliminate the duplicated sort keys in Order By clause #5462
Conversation
9d537a2
to
4c5b315
Compare
cc @mingmwang |
@jackwener |
And I think we can not use the HashSet to do the dedup.
|
@mingmwang thank you! |
Yes. And we also can apply in Join condition / Filter expr ..... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me -- thank you @jackwener
I didn't see this feature hooked up into the standard list of optimizer passes (so it is not clear to me if this new rule is ever executed)
Would it be possible to write a sql level test too (maybe sqllogictest) with an explain showing that a query that had ORDER BY a, b, a
indeed only sorted on ORDER BY a, b
?
I forgot it 😂. added it into optimizer and added test in sqllogicaltest. |
105a33e
to
4c484a5
Compare
@@ -155,6 +154,54 @@ SELECT c1, c2 FROM test ORDER BY c1 DESC, c2 ASC | |||
0 9 | |||
0 10 | |||
|
|||
# eliminate duplicated sorted expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 it would also be awesome to add an EXPLAIN
test to show that the sort was only on c1, c2
551f257
to
aa06288
Compare
Benchmark runs are scheduled for baseline = 21e33a3 and contender = d0bd28e. d0bd28e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Have you considered ignoring the sort options? For example You can test this in PostgreSQL. |
You need to normalize the sort expr with same sort options to do the dedup. pub fn generate_sort_key(
partition_by: &[Expr],
order_by: &[Expr],
) -> Result<WindowSortKey> {
let normalized_order_by_keys = order_by
.iter()
.map(|e| match e {
Expr::Sort(Sort { expr, .. }) => {
Ok(Expr::Sort(Sort::new(expr.clone(), true, false)))
}
_ => Err(DataFusionError::Plan(
"Order by only accepts sort expressions".to_string(),
)),
})
.collect::<Result<Vec<_>>>()?;
let mut final_sort_keys = vec![];
let mut is_partition_flag = vec![];
partition_by.iter().for_each(|e| {
// By default, create sort key with ASC is true and NULLS LAST to be consistent with
// PostgreSQL's rule: https://www.postgresql.org/docs/current/queries-order.html
let e = e.clone().sort(true, false);
if let Some(pos) = normalized_order_by_keys.iter().position(|key| key.eq(&e)) {
let order_by_key = &order_by[pos];
if !final_sort_keys.contains(order_by_key) {
final_sort_keys.push(order_by_key.clone());
is_partition_flag.push(true);
}
} else if !final_sort_keys.contains(&e) {
final_sort_keys.push(e);
is_partition_flag.push(true);
}
});
order_by.iter().for_each(|e| {
if !final_sort_keys.contains(e) {
final_sort_keys.push(e.clone());
is_partition_flag.push(false);
}
});
let res = final_sort_keys
.into_iter()
.zip(is_partition_flag)
.map(|(lhs, rhs)| (lhs, rhs))
.collect::<Vec<_>>();
Ok(res)
} |
@mingmwang Make great sense to me, can you new a PR to enhance it? |
Sure, I will work on the following PR. |
Which issue does this PR close?
Closes #5296.
Rationale for this change
What changes are included in this PR?
a new rule to eliminate the duplicated sort keys in Order By clause
Are these changes tested?
add unit test
Are there any user-facing changes?