Refine the statistics estimation for the limit and aggregate operator #4716

Merged (2 commits) on Jan 17, 2023

@yahoNanJing yahoNanJing commented Dec 23, 2022

Which issue does this PR close?

Closes #4715.

Rationale for this change

With the row count information introduced here, the JoinSelection optimizer rule can choose the CollectLeft partition mode instead of Partitioned for SQL queries of the following pattern. On Ballista, this reduces the query duration from 7.5s to 4.5s for a data set of 1.3 billion rows.

select column_of_high_cardinality,
       sum(measure_1)
from table_0
where column_of_high_cardinality in
    (select column_of_high_cardinality
     from table_0
     group by column_of_high_cardinality
     order by sum(measure_0) desc
     limit 1000)
group by column_of_high_cardinality
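
To illustrate why row counts on the limit and aggregate operators matter for this query, here is a minimal sketch of the idea. It is not the code in this PR: `RowEstimate`, `limit_estimate`, `aggregate_estimate`, `prefer_collect_left` and the row threshold are all hypothetical names chosen for illustration, and the real DataFusion statistics types carry more information than this.

```rust
/// Hypothetical, simplified stand-in for an operator's statistics
/// (not DataFusion's actual `Statistics` type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowEstimate {
    /// Estimated number of output rows, if anything is known.
    pub num_rows: Option<usize>,
    /// True when `num_rows` is exact rather than a bound or a guess.
    pub is_exact: bool,
}

/// Estimate for `LIMIT fetch OFFSET skip` applied on top of `input`.
pub fn limit_estimate(input: RowEstimate, skip: usize, fetch: Option<usize>) -> RowEstimate {
    match (input.num_rows, fetch) {
        // Input size known: output is what remains after skipping, capped by `fetch`.
        (Some(rows), _) => {
            let remaining = rows.saturating_sub(skip);
            RowEstimate {
                num_rows: Some(fetch.map_or(remaining, |f| remaining.min(f))),
                is_exact: input.is_exact,
            }
        }
        // Input size unknown: `fetch` is still an upper bound on the output.
        (None, Some(f)) => RowEstimate { num_rows: Some(f), is_exact: false },
        // Nothing known and no fetch: no estimate possible.
        (None, None) => RowEstimate { num_rows: None, is_exact: false },
    }
}

/// Estimate for a final aggregate with `group_by_columns` grouping columns.
pub fn aggregate_estimate(input: RowEstimate, group_by_columns: usize) -> RowEstimate {
    if group_by_columns == 0 {
        // Without GROUP BY, a final aggregate produces a single row.
        RowEstimate { num_rows: Some(1), is_exact: true }
    } else {
        // The number of groups cannot exceed the number of input rows,
        // so the input row count serves as an inexact upper bound.
        RowEstimate { num_rows: input.num_rows, is_exact: false }
    }
}

/// Assumed decision rule: collect the build side into one partition
/// only when it is known to be small enough.
pub fn prefer_collect_left(build_side: RowEstimate, threshold_rows: usize) -> bool {
    matches!(build_side.num_rows, Some(n) if n <= threshold_rows)
}
```

Under these assumptions, the subquery above (a grouped aggregate followed by `limit 1000`) gets an estimate of at most 1000 rows even when the size of `table_0` is unknown, which is what allows a rule like `prefer_collect_left` to pick the small side as the collected build side.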

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 23, 2022
yahoNanJing commented Dec 23, 2022

Hi @alamb, @Dandandan, @mingmwang, could you help review this PR?

By the way, this PR should be merged after #4714, since it depends on the global sort algorithm selection.

@yahoNanJing yahoNanJing marked this pull request as draft December 23, 2022 09:05
alamb commented Dec 26, 2022

I believe this PR actually builds on #4714

(the idea of improving the statistics for limits and aggregates is a good one, 👍 )

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Dec 28, 2022
@yahoNanJing yahoNanJing marked this pull request as ready for review December 28, 2022 02:10
yahoNanJing commented

Hi @alamb, since #4714 has been merged, this PR is now ready for review.

@alamb alamb left a comment

Code looks reasonable to me 👍

I think this PR needs some tests to verify the behavior (and to make sure we don't break it by accident in a follow-on PR)

@alamb alamb marked this pull request as draft January 5, 2023 19:46
alamb commented Jan 5, 2023

Marking as draft to make it clearer that the PR is awaiting some tests prior to merge

yahoNanJing commented

Hi @alamb, is this PR ready for review and merge now?

@alamb alamb left a comment

LGTM -- thank you @yahoNanJing

@alamb alamb marked this pull request as ready for review January 17, 2023 01:23
@alamb alamb merged commit 8ab3a91 into apache:master Jan 17, 2023
ursabot commented Jan 17, 2023

Benchmark runs are scheduled for baseline = 03ef500 and contender = 8ab3a91. 8ab3a91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
core Core DataFusion crate optimizer Optimizer rules
4 participants