Refine the statistics estimation for the limit and aggregate operator #4716

yahoNanJing · 2022-12-23T06:29:30Z

Which issue does this PR close?

Closes #4715.

Rationale for this change

With the row count estimates introduced here, for a SQL query like the following, the JoinSelection optimizer rule can choose the CollectLeft partition mode instead of Partitioned. This reduces the query duration on Ballista from 7.5s to 4.5s for a data set of 1.3 billion rows.

select column_of_high_cardinality,
       sum(measure_1)
from table_0
where column_of_high_cardinality in
    (select column_of_high_cardinality
     from table_0
     group by column_of_high_cardinality
     order by sum(measure_0) desc
     limit 1000)
group by column_of_high_cardinality
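
To illustrate the reasoning, here is a minimal sketch of how row count estimates for a limit and an aggregate could feed a join mode decision. It is not the code in this PR and does not use DataFusion's actual types: `RowEstimate`, `estimate_limit`, `estimate_aggregate`, `choose_partition_mode`, and the `threshold_rows` parameter are all hypothetical names, and the estimates are treated as upper bounds.

```rust
/// Hypothetical row-count estimate (an upper bound); `None` means unknown.
type RowEstimate = Option<usize>;

/// Limit: after skipping `skip` input rows, at most `fetch` rows come out.
fn estimate_limit(input: RowEstimate, skip: usize, fetch: Option<usize>) -> RowEstimate {
    match (input, fetch) {
        (Some(rows), Some(fetch)) => Some(rows.saturating_sub(skip).min(fetch)),
        (Some(rows), None) => Some(rows.saturating_sub(skip)),
        // Input size unknown: `fetch` alone is still an upper bound.
        (None, Some(fetch)) => Some(fetch),
        (None, None) => None,
    }
}

/// Aggregate: without GROUP BY there is a single output row; with GROUP BY
/// the output cannot exceed the number of input rows.
fn estimate_aggregate(input: RowEstimate, has_group_by: bool) -> RowEstimate {
    if has_group_by {
        input
    } else {
        Some(1)
    }
}

#[derive(Debug, PartialEq)]
enum PartitionMode {
    CollectLeft,
    Partitioned,
}

/// Assumed decision rule: collect the build side into one partition only
/// when it is known to be small; otherwise repartition both sides.
fn choose_partition_mode(build_side: RowEstimate, threshold_rows: usize) -> PartitionMode {
    match build_side {
        Some(rows) if rows <= threshold_rows => PartitionMode::CollectLeft,
        _ => PartitionMode::Partitioned,
    }
}
```

Under these assumptions, the subquery above goes from 1.3 billion input rows, through a grouped aggregate (still bounded by 1.3 billion), to `limit 1000`, so its estimate becomes at most 1000 rows. A build side with a known small bound qualifies for CollectLeft, while one with an unknown row count falls back to Partitioned.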

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

yahoNanJing · 2022-12-23T07:15:38Z

Hi @alamb, @Dandandan, @mingmwang, could you help review this PR?

By the way, this PR should be merged after #4714, since it depends on the global sort algorithm selection.

alamb · 2022-12-26T13:14:47Z

I believe this PR actually builds on #4714

(the idea of improving the statistics for limits and aggregates is a good one, 👍 )

yahoNanJing · 2022-12-28T02:56:14Z

Hi @alamb, #4714 has been merged, so this PR is now ready for review.

alamb

Code looks reasonable to me 👍

I think this PR needs some tests to verify the behavior (and to make sure we don't break it by accident in a follow-on PR)

alamb · 2023-01-05T19:47:15Z

Marking as draft so it is more clear the PR is awaiting some tests prior to merge

yahoNanJing · 2023-01-16T10:47:26Z

Hi @alamb, is this PR ready for review and merge now?

alamb

LGTM -- thank you @yahoNanJing

ursabot · 2023-01-17T01:33:09Z

Benchmark runs are scheduled for baseline = 03ef500 and contender = 8ab3a91. 8ab3a91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 23, 2022

yahoNanJing force-pushed the issue-4715 branch from 714c5df to d229c08 Compare December 23, 2022 07:14

yahoNanJing requested review from alamb and Dandandan December 23, 2022 07:15

yahoNanJing marked this pull request as draft December 23, 2022 09:05

yahoNanJing force-pushed the issue-4715 branch from d229c08 to ecdc0e2 Compare December 28, 2022 02:09

github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Dec 28, 2022

yahoNanJing marked this pull request as ready for review December 28, 2022 02:10

alamb reviewed Dec 28, 2022

alamb marked this pull request as draft January 5, 2023 19:46

kyotoYaho added 2 commits January 16, 2023 18:40

Refine the statistics estimation for the limit and aggregate operator

510960b

Fix cargo clippy

e9b819f

yahoNanJing force-pushed the issue-4715 branch from ecdc0e2 to e9b819f Compare January 16, 2023 10:46

github-actions bot added the optimizer Optimizer rules label Jan 16, 2023

alamb approved these changes Jan 17, 2023

alamb marked this pull request as ready for review January 17, 2023 01:23

alamb merged commit 8ab3a91 into apache:master Jan 17, 2023
