-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refine the statistics estimation for the limit and aggregate operator #4716
Conversation
714c5df
to
d229c08
Compare
Hi @alamb, @Dandandan, @mingmwang, could you help review this PR? By the way, this PR should be merged after #4714, since it depends on the global sort algorithm selection. |
I believe this PR actually builds on #4714 (the idea of improving the statistics for limits and aggregates is a good one, 👍 ) |
d229c08
to
ecdc0e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks reasonable to me 👍
I think this PR needs some tests to verify the behavior (and make sure we don't break it by accident in a follow on PR)
Marking as draft so it is more clear the PR is awaiting some tests prior to merge |
ecdc0e2
to
e9b819f
Compare
Hi @alamb, is this PR ready for review and merge now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @yahoNanJing
Benchmark runs are scheduled for baseline = 03ef500 and contender = 8ab3a91. 8ab3a91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4715.
Rationale for this change
With these introduced row count info, for a SQL similar to the following pattern, the
JoinSelection
optimizer rule will successfully be able to choose theCollectLeft
partition mode rather than thePartitioned
, which reduces the query duration running on Ballista from 7.5s to 4.5s for a data set of 1.3 billion rows.What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?