-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-30516][SQL] involve partition filters in the statistic estimation of FileScan #27213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #116754 has finished for PR 27213 at commit
|
|
retest this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about marking this PR as WIP until #27129 is merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since dataFilters is not used in all sub classes of FileIndex, so just remove the dataFilters parameter here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, just changed it since i thought this method return sequence of PartitionDirectory objects, which is not actually to list files.
Will change it back to keep the PR proposal clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang
I removed the conf MAX_PARTITION_NUMBER_FOR_STATS_CALCULATION_VIA_FS in this PR, since it is not needed per discussion in #27129 .
|
Test build #116830 has finished for PR 27213 at commit
|
|
Test build #116837 has finished for PR 27213 at commit
|
|
Test build #116839 has finished for PR 27213 at commit
|
|
Test build #117463 has finished for PR 27213 at commit
|
|
Test build #117571 has finished for PR 27213 at commit
|
|
retest this please |
|
Test build #117601 has finished for PR 27213 at commit
|
|
Test build #117621 has finished for PR 27213 at commit
|
|
Test build #117618 has finished for PR 27213 at commit
|
|
retest this please |
|
Test build #117652 has finished for PR 27213 at commit
|
|
retest this please |
|
Test build #117702 has finished for PR 27213 at commit
|
|
cc @cloud-fan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add an assertion incidentally here, this is not necessary for this PR.
if not ok, i can remove it.
|
Test build #117799 has finished for PR 27213 at commit
|
|
retest this please |
|
Test build #117803 has started for PR 27213 at commit |
|
Test build #117797 has finished for PR 27213 at commit
|
|
Could you add some tests for this improvement? |
… statistic sizeInBytes in FileScan.estimateStatistics
…s equal to the root paths of patitions specified by userSpecifiedPartitionSpec.
and MAX_PARTITION_NUMBER_FOR_STATS_CALCULATION_VIA_FS into account.
@maropu Added one test, please help review. thanks. |
|
Test build #118147 has finished for PR 27213 at commit
|
|
cc @cloud-fan |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Refine FileScan.estimateStatistics to take partitionFilters into account.
Why are the changes needed?
Currently, FileScan.estimateStatistics does not take partitionFilters into account, which may lead to bigger sizeInBytes. It should be reasonable to change it to involve partitionFilters when estimating the statistics.
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing unit tests.