Contributor

fuwhu commented Jan 15, 2020

What changes were proposed in this pull request?

Refine FileScan.estimateStatistics to take partitionFilters into account.

Why are the changes needed?

Currently, FileScan.estimateStatistics does not take partitionFilters into account, so sizeInBytes may be overestimated. Applying partitionFilters when estimating the statistics should give a tighter estimate.
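
As a rough illustration of the idea only (this is not Spark's code), the sketch below models a partitioned table as a list of partitions with file sizes and sums only the partitions that pass every partition filter. The `PartitionDirectory` dataclass here is a toy stand-in, and `estimate_size_in_bytes` and the `dt` column are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PartitionDirectory:
    """Toy stand-in for a partition: its column values and file sizes in bytes."""
    values: Dict[str, object]
    file_sizes: List[int]


def estimate_size_in_bytes(
    partitions: List[PartitionDirectory],
    partition_filters: List[Callable[[Dict[str, object]], bool]],
) -> int:
    """Sum file sizes over the partitions that satisfy every partition filter."""
    return sum(
        sum(p.file_sizes)
        for p in partitions
        if all(f(p.values) for f in partition_filters)
    )


# Hypothetical table with three date partitions.
partitions = [
    PartitionDirectory({"dt": "2020-01-14"}, [100, 200]),
    PartitionDirectory({"dt": "2020-01-15"}, [50]),
    PartitionDirectory({"dt": "2020-01-16"}, [400, 25]),
]

# Without filters every partition is counted; with a filter only matches are.
unfiltered = estimate_size_in_bytes(partitions, [])
filtered = estimate_size_in_bytes(partitions, [lambda v: v["dt"] == "2020-01-15"])
```

With no filters the estimate covers every file, while a filter on a single date shrinks it to that partition's files, which is the kind of reduction in sizeInBytes this PR is after.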

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

SparkQA commented Jan 15, 2020

Test build #116754 has finished for PR 27213 at commit a5a3012.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

fuwhu commented Jan 16, 2020

retest this please.

Contributor Author

This config is proposed in #27129; I will resolve the conflict after #27129 is finished.

Member

How about marking this PR as WIP until #27129 is merged?

Contributor Author

Sure, thanks.

Contributor Author

Since dataFilters is unused in the subclasses of FileIndex, I just removed the dataFilters parameter here.

Member

@fuwhu I would suggest keeping the current method with the dataFilters parameter.
See the discussions in #27157.
Also, the rename is not related to the proposal of this PR, right?

Contributor Author

fuwhu Jan 30, 2020

Yeah, I changed it because this method returns a sequence of PartitionDirectory objects, so it does not actually list files.
I will change it back to keep the PR's proposal clear.

Contributor Author

fuwhu Jan 31, 2020

@gengliangwang
I removed the conf MAX_PARTITION_NUMBER_FOR_STATS_CALCULATION_VIA_FS in this PR, since it is not needed per the discussion in #27129.

SparkQA commented Jan 16, 2020

Test build #116830 has finished for PR 27213 at commit a5a3012.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 16, 2020

Test build #116837 has finished for PR 27213 at commit d7485ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 16, 2020

Test build #116839 has finished for PR 27213 at commit d8949ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 28, 2020

Test build #117463 has finished for PR 27213 at commit d68b50f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fuwhu changed the title from "[SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters and partition number into account" to "[SPARK-30516][SQL][WIP] statistic estimation of FileScan should take partitionFilters and partition number into account" on Jan 30, 2020

SparkQA commented Jan 30, 2020

Test build #117571 has finished for PR 27213 at commit 7790a88.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

fuwhu commented Jan 31, 2020

retest this please

fuwhu changed the title from "[SPARK-30516][SQL][WIP] statistic estimation of FileScan should take partitionFilters and partition number into account" to "[SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters and partition number into account" on Jan 31, 2020

SparkQA commented Jan 31, 2020

Test build #117601 has finished for PR 27213 at commit 7790a88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fuwhu changed the title from "[SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters and partition number into account" to "[SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account" on Jan 31, 2020

SparkQA commented Jan 31, 2020

Test build #117621 has finished for PR 27213 at commit e948650.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 31, 2020

Test build #117618 has finished for PR 27213 at commit c29feb1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

fuwhu commented Jan 31, 2020

retest this please

SparkQA commented Jan 31, 2020

Test build #117652 has finished for PR 27213 at commit e948650.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

fuwhu commented Feb 1, 2020

retest this please

SparkQA commented Feb 1, 2020

Test build #117702 has finished for PR 27213 at commit e948650.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

fuwhu commented Feb 4, 2020

cc @cloud-fan

Contributor Author

fuwhu Feb 4, 2020

I added an assertion here incidentally; it is not necessary for this PR.
If that is not OK, I can remove it.

SparkQA commented Feb 4, 2020

Test build #117799 has finished for PR 27213 at commit 86b9043.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

fuwhu commented Feb 4, 2020

retest this please

SparkQA commented Feb 4, 2020

Test build #117803 has started for PR 27213 at commit 86b9043.

SparkQA commented Feb 4, 2020

Test build #117797 has finished for PR 27213 at commit 95d59a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

maropu commented Feb 10, 2020

Could you add some tests for this improvement?

Contributor Author

fuwhu commented Feb 10, 2020

> Could you add some tests for this improvement?

@maropu Added one test; please help review. Thanks.

SparkQA commented Feb 10, 2020

Test build #118147 has finished for PR 27213 at commit 2e210a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fuwhu changed the title from "[SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account" to "[SPARK-30516][SQL] involve partition filters in the statistic estimation of FileScan" on Feb 11, 2020
Contributor Author

fuwhu commented Feb 11, 2020

cc @cloud-fan

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label May 22, 2020
github-actions bot closed this May 23, 2020