Conversation

@fuwhu
Contributor

@fuwhu fuwhu commented Jan 8, 2020

What changes were proposed in this pull request?

Add the config "spark.sql.statistics.fallBackToFs.maxPartitionNumber" and use it to control whether to calculate statistics through the file system.

Why are the changes needed?

Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of table partitions through the file system (e.g. HDFS), it does not consider the number of partitions. If the number of partitions is huge, calculating the statistics can take a long time, and the result may not be that useful.

It seems reasonable to add a config item limiting the number of partitions for which statistics may be calculated through the file system.
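As a hedged sketch of the intent (the names below are illustrative, not the actual Spark API): the file-system scan is attempted only when the partition count is within the configured limit.

```scala
// Hypothetical sketch of the proposed gating; `scan` stands in for the
// expensive file-system walk and is passed by name so it only runs when
// the partition count is within the limit.
def sizeViaFs(partitionCount: Int, maxPartitionNumber: Int)(scan: => Long): Option[Long] =
  if (partitionCount <= maxPartitionNumber) Some(scan) // cheap enough: do the scan
  else None // too many partitions: skip the file-system walk entirely
```

Because `scan` is a by-name argument, exceeding the limit means no file-system call happens at all.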

Does this PR introduce any user-facing change?

Yes, the statistics of a logical plan may change, which may impact some Spark strategies, e.g. JoinSelection.

How was this patch tested?

Added new unit test.

@fuwhu
Contributor Author

fuwhu commented Jan 8, 2020

This config can also be used in PruneHiveTablePartitions, which is proposed in #26805.
Will update after #26805 is finished.

@fuwhu
Contributor Author

fuwhu commented Jan 8, 2020

cc: @cloud-fan @wangyum

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116283 has finished for PR 27129 at commit dc0a6d1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Contributor Author

fuwhu commented Jan 8, 2020

retest this please

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116294 has finished for PR 27129 at commit dc0a6d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116362 has finished for PR 27129 at commit abeb326.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Contributor Author

fuwhu commented Jan 9, 2020

retest this please

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116375 has finished for PR 27129 at commit abeb326.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Contributor Author

fuwhu commented Jan 9, 2020

retest this please

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116394 has finished for PR 27129 at commit abeb326.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116459 has finished for PR 27129 at commit 2ff9960.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116471 has finished for PR 27129 at commit 7e19d7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2020

Test build #116534 has finished for PR 27129 at commit 0a5ae3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu fuwhu changed the title [SPARK-30427] Add config item for limiting partition number when calculating statistics through File System [SPARK-30427][SQL] Add config item for limiting partition number when calculating statistics through File System Jan 14, 2020
@SparkQA

SparkQA commented Jan 14, 2020

Test build #116712 has finished for PR 27129 at commit aac2bbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2020

Test build #116842 has finished for PR 27129 at commit b29363b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117242 has finished for PR 27129 at commit 5c432f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Contributor Author

fuwhu commented Jan 28, 2020

cc @cloud-fan

@fuwhu fuwhu requested review from cloud-fan and wangyum and removed request for wangyum January 28, 2020 02:27
@SparkQA

SparkQA commented Jan 28, 2020

Test build #117458 has finished for PR 27129 at commit a661994.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu fuwhu requested a review from gengliangwang January 30, 2020 15:36
Member

@gengliangwang gengliangwang left a comment


@fuwhu InMemoryFileIndex caches all the file statuses on construction. Is it true that the statistic calculation is very expensive?

Member

4 spaces indent

Contributor Author

> @fuwhu InMemoryFileIndex caches all the file statuses on construction. Is it true that the statistic calculation is very expensive?

Yeah, you are right. The refresh0 method calls listLeafFiles to get all the file statuses, and that already lists in parallel when the number of paths exceeds a threshold.
So the statistic calculation here is actually just a sum of the lengths of all leaf files. Will remove the conf check here.

But for a Hive table, there is currently no place to get the file/dir status directly without accessing the file system. So I think it is still necessary to limit the partition number for statistic calculation. WDYT?

As for parallel statistic calculation, I am not sure whether it is worthwhile to start a distributed job or multiple threads to do it; the benefit of having the statistics may not cover the cost of computing them. WDYT?
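As discussed above, once the leaf-file statuses are cached, the "statistic calculation" is just a sum of file lengths. A minimal self-contained sketch (`FileStatusLite` is a stand-in for Hadoop's FileStatus, not the real class):

```scala
// Stand-in for org.apache.hadoop.fs.FileStatus; illustrative only.
case class FileStatusLite(path: String, length: Long, isDirectory: Boolean)

// The statistic calculation described above: sum the lengths of the
// leaf files, ignoring directory entries.
def sizeInBytes(leafFiles: Seq[FileStatusLite]): Long =
  leafFiles.filterNot(_.isDirectory).map(_.length).sum
```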

Member

I don't think it is a good idea to limit the number of partitions.

Contributor Author

@fuwhu fuwhu Jan 31, 2020

@gengliangwang So you prefer to do the statistic calculation in parallel in case the partition number exceeds the threshold?
@cloud-fan WDYT?

Member

The file listing is parallel, and the statistic calculation is just summing the file sizes, so there is no need to make the statistic calculation itself parallel.
I updated my comments minutes after I left them.

Contributor Author

@fuwhu fuwhu Jan 31, 2020

> the statistic calculation is just getting the sum of the files

This is true in PruneFileSourcePartitions, because InMemoryFileIndex already did the file listing when the InMemoryFileIndex object was constructed in the CatalogFileIndex.filterPartitions method,
so I already removed the partition number check in PruneFileSourcePartitions.

But it is not true in PruneHiveTablePartitions: the size of each Hive table partition needs to be calculated via HDFS if it is not available in the metadata.

Please correct me if I am wrong, thanks.
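The Hive-side fallback described here could be sketched as follows (all names are hypothetical, not Spark's actual API): partition sizes come from the metastore when present, and only the partitions missing them would need a file-system scan.

```scala
// Hypothetical partition record: a name plus an optional metastore size.
case class HivePartitionLite(name: String, metastoreSize: Option[Long])

// Prefer the metastore-provided size; fall back to a file-system scan
// (represented by `fsScan`) only for partitions that are missing one.
def partitionSizes(parts: Seq[HivePartitionLite])(fsScan: String => Long): Seq[Long] =
  parts.map(p => p.metastoreSize.getOrElse(fsScan(p.name)))
```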

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117616 has started for PR 27129 at commit 3c27224.

if (partitionsWithSize.forall(_._2 > 0)) {
  val sizeInBytes = partitionsWithSize.map(_._2).sum
  tableMeta.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
} else if (partitionsWithSize.count(_._2 == 0) <= conf.maxPartNumForStatsCalculateViaFS) {
Member

why do we need to do the calculation here? I think it introduces extra cost.

Contributor Author

You mean just leave the tableMeta unchanged (which is the table-level meta without partition pruning) if there is at least one partition whose size is not available in the metastore?

Member

yes
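The simpler behavior suggested here could look like this minimal sketch (`TableMetaLite` and the size-zero convention are illustrative, not Spark's CatalogTable): when any partition's size is unknown (here, 0), keep the table-level meta unchanged instead of scanning the file system.

```scala
// Hypothetical stand-in for the table metadata carrying optional stats.
case class TableMetaLite(stats: Option[BigInt])

def updateStats(tableMeta: TableMetaLite, partitionSizes: Seq[Long]): TableMetaLite =
  if (partitionSizes.nonEmpty && partitionSizes.forall(_ > 0))
    tableMeta.copy(stats = Some(BigInt(partitionSizes.sum)))
  else
    tableMeta // at least one size unknown: leave the table-level stats as-is
```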

Contributor Author

@fuwhu fuwhu Jan 31, 2020

Some discussion about this was in #26805 (comment).

@fuwhu
Contributor Author

fuwhu commented Jan 31, 2020

retest this please

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117626 has finished for PR 27129 at commit 3c27224.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Contributor Author

fuwhu commented Jan 31, 2020

retest this please

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117648 has finished for PR 27129 at commit 3c27224.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117665 has finished for PR 27129 at commit 8b14ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (partitionsWithSize.forall(_._2 > 0)) {
  val sizeInBytes = partitionsWithSize.map(_._2).sum
  tableMeta.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
} else if (partitionsWithSize.count(_._2 == 0) <= conf.maxPartNumForStatsCalculateViaFS) {
Member

@fuwhu, are you proposing a configuration to automatically calculate the size? Why don't you just manually run the ANALYZE command to calculate the stats? It's weird to do this based on the number of partitions.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 29, 2020
@github-actions github-actions bot closed this May 30, 2020