[SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism #26461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #113557 has finished for PR 26461 at commit
    .booleanConf
    .createWithDefault(true)

  val HIVE_TABLE_SCAN_MAX_PARALLELISM = buildConf("spark.sql.hive.tableScan.maxParallelism")
is this really useful? The parallelism should depend on data size, and it's a hard job to tune this config.
When reading a Hive partitioned table, users can get an unreasonable number of partitions, like tens of thousands.
The Hive scan node returns a UnionRDD over the Hive table partitions. Each Hive table partition is read as a HadoopRDD, and for each of them the parallelism depends on its data size. But the final UnionRDD sums up the parallelism of all the Hive table partitions.
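To illustrate the mechanism described above, here is a minimal spark-shell sketch (the partition counts are made up; this is not the PR's code):

// Runs in spark-shell, where `sc` is the SparkContext.
// Stand-ins for the per-Hive-partition HadoopRDDs built by the Hive scan node:
// 1000 Hive partitions, each read with 30 splits.
val perPartitionRdds = (1 to 1000).map(_ => sc.parallelize(1 to 10, numSlices = 30))

// The union does not re-plan splits: its partition count is simply the sum of its
// members' partition counts, i.e. 1000 * 30 = 30000 tasks for a single scan.
val unioned = sc.union(perPartitionRdds)
println(unioned.getNumPartitions) // 30000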
is it possible to get the size of each hadoop RDD and do coalesce automatically?
I think we can get the split size and the total number of splits of a Hadoop RDD.
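A rough sketch of what that automatic coalesce could look like, assuming the total input size of the scan is known (the helper name and threshold parameter are hypothetical, not from this PR):

import org.apache.spark.rdd.RDD

// Hypothetical helper: shrink the partition count of the unioned scan RDD when
// it exceeds what the data size justifies.
def coalesceByDataSize[T](rdd: RDD[T], totalBytes: Long, targetBytesPerTask: Long): RDD[T] = {
  val desired = math.max(1L, totalBytes / math.max(1L, targetBytesPerTask)).toInt
  if (rdd.getNumPartitions > desired) {
    rdd.coalesce(desired, shuffle = false) // narrow coalesce, no shuffle
  } else {
    rdd
  }
}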
| "RDD of reading such table is larger than this value, Spark will reduce the partition " + | ||
| "number by doing a coalesce on the RDD.") | ||
| .intConf | ||
| .createOptional |
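Stitching the two diff fragments above together, the proposed config presumably reads roughly as follows (the first part of the doc string is paraphrased, since the diff only shows its tail):

  val HIVE_TABLE_SCAN_MAX_PARALLELISM =
    buildConf("spark.sql.hive.tableScan.maxParallelism")
      .doc("The maximum data parallelism of scanning a Hive partitioned table. If the " +
        "RDD of reading such table is larger than this value, Spark will reduce the partition " +
        "number by doing a coalesce on the RDD.")
      .intConf
      .createOptional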
Do we need a default value?
The reason I leave it optional is to allow us to keep the current behavior.
dongjoon-hyun
left a comment
@viirya. This is another magic number depending on the data and cluster. Why not recommend our general hints instead? Are they not enough?
SELECT /*+ COALESCE(numPartitions) */
SELECT /*+ REPARTITION(numPartitions) */
cc @dbtsai, too.
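For comparison, the hints above (or a Dataset-level coalesce) would be applied per query, e.g. on a hypothetical Hive table:

// SQL hint form, on a made-up partitioned Hive table.
val viaHint = spark.sql("SELECT /*+ COALESCE(200) */ * FROM hive_db.events WHERE dt = '2019-11-10'")

// Equivalent Dataset API form.
val viaApi = spark.table("hive_db.events").where("dt = '2019-11-10'").coalesce(200)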
    }
  }

  test("HiveTableScanExec should not increase data parallelism")
should not increase data parallelism -> should respect HIVE_TABLE_SCAN_MAX_PARALLELISM?
@dongjoon-hyun Thanks for the review. As I mentioned in the description, although end-users can add a coalesce via the Dataset or SQL APIs, the reasons to add this config are:
The current behavior of the Hive scan is not friendly to end-users and cluster operators. We have better behavior on the datasource table scan node, but in some use cases we still need to read Hive tables.
The optimal value for each table is unknown, isn't it? Because of that, this PR doesn't give any clue about a default value for this conf.
I am also considering @cloud-fan's suggestion to use the data size to decide whether to add a coalesce or not.
Well, I think end-users usually do not know why there is a union and why the job has big parallelism. These are implementation details under the Hive scan node. Users need to dig into the source code, or ask cluster operators, in order to know that. End-users know their data and queries, but that does not mean they also know where the big parallelism comes from. In fact, because they know their data and queries, they are even more confused, since the big parallelism makes no sense given that data and those queries. This is a config that can be used by both end-users and cluster operators; before it, cluster operators could not do anything. It is easy to set a config value, but it is hard to insert a hint into end-users' queries.
This sounds like a good point. However, for tables and queries that need tuning, you can still change the config value, or disable it and turn to hints. This config is a guard against the unreasonable number of partitions seen when reading Hive partitioned tables.
Another point: for the datasource table scan node, the parallelism is controlled by configs (spark.default.parallelism and spark.sql.files.maxPartitionBytes), but for the Hive table scan node we have nothing to control it. So currently, when reading a datasource table, end-users do not need to worry about parallelism; when reading a Hive table, end-users need to add custom coalesce hints whenever big parallelism shows up.
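For context, the existing file-source knobs are session-settable (the values below are arbitrary):

// File-source scans are bounded by these existing configs;
// spark.default.parallelism is set on the SparkConf / cluster side.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024) // max bytes per split
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)     // padding per opened file

// The Hive scan node (one HadoopRDD per Hive partition, all unioned) has no such knob,
// which is the gap the proposed config is meant to fill.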
For those performance reasons, Apache Spark already converts Hive tables to data source tables, doesn't it? Do you need this for your non-convertible Hive tables? If so, can we have a more specifically focused PR description?
Thanks. I will update the description.
I agree with the problem mentioned by @viirya, but I'm not sure this config is the right cure. Users still need to know about the big parallelism problem and set the config carefully. The file source config
I'm OK with having a config for the Hive table scan, but we should make it simple to set.
Because the default parallelism affects the split size calculation in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala (lines 86 to 95 at 053dd85).
IIUC, maybe we can also rely on
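Paraphrasing the FilePartition.scala logic referenced above (not an exact quote of lines 86 to 95), the target split size is derived from the default parallelism roughly like this:

// Sketch of the referenced computation; parameter names mirror the configs involved.
def maxSplitBytes(
    defaultMaxSplitBytes: Long, // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long,      // spark.sql.files.openCostInBytes
    defaultParallelism: Int,    // spark.default.parallelism
    fileSizes: Seq[Long]): Long = {
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  // More cores => smaller bytesPerCore => smaller splits => more partitions.
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
}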
SGTM
HyukjinKwon
left a comment
Well, to me, I actually agree with Dongjoon's point. Why don't we just use an explicit coalesce, or hints, for that? There are also some alternatives, like converting the Hive table scan to Spark's own scan.
coalesce does not necessarily make it faster. On the flip side, users might get surprised by a coalesce popping up suddenly.
This sounds fine in general, but IIRC there have been several tries to merge big Hive partitions, if I am not wrong; however, it needed a pretty big change, which I don't think is worthwhile, e.g. #10572.
@HyukjinKwon Thanks for the comment!
We encourage users to convert to datasource tables, but there are inconvertible cases. We have configs for the datasource table scan, but not for Hive tables. That means we expect the datasource scan to have a reasonable partition number, but not the Hive scan. For Hive table users, things get troublesome, as you need to add coalesce/hints for every query. I think the big parallelism gets more attention from end-users and causes more confusion. A big number of partitions also wastes time on task scheduling.
I think this change should not be as big as that one.
For now we will take another approach to this issue. The Hive scan has a few other open questions, like predicate pushdown and schema pruning. Although I think this is a real problem, I may not have time to follow up on it, so I'm closing this for now.
What changes were proposed in this pull request?
The Hive table scan operator reads each Hive partition as a HadoopRDD and unions all the RDDs. The data parallelism of the resulting RDD can increase dramatically when reading many partitions with many files.
This patch proposes to add a config to limit the maximum data parallelism for scanning a Hive partitioned table, for the cases where we cannot convert the Hive table to a datasource table.
Why are the changes needed?
Although users can also do a coalesce by themselves, this patch proposes to add a config to limit the maximum data parallelism, because:
Although we convert a Hive table scan to a datasource table scan most of the time, we still have some inconvertible tables. For the datasource table scan node, the parallelism is controlled by the configs spark.default.parallelism and spark.sql.files.maxPartitionBytes. But for the Hive table scan node, we have nothing to control it.
So currently, when reading a datasource table, end-users do not need to worry about parallelism. When reading a Hive table, end-users need to add custom coalesce hints whenever big parallelism shows up.
Does this PR introduce any user-facing change?
No, if the config is not set.
If a maximum value is set via the config, then when scanning a Hive partitioned table, once the number of partitions exceeds the maximum, Spark coalesces the resulting RDD.
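For example, assuming a heavily partitioned Hive table (the table name and value below are made up):

// With the proposed config set, an oversized Hive scan RDD is coalesced down.
spark.conf.set("spark.sql.hive.tableScan.maxParallelism", 2000L)

val df = spark.table("hive_db.many_partitions_tbl")
// Without the config this could report tens of thousands of partitions;
// with it, at most 2000.
println(df.rdd.getNumPartitions)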
How was this patch tested?