
Conversation


@FatalLin FatalLin commented Apr 16, 2021

What changes were proposed in this pull request?

This PR adds support for HiveMetastoreCatalog to read the source files of a non-partitioned Hive table with subdirectories when the new configuration "spark.sql.nonPartitionedTable.subdirectory.read.enabled" is set.
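For illustration, enabling the proposed flag could look like this (the config name comes from this PR and is not a released Spark option):

```properties
# spark-defaults.conf (hypothetical usage of the flag proposed in this PR)
spark.sql.nonPartitionedTable.subdirectory.read.enabled  true
```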

Why are the changes needed?

Hive already has configurations to handle this case, but Spark's built-in reader does not.

Does this PR introduce any user-facing change?

Yes, the new configuration "spark.sql.nonPartitionedTable.subdirectory.read.enabled" has been added; its default value is "false".

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Apr 16, 2021
@FatalLin FatalLin changed the title [SPARK-28098]allow reader could read files from subdirectory for non-partitioned table when configuration is enable [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable Apr 16, 2021
@FatalLin FatalLin closed this Apr 17, 2021
@FatalLin FatalLin reopened this Apr 17, 2021
Comment on lines 287 to 300
Contributor

Why not use the existing SparkHadoopUtil.listLeafDirStatuses(fs, rootPath)?

Author

I didn't notice that function before; I'll switch to it. Thanks.

@attilapiros
Contributor

attilapiros commented Apr 17, 2021

@FatalLin Thanks for your contribution! Welcome here!

This is not my focus area, but I have added some comments. Let me cc some developers more familiar with it:
@dongjoon-hyun @viirya

But of course I can enable the testing for you.

ok to test

@SparkQA

SparkQA commented Apr 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42079/

@SparkQA

SparkQA commented Apr 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42079/

@SparkQA

SparkQA commented Apr 17, 2021

Test build #137505 has finished for PR 32202 at commit eb56adb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FatalLin
Author

I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; I don't think anything I changed could cause it. Does anyone have an idea? Thanks.

@SparkQA

SparkQA commented Apr 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42081/

@SparkQA

SparkQA commented Apr 17, 2021

Test build #137507 has finished for PR 32202 at commit 1818fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@c21 c21 left a comment

> I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; I don't think anything I changed could cause it. Does anyone have an idea? Thanks.

@FatalLin - you can rebase to latest master branch, and this error should go away.

@attilapiros
Contributor

@FatalLin
Some more thoughts/questions:

  1. Why are there two configs in Hive for this?
  • mapred.input.dir.recursive
  • hive.mapred.supports.subdirectories
  What does Hive do when only one of them is true? If both are needed, we need to check both too!

  2. Please update the title: drop the part "when configuration is enable" and reword the rest.
    What about "Supporting non-partitioned Hive tables with subdirectories"?

  3. Please update the description, too. In "What changes were proposed in this pull request?" it's enough if you explain the title a bit more. I suggest using a spell checker to avoid errors like: setted => set, configurtions => configuration.
    Please note the PR description is extremely important, as after the PR is merged it becomes the commit message.

  4. At "Does this PR introduce any user-facing change?", elaborate on the impact of this change. Remove the sentence "maybe we could add this option in documents to notice users for the enhancement."; documenting the option is a good idea, and that documentation should be part of this PR.

@FatalLin
Author

> @FatalLin
> Some more thoughts/questions:
>
> 1. Why are there two configs in Hive for this?
>    • mapred.input.dir.recursive
>    • hive.mapred.supports.subdirectories
>    What does Hive do when only one of them is true? If both are needed, we need to check both too!
> 2. Please update the title: drop the part "when configuration is enable" and reword the rest. What about "Supporting non-partitioned Hive tables with subdirectories"?
> 3. Please update the description, too. In "What changes were proposed in this pull request?" it's enough if you explain the title a bit more. I suggest using a spell checker to avoid errors like: setted => set, configurtions => configuration. Please note the PR description is extremely important, as after the PR is merged it becomes the commit message.
> 4. At "Does this PR introduce any user-facing change?", elaborate on the impact of this change. Remove the sentence "maybe we could add this option in documents to notice users for the enhancement."; documenting the option is a good idea, and that documentation should be part of this PR.

Got it, this is a great help; really appreciated!
I'll address all the points you mentioned.

@FatalLin
Author

FatalLin commented Apr 19, 2021

About the configurations "mapred.input.dir.recursive" and "hive.mapred.supports.subdirectories": I found a brief introduction in the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):

> hive.mapred.supports.subdirectories (Default Value: false; Added In: Hive 0.10.0 with HIVE-3276): Whether the version of Hadoop which is running supports sub-directories for tables/partitions. Many Hive optimizations can be applied if the Hadoop version supports sub-directories for tables/partitions. This support was added by MAPREDUCE-1501.

It looks like "mapred.input.dir.recursive" allows MapReduce to read files from subdirectories, while "hive.mapred.supports.subdirectories" allows Hive to apply some subdirectory-related optimizations. My first thought was that since Hive and MapReduce are separate projects, it makes sense for each to have its own configuration. But in Spark the operation only happens in Spark SQL, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros
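The difference between flat and recursive listing can be sketched in plain Python (a stdlib-only illustration with made-up file names, not the PR's actual Scala code):

```python
import os
import tempfile

def list_flat(root):
    # Non-recursive listing: mimics the default reader behavior, which only
    # sees direct children and fails when it meets a subdirectory.
    entries = [os.path.join(root, name) for name in sorted(os.listdir(root))]
    for path in entries:
        if os.path.isdir(path):
            raise IOError("Not a file: " + path)  # analogous to Spark's error
    return entries

def list_recursive(root):
    # Recursive listing: mimics mapred.input.dir.recursive=true by collecting
    # every leaf file under every subdirectory.
    found = []
    for dirpath, _, filenames in os.walk(root):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(found)

# Toy "table" directory: one file at the top level and one inside a subdir.
root = tempfile.mkdtemp()
open(os.path.join(root, "part-00000"), "w").close()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "sub", "part-00001"), "w").close()

print(len(list_recursive(root)))  # 2: both files are found
try:
    list_flat(root)
except IOError as err:
    print("flat listing failed:", err)
```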

@attilapiros
Contributor

> But in Spark the operation only happens in Spark SQL, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros

The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
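The both-configs check suggested here could be sketched like this (a plain-Python stand-in with a hypothetical helper name, not the PR's actual code):

```python
def subdir_read_enabled(hadoop_conf):
    # Mirror Hive's behavior as discussed above: recursion into
    # subdirectories only kicks in when BOTH flags are set to "true".
    return (hadoop_conf.get("mapred.input.dir.recursive", "false") == "true"
            and hadoop_conf.get("hive.mapred.supports.subdirectories", "false") == "true")

# Only one flag set -> disabled; both set -> enabled.
print(subdir_read_enabled({"mapred.input.dir.recursive": "true"}))  # False
print(subdir_read_enabled({"mapred.input.dir.recursive": "true",
                           "hive.mapred.supports.subdirectories": "true"}))  # True
```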

@FatalLin
Author

> The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.

Got it, I'll check both configs. Thanks!

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137574 has finished for PR 32202 at commit eb56adb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137598 has finished for PR 32202 at commit 1818fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FatalLin
Author

FatalLin commented Apr 19, 2021

> The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
>
> Got it, I'll check both configs. Thanks!

After some consideration (including studying PRs from other devs and rethinking point 4 that @attilapiros mentioned above), I decided to add a new config to replace the Hive configs we mentioned earlier. I'm not sure the config name is proper enough (maybe too long, I guess). As always, any feedback is appreciated!

@SparkQA

SparkQA commented Apr 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42163/

@FatalLin
Author

> I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; does anyone have an idea?
>
> @FatalLin - you can rebase to latest master branch, and this error should go away.

It works, thanks.

@SparkQA

SparkQA commented Apr 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42166/

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137632 has finished for PR 32202 at commit 88afaf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137636 has finished for PR 32202 at commit 72eae96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FatalLin FatalLin changed the title [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories Apr 20, 2021
@attilapiros
Contributor

cc @peter-toth

@SparkQA

SparkQA commented Apr 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42212/

@SparkQA

SparkQA commented Apr 20, 2021

Test build #137684 has finished for PR 32202 at commit 7cc9c95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImmutableBitSet(val numBits: Int, val bitsToSet: Int*) extends BitSet(numBits)
  • case class CombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends TypeCoercionRule
  • case class DomainJoin(domainAttrs: Seq[Attribute], child: LogicalPlan) extends UnaryNode

@chong0929
Contributor

chong0929 commented May 24, 2021

I found the same problem with partitioned Hive tables that contain subdirectories, so why wasn't that handled in this PR?

@FatalLin
Author

> I found the same problem with partitioned Hive tables that contain subdirectories, so why wasn't that handled in this PR?

You mean it will hit the same problem if we trigger the action with the Hive engine instead of the Spark native reader? I thought that could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".

@chong0929
Contributor

> I found the same problem with partitioned Hive tables that contain subdirectories, so why wasn't that handled in this PR?
>
> You mean it will hit the same problem if we trigger the action with the Hive engine instead of the Spark native reader? I thought that could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".

I mean there would be an exception, "java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1", if I use the Spark engine to read a partitioned Hive table with subdirectories.

@HyukjinKwon
Member

@FatalLin
Author

> Can we use https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#recursive-file-lookup?

Looks like this question has been answered in another PR:
#32679 (comment)

@chong0929
Contributor

> I mean there would be an exception, "java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1", if I use the Spark engine to read a partitioned Hive table with subdirectories.

Confirmed: there is no exception, but it cannot read the data in the partitioned table's subdirectories: #32679

@github-actions

github-actions bot commented Sep 9, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

6 participants