
Conversation

@chong0929 (Contributor) commented May 26, 2021

What changes were proposed in this pull request?

This adds support for reading the source files of a partitioned Hive table whose partitions contain subdirectories.

Why are the changes needed?

When using the Spark engine to read a partitioned Hive table with subdirectories, the source files inside the subdirectories cannot be read.
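
A minimal, hypothetical reproduction (paths and table name are illustrative; the layout matches what Hive on Tez writes for UNION ALL queries, as discussed further down in this thread):

    // Partition directories contain an extra level of subdirectories:
    //   /warehouse/db.db/t/part_date=2021-05-26/HIVE_UNION_SUBDIR_1/000000_0
    //   /warehouse/db.db/t/part_date=2021-05-26/HIVE_UNION_SUBDIR_2/000000_0
    // Spark's file index only lists files directly under each partition directory,
    // so the rows stored in the subdirectories are silently skipped:
    val df = spark.sql("SELECT * FROM t WHERE part_date = '2021-05-26'")
    println(df.count())  // 0, even though the subdirectories contain data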

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

@github-actions github-actions bot added the SQL label May 26, 2021

@chong0929 (author):

This cannot support reading partitioned tables; an exception is thrown directly when handling partitioned table subdirectories.

@zhengruifeng (Contributor):

ok to test

    if (rootPaths.contains(path)) {
      path
    } else {
      getRootPathsLeafDir(path.getParent)
    }
  }

Review comment: This fails with NPE if a parquet file is provided directly (instead of a directory). The fix: lyft@4375b8a
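
A hedged sketch of the kind of null guard that would address this (it is not the lyft@4375b8a patch itself; rootPaths is taken as a parameter here only to keep the snippet standalone, whereas in the PR it is a field of the file index):

    import org.apache.hadoop.fs.Path

    // Sketch only: walk up from a leaf directory until one of the configured root
    // paths is reached, and stop at the top of the tree instead of recursing onto
    // a null parent when a root path points at a plain file.
    def getRootPathsLeafDir(rootPaths: Set[Path], path: Path): Path = {
      if (path == null || rootPaths.contains(path) || path.getParent == null) {
        path
      } else {
        getRootPathsLeafDir(rootPaths, path.getParent)
      }
    }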

test("SPARK-28098 - supporting read partitioned Hive tables with subdirectories") {

Review comment: Not sure what this is testing for; this case also passes on 2.4 without any patched code.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 16, 2021
@github-actions github-actions bot closed this Dec 17, 2021

Baisang commented Mar 18, 2022

Any chance of this getting picked up again? I saw it was merged in a fork: lyft#40 but it would be great to have it upstream

@chong0929 chong0929 changed the title from "[SPARK-28098][SQL]Support read partitioned Hive tables with subdirect…" to "[SPARK-28098][SQL]Support read hive table while LeafDir had multi-level paths" Aug 30, 2022

FouadApp commented Nov 4, 2022

> Any chance of this getting picked up again? I saw it was merged in a fork: lyft#40 but it would be great to have it upstream

But it's not on the official repo (apache/spark)!


FouadApp commented Nov 4, 2022

I have the same problem:

With the Tez engine writing data in the presence of UNION ALL:

part_date=xxxx/HIVE_UNION_SUBDIR_1/part_000 (parquet)
part_date=xxxx/HIVE_UNION_SUBDIR_2
part_date=xxxx/HIVE_UNION_SUBDIR_x

When I run a query on this data:
df = spark.sql("select * from table")
df.count() ---> 0

Spark cannot read the subdirectories!

I have a workaround, but it is not recommended (see the sketch after these settings):
spark.conf.set("mapred.input.dir.recursive", "true")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false") # This param is not recommended in Spark

    cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
    cachedLeafDirToChildrenFiles =
      if (readPartitionWithSubdirectoryEnabled) {
        files.toArray.groupBy(file => getRootPathsLeafDir(file.getPath.getParent))

Review comment (Contributor): I found that this cannot infer partitions for a non-catalog table. For a table location /dir/table/pt=1/file, the key here is /dir/table and the value is /dir/table/pt=1/file, so we cannot infer the partition from the key /dir/table.
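
An illustrative sketch of that concern (paths are made up; the second key stands in for what getRootPathsLeafDir would return when the table root is the configured root path):

    import org.apache.hadoop.fs.Path

    // The original grouping key keeps the partition directory, so pt=1 can be parsed:
    //   /dir/table/pt=1 -> Array(/dir/table/pt=1/file)
    // Grouping by the root-path leaf dir collapses the key to the table root, so the
    // partition value no longer appears anywhere in the key:
    //   /dir/table      -> Array(/dir/table/pt=1/file)
    val file = new Path("/dir/table/pt=1/file")
    val originalKey  = file.getParent             // /dir/table/pt=1
    val collapsedKey = file.getParent.getParent   // what the new grouping would use here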
