
Conversation

@chong0929 (Contributor) commented May 26, 2021

What changes were proposed in this pull request?

This adds support for reading the source files of a partitioned Hive table whose partitions contain subdirectories.

Why are the changes needed?

When using the Spark engine to read a partitioned Hive table with subdirectories, the source files inside the subdirectories cannot be read.
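
A minimal, hypothetical reproduction (paths and table name are illustrative; the layout matches what Hive on Tez writes for UNION ALL queries, as discussed further down in this thread):

    // Partition directories contain an extra level of subdirectories:
    //   /warehouse/db.db/t/part_date=2021-05-26/HIVE_UNION_SUBDIR_1/000000_0
    //   /warehouse/db.db/t/part_date=2021-05-26/HIVE_UNION_SUBDIR_2/000000_0
    // Spark's file index only lists files directly under each partition directory,
    // so the rows stored in the subdirectories are silently skipped:
    val df = spark.sql("SELECT * FROM t WHERE part_date = '2021-05-26'")
    println(df.count())  // 0, even though the subdirectories contain data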

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

@github-actions github-actions bot added the SQL label May 26, 2021

@chong0929 (author):

This cannot support reading partitioned tables; an exception is thrown directly when handling partitioned table subdirectories.

@zhengruifeng (Contributor):

ok to test

    if (rootPaths.contains(path)) {
      path
    } else {
      getRootPathsLeafDir(path.getParent)
    }
  }

Review comment: This fails with NPE if a parquet file is provided directly (instead of a directory). The fix: lyft@4375b8a
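
A hedged sketch of the kind of null guard that would address this (it is not the lyft@4375b8a patch itself; rootPaths is taken as a parameter here only to keep the snippet standalone, whereas in the PR it is a field of the file index):

    import org.apache.hadoop.fs.Path

    // Sketch only: walk up from a leaf directory until one of the configured root
    // paths is reached, and stop at the top of the tree instead of recursing onto
    // a null parent when a root path points at a plain file.
    def getRootPathsLeafDir(rootPaths: Set[Path], path: Path): Path = {
      if (path == null || rootPaths.contains(path) || path.getParent == null) {
        path
      } else {
        getRootPathsLeafDir(rootPaths, path.getParent)
      }
    }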

test("SPARK-28098 - supporting read partitioned Hive tables with subdirectories") {

Review comment: Not sure what this is testing for; this case also passes on 2.4 without any patched code.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 16, 2021
@github-actions github-actions bot closed this Dec 17, 2021

Baisang commented Mar 18, 2022

Any chance of this getting picked up again? I saw it was merged in a fork: lyft#40 but it would be great to have it upstream

@chong0929 chong0929 changed the title from "[SPARK-28098][SQL]Support read partitioned Hive tables with subdirect…" to "[SPARK-28098][SQL]Support read hive table while LeafDir had multi-level paths" Aug 30, 2022

FouadApp commented Nov 4, 2022

> Any chance of this getting picked up again? I saw it was merged in a fork: lyft#40 but it would be great to have it upstream

But it's not on the official repo (apache/spark)!


FouadApp commented Nov 4, 2022

I have the same problem:

With the Tez engine writing data in the presence of UNION ALL:

part_date=xxxx/HIVE_UNION_SUBDIR_1/part_000 (parquet)
part_date=xxxx/HIVE_UNION_SUBDIR_2
part_date=xxxx/HIVE_UNION_SUBDIR_x

When I run a query on this data:
df = spark.sql("select * from table")
df.count() ---> 0

Spark cannot read the subdirectories!

I have a workaround, but it is not recommended (see the sketch after these settings):
spark.conf.set("mapred.input.dir.recursive", "true")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false") # This param is not recommended in Spark

    cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
    cachedLeafDirToChildrenFiles =
      if (readPartitionWithSubdirectoryEnabled) {
        files.toArray.groupBy(file => getRootPathsLeafDir(file.getPath.getParent))

Review comment (Contributor): I found that this cannot infer partitions for a non-catalog table. For a table location /dir/table/pt=1/file, the key here is /dir/table and the value is /dir/table/pt=1/file, so we cannot infer the partition from the key /dir/table.
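
An illustrative sketch of that concern (paths are made up; the second key stands in for what getRootPathsLeafDir would return when the table root is the configured root path):

    import org.apache.hadoop.fs.Path

    // The original grouping key keeps the partition directory, so pt=1 can be parsed:
    //   /dir/table/pt=1 -> Array(/dir/table/pt=1/file)
    // Grouping by the root-path leaf dir collapses the key to the table root, so the
    // partition value no longer appears anywhere in the key:
    //   /dir/table      -> Array(/dir/table/pt=1/file)
    val file = new Path("/dir/table/pt=1/file")
    val originalKey  = file.getParent             // /dir/table/pt=1
    val collapsedKey = file.getParent.getParent   // what the new grouping would use here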
