[SPARK-28098][SQL] Support reading Hive tables whose leaf directories have multi-level paths #32679
Conversation
What about this option? https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#recursive-file-lookup
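For reference, that option is set on the DataFrame reader; a minimal sketch (the path is illustrative):

```scala
// recursiveFileLookup makes the file-based reader descend into nested
// directories, but it also disables partition inference, which is why
// it does not help for partitioned tables (see the reply below).
val df = spark.read
  .format("parquet")
  .option("recursiveFileLookup", "true")
  .load("/path/to/table")
```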
That option can't read partitioned tables; an exception is thrown directly when it encounters partitioned-table subdirectories.
Force-pushed from 5d282a1 to a984c28
ok to test
```scala
if (rootPaths.contains(path)) {
  path
} else {
  getRootPathsLeafDir(path.getParent)
```
This fails with an NPE if a parquet file is provided directly (instead of a directory).
The fix: lyft@4375b8a
```scala
  }
}
```
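The linked lyft@4375b8a commit isn't reproduced here, but as a rough sketch a null guard along these lines would avoid the NPE (a hypothetical variant, not the PR's actual code):

```scala
// Sketch only: Hadoop's Path.getParent returns null at the filesystem
// root, so the unguarded recursion above eventually NPEs when a path
// never matches rootPaths (e.g. a parquet file passed in directly).
// Falling back to the original path when no root is found avoids that.
private def getRootPathsLeafDir(path: Path): Path = {
  @scala.annotation.tailrec
  def loop(p: Path): Path =
    if (p == null) path                  // no enclosing root: keep path as-is
    else if (rootPaths.contains(p)) p
    else loop(p.getParent)
  loop(path)
}
```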
| test("SPARK-28098 - supporting read partitioned Hive tables with subdirectories") { |
not sure what this is testing for
this case also passes on 2.4 without any patched code
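If the intent is to exercise subdirectory reads, the fixture presumably has to move the data files one level down so that 2.4 actually fails without the patch. A rough sketch (`tablePath` and `fs`, a Hadoop FileSystem, are illustrative names, not from the PR):

```scala
import org.apache.hadoop.fs.Path

// Move the partition's data files into a nested dir, mimicking the
// HIVE_UNION_SUBDIR_* layout Hive-on-Tez produces for UNION ALL.
val partDir = new Path(tablePath, "pt=1")
val subDir = new Path(partDir, "HIVE_UNION_SUBDIR_1")
fs.mkdirs(subDir)
fs.listStatus(partDir).filter(_.isFile).foreach { f =>
  fs.rename(f.getPath, new Path(subDir, f.getPath.getName))
}
```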
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Any chance of this getting picked up again? I saw it was merged in a fork (lyft#40), but it would be great to have it upstream.
But it's not on the official repo (apache/spark)!
I have the same problem: with the Tez engine, UNION ALL writes data into subdirectories such as part_date=xxxx/HIVE_UNION_SUBDIR_1/part_000 (Parquet). When I run a query on this data, Spark cannot read the subdirectory! I have a workaround, but it is not recommended.
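For context, the workaround that usually circulates for this (possibly the "not recommended" one meant here) is to bypass Spark's native Parquet reader and let the Hive input format recurse:

```scala
// Fall back to the Hive SerDe path instead of Spark's native reader
// (losing the native reader's performance optimizations)...
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
// ...and tell the Hadoop input format to recurse into subdirectories.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
```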
```diff
- cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
+ cachedLeafDirToChildrenFiles =
+   if (readPartitionWithSubdirectoryEnabled) {
+     files.toArray.groupBy(file => getRootPathsLeafDir(file.getPath.getParent))
```
I found that partition inference doesn't work here for non-catalog tables. For a table at location /dir/table/pt=1/file, the key here is /dir/table and the value is /dir/table/pt=1/file, so we cannot infer the partition from the key /dir/table.
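To spell out the failure mode (hypothetical paths, following the comment above):

```scala
// files     = [ /dir/table/pt=1/file ]
// rootPaths = [ /dir/table ]   // non-catalog table: the root is the table dir
//
// getRootPathsLeafDir(/dir/table/pt=1) walks up until it hits a root path
// and returns /dir/table, so the grouping becomes:
//   Map(/dir/table -> Array(/dir/table/pt=1/file))
// Partition inference reads the keys, finds no "pt=..." segment in
// /dir/table, and therefore discovers no partitions.
```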
What changes were proposed in this pull request?
This adds support for reading the source files of a partitioned Hive table whose partitions contain subdirectories.
Why are the changes needed?
When the Spark engine reads a partitioned Hive table with subdirectories, the source files inside those subdirectories are not picked up.
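For example, with a layout like the one reported above for Hive-on-Tez UNION ALL output:

```
warehouse/t/part_date=xxxx/HIVE_UNION_SUBDIR_1/part_000   (parquet)
```

the files under HIVE_UNION_SUBDIR_1 are not read.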
Does this PR introduce any user-facing change?
no
How was this patch tested?
A new test was added.