[SPARK-28098][SQL] Supporting non-partitioned Hive tables with subdirectories #32202
Conversation
Why not use the existing SparkHadoopUtil.listLeafDirStatuses(fs, rootPath)?
I didn't notice that function before; I'll change it. Thanks.
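For context, here is a minimal sketch of what a listLeafDirStatuses-style helper does — an illustrative approximation of SparkHadoopUtil.listLeafDirStatuses, not the actual Spark code:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Recursively collect the statuses of all leaf directories (directories
// containing no sub-directories) under rootPath. Approximation of
// SparkHadoopUtil.listLeafDirStatuses(fs, rootPath) for illustration only.
def listLeafDirStatuses(fs: FileSystem, rootPath: Path): Seq[FileStatus] = {
  val subDirs = fs.listStatus(rootPath).filter(_.isDirectory)
  if (subDirs.isEmpty) {
    Seq(fs.getFileStatus(rootPath))
  } else {
    subDirs.flatMap(d => listLeafDirStatuses(fs, d.getPath)).toSeq
  }
}
```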
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
@FatalLin Thanks for your contribution, and welcome! This is not my focus area, but I have added some comments, so let's cc some more competent developers. Of course I can enable the testing for you. ok to test
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137505 has finished for PR 32202 at commit
I don't understand why the Notify test workflow always fails with a 404 Not Found exception; I don't think I changed anything that would cause it. Does anyone have an idea? Thanks.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137507 has finished for PR 32202 at commit
@FatalLin - you can rebase onto the latest master branch, and this error should go away.
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
@FatalLin
Got it, that's a great help for me. Really appreciated!
About the configurations "mapred.input.dir.recursive" and "hive.mapred.supports.subdirectories": I found a brief introduction in the Hive documentation:
The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine, I would expect to get the same answers when querying a non-partitioned table with subdirectories.
Got it, I'll check both configs. Thanks!
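For illustration, checking both Hive settings could look something like this — a hypothetical helper, not the PR's actual code:

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: mirror Hive's behavior by enabling recursive
// reading only when both Hive settings are true.
def hiveRecursiveReadEnabled(hadoopConf: Configuration): Boolean = {
  hadoopConf.getBoolean("mapred.input.dir.recursive", false) &&
    hadoopConf.getBoolean("hive.mapred.supports.subdirectories", false)
}
```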
Test build #137574 has finished for PR 32202 at commit
Test build #137598 has finished for PR 32202 at commit
After some consideration (including studying PRs from other developers and rethinking point 4 that @attilapiros mentioned above), I decided to add a new config to replace the Hive configs we mentioned earlier, but I'm not sure the config name is proper enough (maybe too long, I guess). As always, any feedback is appreciated!
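For reference, a sketch of how such a flag is typically declared in SQLConf — the key and default come from this PR, but the builder code here is illustrative, not the PR's actual diff:

```scala
// Illustrative SQLConf entry, assuming the usual buildConf helper that is
// in scope inside org.apache.spark.sql.internal.SQLConf.
val NON_PARTITIONED_TABLE_SUBDIRECTORY_READ_ENABLED =
  buildConf("spark.sql.nonPartitionedTable.subdirectory.read.enabled")
    .doc("When true, read the source files under the subdirectories of a " +
      "non-partitioned Hive table.")
    .booleanConf
    .createWithDefault(false)
```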
…partitioned table when configuration is enable
Kubernetes integration test unable to build dist. exiting with code: 1
It works, thanks.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137632 has finished for PR 32202 at commit
Test build #137636 has finished for PR 32202 at commit
cc @peter-toth
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137684 has finished for PR 32202 at commit
I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't that handled in this change?
You mean it will hit the same problem if we trigger the action with the Hive engine instead of the Spark native reader?
I mean there would be an exception: "java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1" if I use the Spark engine to read a partitioned Hive table with subdirectories.
Looks like this question has been answered in another PR.
Confirmed that there is no exception, but it cannot get the data in the partitioned table's subdirectories: #32679
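To make the failure mode concrete, here is a hypothetical reproduction — the table name and layout are invented, and the exception text is quoted from the comment above:

```scala
// Assume `spark` is a Hive-enabled SparkSession and the partitioned table
// has a nested subdirectory inside a partition directory, e.g.
//   hdfs://ns000/{table_name}/month=02/1/part-00000
// Listing files with the Spark native reader then fails:
spark.sql("SELECT * FROM some_table WHERE month = '02'").show()
// java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1
```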
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
In this PR, I propose a change that allows HiveMetastoreCatalog to read the source files of a non-partitioned Hive table with subdirectories when the new configuration "spark.sql.nonPartitionedTable.subdirectory.read.enabled" is set.
Why are the changes needed?
Hive already has configurations to handle similar cases, but Spark's built-in reader does not.
Does this PR introduce any user-facing change?
Yes, a new configuration "spark.sql.nonPartitionedTable.subdirectory.read.enabled" has been added; its default value is "false".
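For example, a user would opt in like this — a sketch assuming the flag is settable at runtime, with a hypothetical table name:

```scala
// Enable reading files under subdirectories of non-partitioned Hive tables.
spark.conf.set("spark.sql.nonPartitionedTable.subdirectory.read.enabled", "true")

// With the flag on, data files nested in subdirectories of the table
// location are read instead of raising "Not a file: ..." errors.
spark.sql("SELECT COUNT(*) FROM non_partitioned_tbl").show()
```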
How was this patch tested?
New tests.