
Conversation


@FatalLin FatalLin commented Apr 16, 2021

What changes were proposed in this pull request?

This PR adds support for HiveMetastoreCatalog to read the source files of a non-partitioned Hive table with subdirectories when the new configuration "spark.sql.nonPartitionedTable.subdirectory.read.enabled" is set.
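For illustration, enabling the proposed flag could look like this (the config name comes from this PR and is not a released Spark option):

```properties
# spark-defaults.conf (hypothetical usage of the flag proposed in this PR)
spark.sql.nonPartitionedTable.subdirectory.read.enabled  true
```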

Why are the changes needed?

Hive already has configurations to handle this case, but Spark's built-in reader does not.

Does this PR introduce any user-facing change?

Yes, the new configuration "spark.sql.nonPartitionedTable.subdirectory.read.enabled" has been added; its default value is "false".

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Apr 16, 2021
@FatalLin FatalLin changed the title [SPARK-28098]allow reader could read files from subdirectory for non-partitioned table when configuration is enable [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable Apr 16, 2021
@FatalLin FatalLin closed this Apr 17, 2021
@FatalLin FatalLin reopened this Apr 17, 2021
Comment on lines 287 to 300
Contributor

Why not use the existing SparkHadoopUtil.listLeafDirStatuses(fs, rootPath)?

Author

I didn't notice that function before; I'll switch to it. Thanks.

@attilapiros
Contributor

attilapiros commented Apr 17, 2021

@FatalLin Thanks for your contribution! Welcome here!

This is not my focus area, but I have added some comments. Let me cc some developers more familiar with it:
@dongjoon-hyun @viirya

But of course I can enable the testing for you.

ok to test

@SparkQA

SparkQA commented Apr 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42079/

@SparkQA

SparkQA commented Apr 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42079/

@SparkQA

SparkQA commented Apr 17, 2021

Test build #137505 has finished for PR 32202 at commit eb56adb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FatalLin
Author

I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; I don't think anything I changed could cause it. Does anyone have an idea? Thanks.

@SparkQA

SparkQA commented Apr 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42081/

@SparkQA

SparkQA commented Apr 17, 2021

Test build #137507 has finished for PR 32202 at commit 1818fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@c21 c21 left a comment

> I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; I don't think anything I changed could cause it. Does anyone have an idea? Thanks.

@FatalLin - you can rebase to latest master branch, and this error should go away.

@attilapiros
Contributor

@FatalLin
Some more thoughts/questions:

  1. Why are there two configs in Hive for this?
  • mapred.input.dir.recursive
  • hive.mapred.supports.subdirectories
  What does Hive do when only one of them is true? If both are needed, we need to check both too!

  2. Please update the title: drop the part "when configuration is enable" and reword the rest.
    What about "Supporting non-partitioned Hive tables with subdirectories"?

  3. Please update the description, too. In "What changes were proposed in this pull request?" it's enough if you explain the title a bit more. I suggest using a spell checker to avoid errors like: setted => set, configurtions => configuration.
    Please note the PR description is extremely important, as after the PR is merged it becomes the commit message.

  4. At "Does this PR introduce any user-facing change?", elaborate on the impact of this change. Remove the sentence "maybe we could add this option in documents to notice users for the enhancement."; documenting the option is a good idea, and that documentation should be part of this PR.

@FatalLin
Author

> @FatalLin
> Some more thoughts/questions:
>
> 1. Why are there two configs in Hive for this?
>    • mapred.input.dir.recursive
>    • hive.mapred.supports.subdirectories
>    What does Hive do when only one of them is true? If both are needed, we need to check both too!
> 2. Please update the title: drop the part "when configuration is enable" and reword the rest. What about "Supporting non-partitioned Hive tables with subdirectories"?
> 3. Please update the description, too. In "What changes were proposed in this pull request?" it's enough if you explain the title a bit more. I suggest using a spell checker to avoid errors like: setted => set, configurtions => configuration. Please note the PR description is extremely important, as after the PR is merged it becomes the commit message.
> 4. At "Does this PR introduce any user-facing change?", elaborate on the impact of this change. Remove the sentence "maybe we could add this option in documents to notice users for the enhancement."; documenting the option is a good idea, and that documentation should be part of this PR.

Got it, this is a great help; really appreciated!
I'll address all the points you mentioned.

@FatalLin
Author

FatalLin commented Apr 19, 2021

About the configurations "mapred.input.dir.recursive" and "hive.mapred.supports.subdirectories": I found a brief introduction in the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):

> hive.mapred.supports.subdirectories (Default Value: false; Added In: Hive 0.10.0 with HIVE-3276): Whether the version of Hadoop which is running supports sub-directories for tables/partitions. Many Hive optimizations can be applied if the Hadoop version supports sub-directories for tables/partitions. This support was added by MAPREDUCE-1501.

It looks like "mapred.input.dir.recursive" allows MapReduce to read files from subdirectories, while "hive.mapred.supports.subdirectories" allows Hive to apply some subdirectory-related optimizations. My first thought was that since Hive and MapReduce are separate projects, it makes sense for each to have its own configuration. But in Spark the operation only happens in Spark SQL, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros
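The difference between flat and recursive listing can be sketched in plain Python (a stdlib-only illustration with made-up file names, not the PR's actual Scala code):

```python
import os
import tempfile

def list_flat(root):
    # Non-recursive listing: mimics the default reader behavior, which only
    # sees direct children and fails when it meets a subdirectory.
    entries = [os.path.join(root, name) for name in sorted(os.listdir(root))]
    for path in entries:
        if os.path.isdir(path):
            raise IOError("Not a file: " + path)  # analogous to Spark's error
    return entries

def list_recursive(root):
    # Recursive listing: mimics mapred.input.dir.recursive=true by collecting
    # every leaf file under every subdirectory.
    found = []
    for dirpath, _, filenames in os.walk(root):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(found)

# Toy "table" directory: one file at the top level and one inside a subdir.
root = tempfile.mkdtemp()
open(os.path.join(root, "part-00000"), "w").close()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "sub", "part-00001"), "w").close()

print(len(list_recursive(root)))  # 2: both files are found
try:
    list_flat(root)
except IOError as err:
    print("flat listing failed:", err)
```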

@attilapiros
Contributor

> But in Spark the operation only happens in Spark SQL, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros

The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
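The both-configs check suggested here could be sketched like this (a plain-Python stand-in with a hypothetical helper name, not the PR's actual code):

```python
def subdir_read_enabled(hadoop_conf):
    # Mirror Hive's behavior as discussed above: recursion into
    # subdirectories only kicks in when BOTH flags are set to "true".
    return (hadoop_conf.get("mapred.input.dir.recursive", "false") == "true"
            and hadoop_conf.get("hive.mapred.supports.subdirectories", "false") == "true")

# Only one flag set -> disabled; both set -> enabled.
print(subdir_read_enabled({"mapred.input.dir.recursive": "true"}))  # False
print(subdir_read_enabled({"mapred.input.dir.recursive": "true",
                           "hive.mapred.supports.subdirectories": "true"}))  # True
```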

@FatalLin
Author

> The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.

Got it, I'll check both configs. Thanks!

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137574 has finished for PR 32202 at commit eb56adb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137598 has finished for PR 32202 at commit 1818fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FatalLin
Author

FatalLin commented Apr 19, 2021

> The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
>
> Got it, I'll check both configs. Thanks!

After some consideration (including studying PRs from other devs and rethinking point 4 that @attilapiros mentioned above), I decided to add a new config to replace the Hive configs we mentioned earlier. I'm not sure the config name is proper enough (maybe too long, I guess). As always, any feedback is appreciated!

@SparkQA

SparkQA commented Apr 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42163/

@FatalLin
Author

> I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; does anyone have an idea?
>
> @FatalLin - you can rebase to latest master branch, and this error should go away.

It works, thanks.

@SparkQA

SparkQA commented Apr 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42166/

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137632 has finished for PR 32202 at commit 88afaf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 19, 2021

Test build #137636 has finished for PR 32202 at commit 72eae96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@FatalLin FatalLin changed the title [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories Apr 20, 2021
@attilapiros
Contributor

cc @peter-toth

@SparkQA

SparkQA commented Apr 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42212/

@SparkQA

SparkQA commented Apr 20, 2021

Test build #137684 has finished for PR 32202 at commit 7cc9c95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImmutableBitSet(val numBits: Int, val bitsToSet: Int*) extends BitSet(numBits)
  • case class CombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends TypeCoercionRule
  • case class DomainJoin(domainAttrs: Seq[Attribute], child: LogicalPlan) extends UnaryNode

@chong0929
Contributor

chong0929 commented May 24, 2021

I found the same problem with partitioned Hive tables that contain subdirectories, so why wasn't that handled in this PR?

@FatalLin
Author

> I found the same problem with partitioned Hive tables that contain subdirectories, so why wasn't that handled in this PR?

You mean it will hit the same problem if we trigger the action with the Hive engine instead of the Spark native reader? I thought that could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".

@chong0929
Contributor

> I found the same problem with partitioned Hive tables that contain subdirectories, so why wasn't that handled in this PR?
>
> You mean it will hit the same problem if we trigger the action with the Hive engine instead of the Spark native reader? I thought that could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".

I mean there would be an exception, "java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1", if I use the Spark engine to read a partitioned Hive table with subdirectories.

@HyukjinKwon
Member

@FatalLin
Author

> Can we use https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#recursive-file-lookup?

Looks like this question has been answered in another PR:
#32679 (comment)

@chong0929
Contributor

> I mean there would be an exception, "java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1", if I use the Spark engine to read a partitioned Hive table with subdirectories.

Confirmed: there is no exception, but it cannot read the data in the partitioned table's subdirectories: #32679

@github-actions

github-actions bot commented Sep 9, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

6 participants