[SPARK-6923][SPARK-7550][SQL] Hive MetaStore API cannot access Data Sourced table schema correctly #5733
Conversation
Test build #31117 has finished for PR 5733 at commit
What about datatypes that Hive does not support, like MLlib vectors?
Yes, that's a good question. Ideally, we should throw an exception if we need to put a schema that Hive doesn't support into the Hive metastore, right? Otherwise it will always break other applications like Hive or Pig when they share the same metastore. That's why I put a TODO below: we probably need to provide a Hive StorageHandler for the data source API, which would be very helpful for legacy systems.
We can't limit ourselves to the lowest common denominator. I would be okay with lying about the schema in this case and printing a warning that it might be incompatible with other systems.
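As a rough illustration of that compromise, here is a minimal sketch that detects columns backed by Spark-only user-defined types (such as MLlib vectors) and warns instead of failing. The object and method names are hypothetical, not the actual code in this PR:

```scala
import org.apache.spark.sql.types.{StructField, StructType, UserDefinedType}

// A sketch of the "lie about the schema, but warn" idea: detect columns
// whose types Hive has no equivalent for, and log a warning instead of
// refusing to persist the table. Not the implementation in this PR.
object SchemaCompatCheck {
  /** Fields backed by Spark-only user-defined types, which Hive cannot read. */
  def hiveIncompatibleFields(schema: StructType): Seq[StructField] =
    schema.fields.filter(_.dataType.isInstanceOf[UserDefinedType[_]]).toSeq

  def warnIfIncompatible(schema: StructType): Unit = {
    val bad = hiveIncompatibleFields(schema)
    if (bad.nonEmpty) {
      // Persist the table anyway; just flag the potential incompatibility.
      Console.err.println(
        s"WARNING: columns [${bad.map(_.name).mkString(", ")}] use types " +
        "that Hive cannot represent; this table may be unreadable from " +
        "Hive or Pig.")
    }
  }
}
```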
Here are my two cents:
Force-pushed from 1eebb46 to 16bc38a.
Test build #34913 has finished for PR 5733 at commit
Force-pushed from 16bc38a to 76027a5.
Do we really need this configuration key?
I don't think this is necessary.
This is probably a reason why we need the configuration key `spark.sql.hive.writeDataSourceSchema`.
I don't quite get it...
For example, Hive doesn't support multiple input paths for a data source, but Spark SQL does; we will throw an exception in that case. So people can continue processing multi-path data sources by simply disabling this feature.
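For illustration, here is how that escape hatch might have been used. This is a hypothetical sketch: the `spark.sql.hive.writeDataSourceSchema` key was discussed in this thread but dropped before merging, so it is not a real Spark configuration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Hypothetical usage of the escape hatch discussed above; the config key
// below existed only in intermediate revisions of this PR and is NOT a
// real Spark configuration.
val sc: SparkContext = ???  // assumed to be an existing SparkContext
val hiveContext = new HiveContext(sc)

// Disable Hive-compatible schema persistence so a multi-path data source
// can still be saved as a table (Hive itself supports only one path).
hiveContext.setConf("spark.sql.hive.writeDataSourceSchema", "false")

hiveContext.read
  .parquet("/data/events/2015-07-01", "/data/events/2015-07-02")
  .write.saveAsTable("multi_path_events")
```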
Test build #37598 has finished for PR 5733 at commit
Test build #37614 has finished for PR 5733 at commit
Force-pushed from 4cb3656 to 3b1a39a.
cc @liancheng
After the change, these comments don't make grammatical sense, and I'm not sure what information they are trying to convey.
Test build #37787 has finished for PR 5733 at commit
Force-pushed from 3b1a39a to 6d0e845.
Test build #38491 has finished for PR 5733 at commit
Test build #38492 has finished for PR 5733 at commit
@liancheng, don't forget to review this :)
This function always persists table metadata into Hive's metastore. But the table is not accessible from Hive unless the underlying data source is either Parquet or ORC.
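To make that constraint concrete, here is a sketch of the kind of provider-to-SerDe mapping that determines whether Hive can read the persisted table. The mapping is illustrative; the actual implementation in this PR may differ.

```scala
// Sketch: only data source providers that map to a Hive builtin SerDe
// produce tables that Hive itself can read. Metadata for any other
// provider is still persisted, but Hive cannot access the data.
def hiveSerDeFor(provider: String): Option[String] =
  provider.toLowerCase match {
    case "parquet" | "org.apache.spark.sql.parquet" =>
      Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
    case "orc" | "org.apache.spark.sql.hive.orc" =>
      Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde")
    case _ => None
  }
```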
@chenghao-intel Please help bring this PR up to date. Thanks!
Force-pushed from 07570e9 to c4ec806.
Test build #38801 has finished for PR 5733 at commit
Test build #38805 has finished for PR 5733 at commit
@liancheng any more comments?
Refactored this PR with chenghao-intel#2. Major changes:
@chenghao-intel Please help update the PR description. Thanks!
Updated!
Is this still true?
Oh, it's not anymore; I forgot to update the comment in my PR...
Test build #39868 has finished for PR 5733 at commit
I'm opening another PR based on this one to fix the build error as well as the comments.
Thank you @liancheng, I am closing this PR.
…Hive compatible format when possible

This PR is a fork of PR #5733 authored by @chenghao-intel. For the committer who is going to merge this PR, please set the author to "Cheng Hao <hao.cheng@intel.com>".

----

When a data source relation meets the following requirements, we persist it in Hive compatible format, so that other systems like Hive can access it:

1. It's a `HadoopFsRelation`
2. It has only one input path
3. It's non-partitioned
4. Its data source provider can be naturally mapped to a Hive builtin SerDe (e.g. ORC and Parquet)

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #7967 from liancheng/spark-6923/refactoring-pr-5733 and squashes the following commits:

5175ee6 [Cheng Lian] Fixes an outdated comment
3870166 [Cheng Lian] Fixes build error and comments
864acee [Cheng Lian] Refactors PR #5733
3490cdc [Cheng Hao] update the scaladoc
6f57669 [Cheng Hao] write schema info to Hive metastore for data source
(cherry picked from commit 119b590)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Always persist the data source relation in Hive compatible format when possible, and log warnings explaining why when we cannot. Currently we only persist a non-partitioned `HadoopFsRelation` with a single input path; see the sketch after the known issues below.
Known issues (to be addressed in follow-up PRs):
Hive cannot read a Spark SQL Parquet data sourced table. In the long term, as @liancheng suggested, we'd better provide a Hive StorageHandler to bridge the Spark SQL data source.
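Putting the four requirements from the final commit message together, the decision logic can be sketched roughly as follows. `providerHasHiveSerDe` is an illustrative helper, not the actual private API introduced by this PR:

```scala
import org.apache.spark.sql.sources.HadoopFsRelation

// A condensed sketch, under the conditions listed in the commit message
// above, of when a data source relation can be persisted in Hive
// compatible format. Everything else gets metadata-only persistence
// plus a warning log.
def canPersistAsHiveTable(
    relation: Any,
    provider: String,
    partitionColumns: Seq[String]): Boolean = relation match {
  case fs: HadoopFsRelation =>
    fs.paths.length == 1 &&          // single input path
    partitionColumns.isEmpty &&      // non-partitioned
    providerHasHiveSerDe(provider)   // maps to a Hive builtin SerDe
  case _ =>
    false // not a HadoopFsRelation: persist metadata only, warn
}

// Providers assumed here to map naturally to Hive builtin SerDes.
def providerHasHiveSerDe(provider: String): Boolean =
  Set("parquet", "orc").contains(provider.toLowerCase)
```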