Conversation

@chenghao-intel
Contributor

Always persist the data source relation in a Hive compatible format when possible, and log warnings indicating why this isn't possible otherwise. Currently we only persist non-partitioned HadoopFsRelations with a single input path.

Known issues (will do in the following PRs):

  • Parquet support in Spark SQL is based on Parquet 1.7.0, while Hive ships with 1.3.2, so Hive will hit an exception when reading a Parquet table created by a Spark SQL data source.
  • The JSON and JDBC data sources are not built into Hive, so we don't handle them either.
  • Spark SQL will probably not pick up the change if the user alters the table schema via Hive.
  • The original PR only persists partition column information without adding individual partitions, so persisting partitioned tables actually doesn't work.

In the long term, as @liancheng suggested, we'd better provide a Hive StorageHandler to bridge Spark SQL data sources.
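The persistence rule above boils down to a few simple checks. Below is a minimal, self-contained Scala sketch of that decision; the `Relation` stand-in type and the `persistHiveCompatible`/`persistSparkSqlSpecific` helpers are hypothetical, not the actual code in this PR.

```scala
object PersistDecision {
  // Simplified stand-in for a data source relation; not Spark's real type.
  case class Relation(provider: String, paths: Seq[String], partitionColumns: Seq[String])

  // Providers that map naturally onto a Hive built-in SerDe.
  private val hiveCompatibleProviders = Set("parquet", "orc")

  def persist(r: Relation): Unit =
    if (!hiveCompatibleProviders.contains(r.provider)) {
      println(s"WARN: provider '${r.provider}' has no Hive built-in SerDe; " +
        "persisting in Spark SQL specific format")
      persistSparkSqlSpecific(r)
    } else if (r.partitionColumns.nonEmpty) {
      println("WARN: partitioned tables are not persisted in Hive compatible format")
      persistSparkSqlSpecific(r)
    } else if (r.paths.size != 1) {
      println("WARN: Hive only supports a single input path; " +
        "persisting in Spark SQL specific format")
      persistSparkSqlSpecific(r)
    } else {
      persistHiveCompatible(r)
    }

  private def persistHiveCompatible(r: Relation): Unit =
    println(s"Persisting ${r.provider} table at ${r.paths.head} in Hive compatible format")

  private def persistSparkSqlSpecific(r: Relation): Unit =
    println(s"Persisting ${r.provider} table in Spark SQL specific format")
}
```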

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31117 has finished for PR 5733 at commit 1eebb46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Contributor

What about data types that Hive does not support, like MLlib vectors?

Contributor Author

Yes, that's a good question. Ideally we should throw an exception when we put a schema Hive doesn't support into the Hive metastore, right? Otherwise it will break other applications like Hive or Pig that share the same metastore. That's why I put a TODO below: we probably need to provide a Hive storage handler for the data source API, which would be very helpful for legacy systems.

Contributor

We can't limit ourselves to the lowest common denominator. I would be okay with lying about the schema in this case and printing a warning or something that it might be incompatible with other systems.
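One way to "lie about the schema" along these lines: keep the real schema in the metastore as a table property that only Spark SQL reads, and expose a harmless placeholder column to Hive. A hypothetical Scala sketch follows; the property name and the placeholder column are illustrative assumptions, not necessarily what the final implementation uses.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: the real schema survives as a JSON table property
// (readable by Spark SQL), while Hive only sees a placeholder column.
def metastoreEntry(realSchema: StructType): (Map[String, String], Seq[(String, String)]) = {
  val tableProperties = Map("spark.sql.sources.schema" -> realSchema.json)
  // Placeholder exposed to Hive instead of, e.g., an MLlib vector type.
  val hiveColumns = Seq("col" -> "array<string>")
  (tableProperties, hiveColumns)
}
```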

@liancheng
Contributor

Here are my two cents:

  1. Short term: we simply don't allow data source tables persisted in Hive metastore to be accessible from Hive client API.

  2. Mid term: maybe a Hive storage handler similar to the one for HBase, responsible for translating Spark SQL data source table properties stored as Hive metastore SerDe properties to corresponding Hive concepts.

  3. Long term:

    • Migrate Spark SQL Hive support to a separate external data source
    • Have Spark SQL's own metastore service
    • The storage handler mentioned above can be used to interact with this metastore service

    In short, ideally, Hive is an external data source of Spark SQL via Spark SQL's external data sources API, while Spark SQL can also be viewed as a data source of Hive via Hive's storage handler API.

@chenghao-intel chenghao-intel changed the title [SPARK-6923] [SQL] Hive MetaStore API cannot access Data Sourced table schema correctly [SPARK-6923][SPARK-7550][SQL][WIP] Hive MetaStore API cannot access Data Sourced table schema correctly Jun 15, 2015
@SparkQA

SparkQA commented Jun 15, 2015

Test build #34913 has finished for PR 5733 at commit 16bc38a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Do we really need this configuration key?

Contributor

I don't think this is necessary.

@chenghao-intel chenghao-intel changed the title [SPARK-6923][SPARK-7550][SQL][WIP] Hive MetaStore API cannot access Data Sourced table schema correctly [SPARK-6923][SPARK-7550][SQL] Hive MetaStore API cannot access Data Sourced table schema correctly Jul 17, 2015
Contributor Author

This is probably a reason why we need the configuration key "spark.sql.hive.writeDataSourceSchema".

Contributor

I don't quite get it...

Contributor Author

For example, Hive doesn't support multiple input paths for a data source, but Spark SQL does, and we throw an exception in that case. So people can keep using multi-path data sources by simply disabling this feature.
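For instance, a user hitting the multi-path case could turn the feature off like this (a hypothetical usage sketch; note the key is removed later in this PR's history):

```scala
// Hypothetical usage in a Spark shell, where sqlContext is a HiveContext:
// disable Hive-compatible persistence so multi-path data sources keep working.
sqlContext.setConf("spark.sql.hive.writeDataSourceSchema", "false")
```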

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37598 has finished for PR 5733 at commit 76027a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37614 has finished for PR 5733 at commit 4cb3656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

cc @liancheng

Contributor

After the change, these comments don't make grammatical sense, and I'm not sure what information they are trying to convey.

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37787 has finished for PR 5733 at commit 3b1a39a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ConcatWs(children: Seq[Expression])

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38491 has finished for PR 5733 at commit 6d0e845.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38492 has finished for PR 5733 at commit 07570e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@liancheng don't forget to review this :)

Contributor

This function always persists table metadata into Hive's metastore. But the table is not accessible from Hive unless the underlying data source is either Parquet or ORC.
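The "Parquet or ORC" restriction exists because only those providers map onto Hive built-in SerDes. Here is a small illustrative sketch; the SerDe class names are real Hive classes, but the mapping itself is an illustration, not the PR's code.

```scala
// Data source providers that can be described to Hive with a built-in SerDe.
val hiveSerDes: Map[String, String] = Map(
  "parquet" -> "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
  "orc"     -> "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
)

// Tables backed by any other provider are persisted in a Spark SQL specific
// format that Hive cannot read.
def hiveAccessible(provider: String): Boolean = hiveSerDes.contains(provider)
```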

@liancheng
Contributor

@chenghao-intel Please help bring this PR up to date. Thanks!

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38801 has finished for PR 5733 at commit c4ec806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38805 has finished for PR 5733 at commit e05e119.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@liancheng any more comments?

@liancheng
Contributor

Refactored this PR with chenghao-intel#2. Major changes:

  • Remove spark.sql.hive.writeDataSourceSchema.
  • Always persist the data source relation in Hive compatible format when possible, and give warning logs to indicate why we can't do this.
  • Now we only persist non-partitioned HadoopFsRelations with a single input path. The original PR only persisted partition column information without adding individual partitions, so persisting partitioned tables never actually worked.
  • Refactor test cases.

@liancheng
Contributor

@chenghao-intel Please help update the PR description. Thanks!

@chenghao-intel
Contributor Author

Updated!

Contributor

is this still true?

Contributor

Oh, it's not anymore; I forgot to update the comment in my PR...

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39868 has finished for PR 5733 at commit bc3cd73.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

I'm opening another PR based on this one to fix the build error as well as the comments.

@liancheng
Contributor

@marmbrus Just opened #7967 to supersede this one.

@chenghao-intel
Contributor Author

Thank you @liancheng I am closing this PR.

asfgit pushed a commit that referenced this pull request Aug 6, 2015
…e compatible format when possible

This PR is a fork of PR #5733 authored by chenghao-intel. For the committer who is going to merge this PR, please set the author to "Cheng Hao <hao.cheng@intel.com>".

----

When a data source relation meets the following requirements, we persist it in Hive compatible format, so that other systems like Hive can access it:

1. It's a `HadoopFsRelation`
2. It has only one input path
3. It's non-partitioned
4. Its data source provider can be naturally mapped to a Hive built-in SerDe (e.g. ORC and Parquet)

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #7967 from liancheng/spark-6923/refactoring-pr-5733 and squashes the following commits:

5175ee6 [Cheng Lian] Fixes an oudated comment
3870166 [Cheng Lian] Fixes build error and comments
864acee [Cheng Lian] Refactors PR #5733
3490cdc [Cheng Hao] update the scaladoc
6f57669 [Cheng Hao] write schema info to hivemetastore for data source
asfgit pushed a commit that referenced this pull request Aug 6, 2015
…e compatible format when possible

(cherry picked from commit 119b590)
Signed-off-by: Reynold Xin <rxin@databricks.com>