Conversation

@chenghao-intel
Contributor

Always persist the data source relation in a Hive compatible format when possible, and log warnings indicating why this isn't possible otherwise. Currently we only persist non-partitioned HadoopFsRelations with a single input path.

Known issues (will do in the following PRs):

  • Parquet support in Spark SQL is based on Parquet 1.7.0, while Hive ships with 1.3.2, so Hive will hit an exception when reading a Parquet table created by a Spark SQL data source.
  • The JSON and JDBC data sources are not built into Hive, so we don't handle them either.
  • Spark SQL will probably not pick up the change if the user alters the table schema via Hive.
  • The original PR only persists partition column information without adding individual partitions, so persisting partitioned tables actually doesn't work.

In the long term, as @liancheng suggested, we'd better provide a Hive StorageHandler to bridge Spark SQL data sources.
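The persistence rule above boils down to a few simple checks. Below is a minimal, self-contained Scala sketch of that decision; the `Relation` stand-in type and the `persistHiveCompatible`/`persistSparkSqlSpecific` helpers are hypothetical, not the actual code in this PR.

```scala
object PersistDecision {
  // Simplified stand-in for a data source relation; not Spark's real type.
  case class Relation(provider: String, paths: Seq[String], partitionColumns: Seq[String])

  // Providers that map naturally onto a Hive built-in SerDe.
  private val hiveCompatibleProviders = Set("parquet", "orc")

  def persist(r: Relation): Unit =
    if (!hiveCompatibleProviders.contains(r.provider)) {
      println(s"WARN: provider '${r.provider}' has no Hive built-in SerDe; " +
        "persisting in Spark SQL specific format")
      persistSparkSqlSpecific(r)
    } else if (r.partitionColumns.nonEmpty) {
      println("WARN: partitioned tables are not persisted in Hive compatible format")
      persistSparkSqlSpecific(r)
    } else if (r.paths.size != 1) {
      println("WARN: Hive only supports a single input path; " +
        "persisting in Spark SQL specific format")
      persistSparkSqlSpecific(r)
    } else {
      persistHiveCompatible(r)
    }

  private def persistHiveCompatible(r: Relation): Unit =
    println(s"Persisting ${r.provider} table at ${r.paths.head} in Hive compatible format")

  private def persistSparkSqlSpecific(r: Relation): Unit =
    println(s"Persisting ${r.provider} table in Spark SQL specific format")
}
```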

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31117 has finished for PR 5733 at commit 1eebb46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Contributor

What about data types that Hive does not support, like MLlib vectors?

Contributor Author

Yes, that's a good question. Ideally we should throw an exception when we put a schema Hive doesn't support into the Hive metastore, right? Otherwise it will break other applications like Hive or Pig that share the same metastore. That's why I put a TODO below: we probably need to provide a Hive storage handler for the data source API, which would be very helpful for legacy systems.

Contributor

We can't limit ourselves to the lowest common denominator. I would be okay with lying about the schema in this case and printing a warning or something that it might be incompatible with other systems.
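One way to "lie about the schema" along these lines: keep the real schema in the metastore as a table property that only Spark SQL reads, and expose a harmless placeholder column to Hive. A hypothetical Scala sketch follows; the property name and the placeholder column are illustrative assumptions, not necessarily what the final implementation uses.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: the real schema survives as a JSON table property
// (readable by Spark SQL), while Hive only sees a placeholder column.
def metastoreEntry(realSchema: StructType): (Map[String, String], Seq[(String, String)]) = {
  val tableProperties = Map("spark.sql.sources.schema" -> realSchema.json)
  // Placeholder exposed to Hive instead of, e.g., an MLlib vector type.
  val hiveColumns = Seq("col" -> "array<string>")
  (tableProperties, hiveColumns)
}
```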

@liancheng
Contributor

Here are my two cents:

  1. Short term: we simply don't allow data source tables persisted in Hive metastore to be accessible from Hive client API.

  2. Mid term: maybe a Hive storage handler similar to the one for HBase, responsible for translating Spark SQL data source table properties stored as Hive metastore SerDe properties to corresponding Hive concepts.

  3. Long term:

    • Migrate Spark SQL Hive support to a separate external data source
    • Have Spark SQL's own metastore service
    • The storage handler mentioned above can be used to interact with this metastore service

    In short, ideally, Hive is an external data source of Spark SQL via Spark SQL's external data sources API, while Spark SQL can also be viewed as a data source of Hive via Hive's storage handler API.

@chenghao-intel chenghao-intel changed the title [SPARK-6923] [SQL] Hive MetaStore API cannot access Data Sourced table schema correctly [SPARK-6923][SPARK-7550][SQL][WIP] Hive MetaStore API cannot access Data Sourced table schema correctly Jun 15, 2015
@SparkQA

SparkQA commented Jun 15, 2015

Test build #34913 has finished for PR 5733 at commit 16bc38a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Do we really need this configuration key?

Contributor

I don't think this is necessary.

@chenghao-intel chenghao-intel changed the title [SPARK-6923][SPARK-7550][SQL][WIP] Hive MetaStore API cannot access Data Sourced table schema correctly [SPARK-6923][SPARK-7550][SQL] Hive MetaStore API cannot access Data Sourced table schema correctly Jul 17, 2015
Contributor Author

This is probably a reason why we need the configuration key "spark.sql.hive.writeDataSourceSchema".

Contributor

I don't quite get it...

Contributor Author

For example, Hive doesn't support multiple input paths for a data source, but Spark SQL does, and we throw an exception in that case. So people can keep using multi-path data sources by simply disabling this feature.
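For instance, a user hitting the multi-path case could turn the feature off like this (a hypothetical usage sketch; note the key is removed later in this PR's history):

```scala
// Hypothetical usage in a Spark shell, where sqlContext is a HiveContext:
// disable Hive-compatible persistence so multi-path data sources keep working.
sqlContext.setConf("spark.sql.hive.writeDataSourceSchema", "false")
```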

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37598 has finished for PR 5733 at commit 76027a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37614 has finished for PR 5733 at commit 4cb3656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

cc @liancheng

Contributor

After the change, these comments don't make grammatical sense, and I'm not sure what information they are trying to convey.

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37787 has finished for PR 5733 at commit 3b1a39a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ConcatWs(children: Seq[Expression])

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38491 has finished for PR 5733 at commit 6d0e845.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38492 has finished for PR 5733 at commit 07570e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@liancheng don't forget to review this :)

Contributor

This function always persists table metadata into Hive's metastore. But the table is not accessible from Hive unless the underlying data source is either Parquet or ORC.
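The "Parquet or ORC" restriction exists because only those providers map onto Hive built-in SerDes. Here is a small illustrative sketch; the SerDe class names are real Hive classes, but the mapping itself is an illustration, not the PR's code.

```scala
// Data source providers that can be described to Hive with a built-in SerDe.
val hiveSerDes: Map[String, String] = Map(
  "parquet" -> "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
  "orc"     -> "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
)

// Tables backed by any other provider are persisted in a Spark SQL specific
// format that Hive cannot read.
def hiveAccessible(provider: String): Boolean = hiveSerDes.contains(provider)
```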

@liancheng
Contributor

@chenghao-intel Please help bring this PR up to date. Thanks!

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38801 has finished for PR 5733 at commit c4ec806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38805 has finished for PR 5733 at commit e05e119.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@liancheng any more comments?

@liancheng
Contributor

Refactored this PR with chenghao-intel#2. Major changes:

  • Remove spark.sql.hive.writeDataSourceSchema.
  • Always persist the data source relation in Hive compatible format when possible, and give warning logs to indicate why we can't do this.
  • Now we only persist non-partitioned HadoopFsRelations with a single input path. The original PR only persisted partition column information without adding individual partitions, so persisting partitioned tables never actually worked.
  • Refactor test cases.

@liancheng
Contributor

@chenghao-intel Please help update the PR description. Thanks!

@chenghao-intel
Contributor Author

Updated!

Contributor

is this still true?

Contributor

Oh, it's not anymore; I forgot to update the comment in my PR...

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39868 has finished for PR 5733 at commit bc3cd73.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

I'm opening another PR based on this one to fix the build error as well as the comments.

@liancheng
Contributor

@marmbrus Just opened #7967 to supersede this one.

@chenghao-intel
Contributor Author

Thank you @liancheng I am closing this PR.

asfgit pushed a commit that referenced this pull request Aug 6, 2015
…e compatible format when possible

This PR is a fork of PR #5733 authored by chenghao-intel. For the committer who is going to merge this PR, please set the author to "Cheng Hao <hao.cheng@intel.com>".

----

When a data source relation meets the following requirements, we persist it in Hive compatible format, so that other systems like Hive can access it:

1. It's a `HadoopFsRelation`
2. It has only one input path
3. It's non-partitioned
4. Its data source provider can be naturally mapped to a Hive built-in SerDe (e.g. ORC and Parquet)

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #7967 from liancheng/spark-6923/refactoring-pr-5733 and squashes the following commits:

5175ee6 [Cheng Lian] Fixes an oudated comment
3870166 [Cheng Lian] Fixes build error and comments
864acee [Cheng Lian] Refactors PR #5733
3490cdc [Cheng Hao] update the scaladoc
6f57669 [Cheng Hao] write schema info to hivemetastore for data source
asfgit pushed a commit that referenced this pull request Aug 6, 2015
…e compatible format when possible

(cherry picked from commit 119b590)
Signed-off-by: Reynold Xin <rxin@databricks.com>