Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Sep 1, 2016

What changes were proposed in this pull request?

Using the public Catalog API, users can create a file-based data source table without giving the path option. In this case we currently create the table successfully, but fail when we read it. Ideally we should fail during creation.

This happens because when we create a data source table, we resolve the data source relation without validating the path: resolveRelation(checkPathExist = false).

Looking back at why we added this trick (checkPathExist): when we call resolveRelation for a managed table, we add the table path to the data source options, but the path has not been created yet. So why add this not-yet-created path to the data source options at all? This PR fixes the problem by adding the path to the options after calling resolveRelation. We can then remove the checkPathExist parameter from DataSource.resolveRelation and do some related cleanups.
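The ordering change described above can be sketched in plain Scala (no Spark dependency; CreateTableSketch, createManagedTable, and the warehouse parameter are illustrative stand-ins, not the actual Spark internals): validate the user-supplied options first, and only then append the not-yet-created default path to the options that get persisted, so no checkPathExist escape hatch is needed.

```scala
import java.io.File

// Hypothetical sketch of the fixed ordering; the real Spark code differs.
object CreateTableSketch {
  // Stand-in for DataSource.resolveRelation: validate the user-supplied
  // options, including that any path the user gave actually exists.
  def resolveRelation(options: Map[String, String]): Unit =
    options.get("path").foreach { p =>
      require(new File(p).exists(), s"Path does not exist: $p")
    }

  // After the fix: resolve (and validate) first, then add the default
  // warehouse path, which does not exist yet, to the stored options.
  def createManagedTable(
      name: String,
      userOptions: Map[String, String],
      warehouse: String): Map[String, String] = {
    resolveRelation(userOptions) // no checkPathExist flag needed any more
    userOptions + ("path" -> s"$warehouse/$name")
  }
}
```

Because validation runs before the default path is injected, a bogus user-supplied path is rejected at creation time, while a managed table with no path option still resolves successfully.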

How was this patch tested?

Existing tests and a new test in CatalogSuite.

Contributor Author:

We have renamed CatalogStorageFormat.serdeProperties to properties; this should also be updated.

@cloud-fan

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64788 has finished for PR 14921 at commit d53bf61.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

also cc @srinathshankar

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64819 has finished for PR 14921 at commit 2533d65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

LGTM

 * path of the table does not exist).
 */
-  def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
+  def resolveRelation(): BaseRelation = {
Contributor:

Checked with Wenchen, it is not safe to skip calling resolveRelation() when it is a managed table.

Contributor:

For example, if it is a JDBC relation provider, we will call dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions), which performs some extra checks.

  def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
    val caseInsensitiveOptions = new CaseInsensitiveMap(options)
    val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
      // TODO: Throw when too much is given.
      case (dataSource: SchemaRelationProvider, Some(schema)) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
      case (dataSource: RelationProvider, None) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

@gatorsmile gatorsmile (Member) Sep 2, 2016:

@clockfly Sorry, I did not get your point. What you said above is only for the read path, right? The change we made here is for the write path.

Member:

FYI, today I just updated the write path for the JDBC connection: #14077

Contributor:

@gatorsmile I meant the write path.

When createRelation() is called on a RelationProvider, the RelationProvider may do some extra checks to make sure the options provided are valid. We'd better enforce these checks when trying to create a managed table.

For example, JdbcRelationProvider validates the options:

class JdbcRelationProvider extends RelationProvider with DataSourceRegister {

  override def shortName(): String = "jdbc"

  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val jdbcOptions = new JDBCOptions(parameters)
    if (jdbcOptions.partitionColumn != null
      && (jdbcOptions.lowerBound == null
        || jdbcOptions.upperBound == null
        || jdbcOptions.numPartitions == null)) {
      sys.error("Partitioning incompletely specified")
    }

    val partitionInfo = if (jdbcOptions.partitionColumn == null) {
      null
    } else {
      JDBCPartitioningInfo(
        jdbcOptions.partitionColumn,
        jdbcOptions.lowerBound.toLong,
        jdbcOptions.upperBound.toLong,
        jdbcOptions.numPartitions.toInt)
    }
    val parts = JDBCRelation.columnPartition(partitionInfo)
    val properties = new Properties() // Additional properties that we will pass to getConnection
    parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
    JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession)
  }
}

@cloud-fan cloud-fan (Contributor Author) Sep 2, 2016:

What I said before is wrong; a managed table still needs to call resolveRelation to do some validation, because the data source may not be file-based but something else. From the code:

 def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
    val caseInsensitiveOptions = new CaseInsensitiveMap(options)
    val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
      // TODO: Throw when too much is given.
      case (dataSource: SchemaRelationProvider, Some(schema)) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
      case (dataSource: RelationProvider, None) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
...

dataSource.createRelation may do some custom checking, and we can't assume it is useless for a managed table.

Member:

When a data source wants to implement a write path (the save API), it needs to extend the trait CreatableRelationProvider. That is what my PR #14077 does.

Member:

Based on my understanding, resolveRelation is not invoked by the write path of non-file-based data sources.

Member:

After a discussion with Wenchen: resolveRelation will be invoked by CREATE TABLE ... USING ..., although the write path in the DataFrameWriter APIs does not invoke it. Thanks! @clockfly

Member:

To clarify: RelationProvider is not only for the read path.

@clockfly

clockfly commented Sep 2, 2016

+1

@clockfly

clockfly commented Sep 2, 2016

@cloud-fan Can you update the PR title and description to be more user-facing?

For example, it is better to use "CREATE TABLE ... USING ..." in the PR title.
It is also better to list the user-facing behavior changes in the PR description.

The current description seems too developer-facing :)

@cloud-fan cloud-fan changed the title [SPARK-17361][SQL] createExternalTable should fail if path is not given for file-based data source [SPARK-17361][SQL] file-based external table without path should not be created Sep 2, 2016
@cloud-fan

@clockfly Updated. The only external behavior change is what I fixed in this PR: creating a file-based external table without a path will fail.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64830 has finished for PR 14921 at commit eefe3bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

LGTM again : )

assert(e.message.contains("Unable to infer schema"))
}

test("createExternalTable should not fail if path is not given but schema is given " +
Contributor Author:

This behaviour is consistent with Hive.

@SparkQA

SparkQA commented Sep 3, 2016

Test build #64888 has finished for PR 14921 at commit 600ba8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 6, 2016

Test build #64962 has finished for PR 14921 at commit 4071bec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 6, 2016

Test build #64961 has finished for PR 14921 at commit 43fb72e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

Thanks for the review, merging to master!

@asfgit asfgit closed this in c0ae6bc Sep 6, 2016
@cloud-fan cloud-fan deleted the check-path branch December 14, 2016 12:33