Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Sep 1, 2016

What changes were proposed in this pull request?

Using the public Catalog API, users can create a file-based data source table without giving the path option. In this case we currently create the table successfully, but fail when we read it. Ideally we should fail during creation.

This happens because when we create a data source table, we resolve the data source relation without validating the path: resolveRelation(checkPathExist = false).

Looking back at why we added this trick (checkPathExist): when we call resolveRelation for a managed table, we add the table path to the data source options, but the path has not been created yet. So why add this not-yet-created path to the data source options at all? This PR fixes the problem by adding the path to the options after calling resolveRelation. We can then remove the checkPathExist parameter from DataSource.resolveRelation and do some related cleanups.
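The ordering change described above can be sketched in plain Scala (no Spark dependency; CreateTableSketch, createManagedTable, and the warehouse parameter are illustrative stand-ins, not the actual Spark internals): validate the user-supplied options first, and only then append the not-yet-created default path to the options that get persisted, so no checkPathExist escape hatch is needed.

```scala
import java.io.File

// Hypothetical sketch of the fixed ordering; the real Spark code differs.
object CreateTableSketch {
  // Stand-in for DataSource.resolveRelation: validate the user-supplied
  // options, including that any path the user gave actually exists.
  def resolveRelation(options: Map[String, String]): Unit =
    options.get("path").foreach { p =>
      require(new File(p).exists(), s"Path does not exist: $p")
    }

  // After the fix: resolve (and validate) first, then add the default
  // warehouse path, which does not exist yet, to the stored options.
  def createManagedTable(
      name: String,
      userOptions: Map[String, String],
      warehouse: String): Map[String, String] = {
    resolveRelation(userOptions) // no checkPathExist flag needed any more
    userOptions + ("path" -> s"$warehouse/$name")
  }
}
```

Because validation runs before the default path is injected, a bogus user-supplied path is rejected at creation time, while a managed table with no path option still resolves successfully.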

How was this patch tested?

Existing tests and a new test in CatalogSuite.

Contributor Author:

We have renamed CatalogStorageFormat.serdeProperties to properties; this should also be updated.

@cloud-fan

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64788 has finished for PR 14921 at commit d53bf61.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

also cc @srinathshankar

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64819 has finished for PR 14921 at commit 2533d65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

LGTM

 * path of the table does not exist).
 */
-  def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
+  def resolveRelation(): BaseRelation = {
Contributor:

Checked with Wenchen, it is not safe to skip calling resolveRelation() when it is a managed table.

Contributor:

For example, if it is a JDBC relation provider, we will call dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions), which performs some extra checks.

  def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
    val caseInsensitiveOptions = new CaseInsensitiveMap(options)
    val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
      // TODO: Throw when too much is given.
      case (dataSource: SchemaRelationProvider, Some(schema)) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
      case (dataSource: RelationProvider, None) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

@gatorsmile gatorsmile (Member) Sep 2, 2016:

@clockfly Sorry, I did not get your point. What you said above is only for the read path, right? The change we made here is for the write path.

Member:

FYI, today I just updated the write path for the JDBC connection: #14077

Contributor:

@gatorsmile I meant the write path.

When createRelation() is called on a RelationProvider, the RelationProvider may do some extra checks to make sure the options provided are valid. We'd better enforce these checks when trying to create a managed table.

For example, JdbcRelationProvider validates the options:

class JdbcRelationProvider extends RelationProvider with DataSourceRegister {

  override def shortName(): String = "jdbc"

  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val jdbcOptions = new JDBCOptions(parameters)
    if (jdbcOptions.partitionColumn != null
      && (jdbcOptions.lowerBound == null
        || jdbcOptions.upperBound == null
        || jdbcOptions.numPartitions == null)) {
      sys.error("Partitioning incompletely specified")
    }

    val partitionInfo = if (jdbcOptions.partitionColumn == null) {
      null
    } else {
      JDBCPartitioningInfo(
        jdbcOptions.partitionColumn,
        jdbcOptions.lowerBound.toLong,
        jdbcOptions.upperBound.toLong,
        jdbcOptions.numPartitions.toInt)
    }
    val parts = JDBCRelation.columnPartition(partitionInfo)
    val properties = new Properties() // Additional properties that we will pass to getConnection
    parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
    JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession)
  }
}

@cloud-fan cloud-fan (Contributor Author) Sep 2, 2016:

What I said before is wrong; a managed table still needs to call resolveRelation to do some validation, because the data source may not be file-based but something else. From the code:

 def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
    val caseInsensitiveOptions = new CaseInsensitiveMap(options)
    val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
      // TODO: Throw when too much is given.
      case (dataSource: SchemaRelationProvider, Some(schema)) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
      case (dataSource: RelationProvider, None) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
...

dataSource.createRelation may do some custom checking, and we can't assume it is useless for a managed table.

Member:

When a data source wants to implement a write path (the save API), it needs to extend the trait CreatableRelationProvider. That is what my PR #14077 does.

Member:

Based on my understanding, resolveRelation is not invoked by the write path of non-file-based data sources.

Member:

After a discussion with Wenchen: resolveRelation will be invoked by CREATE TABLE ... USING ..., although the write path in the DataFrameWriter APIs does not invoke it. Thanks! @clockfly

Member:

To clarify: RelationProvider is not only for the read path.

@clockfly

clockfly commented Sep 2, 2016

+1

@clockfly

clockfly commented Sep 2, 2016

@cloud-fan Can you update the PR title and description to be more user-facing?

For example, it is better to use "CREATE TABLE ... USING ..." in the PR title.
It is also better to list the user-facing behavior changes in the PR description.

The current description seems too developer-facing :)

@cloud-fan cloud-fan changed the title [SPARK-17361][SQL] createExternalTable should fail if path is not given for file-based data source [SPARK-17361][SQL] file-based external table without path should not be created Sep 2, 2016
@cloud-fan

@clockfly Updated. The only external behavior change is what I fixed in this PR: creating a file-based external table without a path will fail.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64830 has finished for PR 14921 at commit eefe3bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

LGTM again : )

assert(e.message.contains("Unable to infer schema"))
}

test("createExternalTable should not fail if path is not given but schema is given " +
Contributor Author:

This behaviour is consistent with Hive.

@SparkQA

SparkQA commented Sep 3, 2016

Test build #64888 has finished for PR 14921 at commit 600ba8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 6, 2016

Test build #64962 has finished for PR 14921 at commit 4071bec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 6, 2016

Test build #64961 has finished for PR 14921 at commit 43fb72e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

Thanks for the review, merging to master!

@asfgit asfgit closed this in c0ae6bc Sep 6, 2016
@cloud-fan cloud-fan deleted the check-path branch December 14, 2016 12:33