[SPARK-17361][SQL] file-based external table without path should not be created #14921
Conversation
We have renamed `CatalogStorageFormat.serdeProperties` to `properties`; this should also be updated.
Test build #64788 has finished for PR 14921 at commit
Force-pushed from d53bf61 to 2533d65.
also cc @srinathshankar
Test build #64819 has finished for PR 14921 at commit
LGTM
```diff
   * path of the table does not exist).
   */
-  def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
+  def resolveRelation(): BaseRelation = {
```
Checked with Wenchen; it is not safe to skip calling `resolveRelation()` when it is a managed table.
For example, if it is a JDBC relation provider, we will call `dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)` to do some extra checks:
```scala
def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
  val caseInsensitiveOptions = new CaseInsensitiveMap(options)
  val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
    // TODO: Throw when too much is given.
    case (dataSource: SchemaRelationProvider, Some(schema)) =>
      dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
    case (dataSource: RelationProvider, None) =>
      dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
    // ...
```
@clockfly Sorry, I did not get your point. What you said above is only for the read path, right? The changes we made here are for the write path.
FYI, today I just updated the write path for the JDBC connection: #14077
@gatorsmile I meant the write path.
When `createRelation()` is called on a `RelationProvider`, the provider may do some extra checks to make sure the options provided are valid. We'd better enforce those checks when trying to create a managed table.
For example, `JdbcRelationProvider` will validate the options:
```scala
class JdbcRelationProvider extends RelationProvider with DataSourceRegister {

  override def shortName(): String = "jdbc"

  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val jdbcOptions = new JDBCOptions(parameters)
    if (jdbcOptions.partitionColumn != null
        && (jdbcOptions.lowerBound == null
          || jdbcOptions.upperBound == null
          || jdbcOptions.numPartitions == null)) {
      sys.error("Partitioning incompletely specified")
    }

    val partitionInfo = if (jdbcOptions.partitionColumn == null) {
      null
    } else {
      JDBCPartitioningInfo(
        jdbcOptions.partitionColumn,
        jdbcOptions.lowerBound.toLong,
        jdbcOptions.upperBound.toLong,
        jdbcOptions.numPartitions.toInt)
    }
    val parts = JDBCRelation.columnPartition(partitionInfo)
    val properties = new Properties() // Additional properties that we will pass to getConnection
    parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
    JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession)
  }
}
```
What I said before is wrong; a managed table still needs to call `resolveRelation` to do some validation, because the data source may not be file-based but something else. From the code:
```scala
def resolveRelation(checkPathExist: Boolean = true): BaseRelation = {
  val caseInsensitiveOptions = new CaseInsensitiveMap(options)
  val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
    // TODO: Throw when too much is given.
    case (dataSource: SchemaRelationProvider, Some(schema)) =>
      dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
    case (dataSource: RelationProvider, None) =>
      dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
    ...
```
`dataSource.createRelation` may do some custom checking, and we can't assume it's useless for managed tables.
When a data source wants to implement a write path (the save API), it needs to extend the trait `CreatableRelationProvider`. That is what my PR #14077 does.
Based on my understanding, `resolveRelation` is not invoked by the write path of non-file-based data sources.
After a discussion with Wenchen: `resolveRelation` will be invoked by `CREATE TABLE ... USING ...`, although the write path in the `DataFrameWriter` APIs does not invoke it. Thanks! @clockfly
To clarify, `RelationProvider` is not only for the read path.
+1
@cloud-fan Can you update the PR title and description to be more user-facing? For example, it is better to use "CREATE TABLE ... USING ..." in the PR title. The current description seems too developer-facing :)
@clockfly Updated. The only external behavior change is what I fixed in this PR: creating a file-based external table without a path will fail.
Test build #64830 has finished for PR 14921 at commit
LGTM again :)
```scala
    assert(e.message.contains("Unable to infer schema"))
  }

  test("createExternalTable should not fail if path is not given but schema is given " +
```
This behaviour is consistent with Hive.
Test build #64888 has finished for PR 14921 at commit
Force-pushed from 43fb72e to 4071bec.
Test build #64962 has finished for PR 14921 at commit
Test build #64961 has finished for PR 14921 at commit
Thanks for the review, merging to master!
What changes were proposed in this pull request?

Using the public `Catalog` API, users can create a file-based data source table without giving the path option. Currently we create such a table successfully, but fail when reading it. Ideally we should fail during creation. This happens because when we create a data source table, we resolve the data source relation without validating the path: `resolveRelation(checkPathExist = false)`.

Looking back at why we added this trick (`checkPathExist`): when we call `resolveRelation` for a managed table, we add the path to the data source options, but the path is not created yet. So why add this not-yet-created path to the data source options? This PR fixes the problem by adding the path to the options after we call `resolveRelation`. Then we can remove the `checkPathExist` parameter in `DataSource.resolveRelation` and do some related cleanups.

How was this patch tested?

Existing tests and a new test in `CatalogSuite`.
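The flow described above can be sketched as follows. This is a hypothetical, much-simplified illustration: the names (`resolveRelation`, `createExternalTable`, `createManagedTable`, the `"path"` option) mirror the discussion, but the bodies are stand-ins, not the actual Spark code. The key point is that validation runs on the user-supplied options, and the managed table's not-yet-created default location is only appended afterwards, so no `checkPathExist` flag is needed.

```scala
object PathHandlingSketch {

  // Stand-in for DataSource.resolveRelation: a file-based source can be
  // resolved if the user gave a schema or a path; with neither, schema
  // inference is impossible, which is the failure this PR surfaces at
  // CREATE TABLE time instead of at read time.
  def resolveRelation(options: Map[String, String], userSchema: Option[String]): Unit =
    if (userSchema.isEmpty && !options.contains("path"))
      throw new IllegalArgumentException("Unable to infer schema for the table")

  // External table: validated with exactly the options the user gave, so
  // a file-based table with neither path nor schema fails at creation.
  def createExternalTable(
      options: Map[String, String],
      schema: Option[String]): Map[String, String] = {
    resolveRelation(options, schema)
    options
  }

  // Managed table: validate first, then append the default (not yet
  // created) location to the stored options.
  def createManagedTable(
      options: Map[String, String],
      schema: Option[String],
      defaultLocation: String): Map[String, String] = {
    resolveRelation(options, schema)
    options + ("path" -> defaultLocation)
  }
}
```

With this ordering, `createExternalTable` with a schema but no path succeeds (matching the new `CatalogSuite` test), while a file-based external table with neither fails immediately.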