@@ -22,6 +22,8 @@ import java.util.regex.Pattern
 import scala.collection.mutable
 import scala.util.control.NonFatal
 
+import org.apache.hadoop.fs.Path
+
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.TableIdentifier
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want the Hive metastore to implicitly create a table
+          // directory, which may not be the one the data source table refers to,
+          // and which would be left behind when an external table is dropped.
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // If the provided path exists, the Hive metastore only accepts
+              // a directory as the table data location.
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exist yet, it is assumed to be a directory.
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
liancheng (Contributor):
The following line should be enough for locationUri:

locationUri = new CaseInsensitiveMap(options).get("path")

Consider the following directory layout containing two Parquet files:

/tmp/dir/
  part-00001.parquet
  part-00002.parquet

If we pass "/tmp/dir/part-00001.parquet" as the file path, the logic above will use "/tmp/dir/" as the locationUri, so "part-00002.parquet" is also included, which is not the expected behavior.
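
To make the concern concrete, here is a minimal sketch of the widening step in question (plain Scala against hadoop-common; the literal paths are the ones from the layout above):

import org.apache.hadoop.fs.Path

val file = new Path("/tmp/dir/part-00001.parquet")
// The fallback in the patch widens a file path to its parent directory:
val widened = file.getParent.toUri.toString  // "/tmp/dir"
// Any scan rooted at `widened` also picks up part-00002.parquet.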

xwu0226 (Contributor, Author):
@liancheng Thanks! I tried this before, but Hive complained that the path is either not a directory or that it cannot create one at that path. This is why the test cases in MetastoreDataSourcesSuite failed wherever we create a data source (non-Hive-compatible) table with an exact file name. Example:

[info] - CTAS a managed table *** FAILED *** (365 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/home/xwu0226/spark/sql/hive/target/scala-2.11/test-classes/sample.json is not a directory or unable to create one);

I also tried it in the Hive shell:

hive> create external table t_txt1 (c1 int) location '/tmp/test1.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt is not a directory or unable to create one)

So it seems Hive only accepts a directory as a table location. In our case, we need to give Hive a directory via locationUri.

Regarding your concern about a directory containing multiple files: in this case we are on the non-Hive-compatible code path, so do we still expect consistency between Hive and Spark SQL? Querying from Spark SQL will return the expected results, while the results from Hive will differ. But this is already the current behavior for non-Hive-compatible tables.

Contributor:
Hm... Then I think we probably should save the path as a SerDe property (similar to the schema of persisted data source tables). @yhuai What do you think? It breaks existing functionality if we can't read individual files.
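
A minimal sketch of what that proposal could look like (plain Scala; the maps and literal paths are illustrative assumptions, not the actual implementation):

// Illustrative only: keep the exact user-supplied path as a SerDe property,
// while locationUri stays a directory that the Hive metastore will accept.
val userPath = "/tmp/dir/part-00001.parquet"           // hypothetical input
val serdeProperties = Map("path" -> userPath)          // survives the round trip
val locationUri = Some("/tmp/dir")                     // directory for Hive
// On read, prefer the persisted property over the widened locationUri:
val effectivePath = serdeProperties.getOrElse("path", locationUri.get)
assert(effectivePath == userPath)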

xwu0226 (Contributor, Author), May 21, 2016:
The full path is already populated in the SerDe properties along with the rest of the options (see the lines referenced above). SELECT will still work from Spark SQL because HiveMetastoreCatalog.lookupRelation uses cachedDataSourceTables, which loads the persisted data source table with options = CatalogTable.storage.serdeProperties, and DataSource.resolveRelation takes options.get("path") as the relation's file location.
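
A self-contained sketch of the round trip described here, with plain Scala maps standing in for CatalogTable.storage.serdeProperties and the data source options (not the actual Spark internals):

// After the table is persisted, the options map, including the exact "path",
// is stored verbatim as SerDe properties:
val serdeProperties = Map("path" -> "/tmp/dir/part-00001.parquet")

// HiveMetastoreCatalog.lookupRelation hands these back as the data source
// options, and DataSource.resolveRelation reads the file location from them:
val options = serdeProperties
val fileLocation = options.get("path")

// The relation resolves against the exact file, not the widened directory.
assert(fileLocation.contains("/tmp/dir/part-00001.parquet"))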

Contributor:
(For future reference, the above comment is replied to here.)

           inputFormat = None,
           outputFormat = None,
           serde = None,
@@ -1104,4 +1104,24 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
 }
 }
 }

test("SPARK-15269: non-hive compative table") {
withTempPath { dir =>
val path = dir.getCanonicalPath
val df = spark.range(1).toDF()

val hadoopPath = new Path(path, "data")
val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
val qualified = hadoopPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
fs.delete(qualified, true)
df.write.mode(SaveMode.Overwrite).json(qualified.toUri.toString)

withTable("ddl_test1") {
sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '${qualified.toUri.toString}')")
sql("DROP TABLE ddl_test1")
sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 10 AS a")
checkAnswer(sql("select * from ddl_test1"), Seq(Row(10)))
}
}
}
 }