[SPARK-27136][SQL] Remove data source option check_files_exist
## What changes were proposed in this pull request?

The data source option `check_files_exist` was introduced in #23383, when the file source V2 framework was implemented. In that PR, `FileIndex` was made a member of `FileTable` so that partition pruning like 0f9fcab could be implemented later. As a result, a `FileIndex` was always created even for file writes, so the option was needed to decide whether to check file existence.

After #23774, the option is no longer needed, since DataFrame writes no longer create an unnecessary `FileIndex`. This PR removes the option.
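
To make the resulting behavior concrete, here is a minimal hedged sketch (the paths, app name, and local-mode session are made up for illustration): reads always build a `FileIndex` and therefore always verify that input paths exist, while writes skip the check entirely because no `FileIndex` is created.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("check-files-exist-demo")
  .getOrCreate()

// Reads always build a FileIndex, so a missing input path fails eagerly
// at analysis time; this used to be gated on check_files_exist=true.
try {
  spark.read.format("orc").load("/tmp/no-such-dir")
} catch {
  case e: AnalysisException => println(s"Read failed as expected: ${e.getMessage}")
}

// Writes no longer create a FileIndex at all (since #23774), so there is
// nothing left for check_files_exist=false to suppress.
spark.range(10).write.mode("overwrite").orc("/tmp/check-files-exist-demo")
```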

## How was this patch tested?

Unit test.

Closes #24069 from gengliangwang/removeOptionCheckFilesExist.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
gengliangwang authored and cloud-fan committed Mar 15, 2019
1 parent 8819eab commit 6d22ee3
Showing 3 changed files with 5 additions and 9 deletions.
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -213,9 +213,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
         val objectMapper = new ObjectMapper()
         Some("paths" -> objectMapper.writeValueAsString(paths.toArray))
       }
-      // TODO SPARK-27113: remove this option.
-      val checkFilesExistsOpt = "check_files_exist" -> "true"
-      val finalOptions = sessionOptions ++ extraOptions.toMap ++ pathsOption + checkFilesExistsOpt
+
+      val finalOptions = sessionOptions ++ extraOptions.toMap ++ pathsOption
       val dsOptions = new CaseInsensitiveStringMap(finalOptions.asJava)
       val table = userSpecifiedSchema match {
         case Some(schema) => provider.getTable(dsOptions, schema)
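
As a side note on the merge in the hunk above, `++` on Scala maps is right-biased for duplicate keys: user-supplied `extraOptions` override session configs, and the synthesized `paths` entry overrides both. A self-contained sketch with made-up keys (plain Scala collections, no Spark APIs):

```scala
// Right-biased merge: later operands win on duplicate keys.
val sessionOptions = Map("compression" -> "snappy", "mergeSchema" -> "false")
val extraOptions = Map("mergeSchema" -> "true") // user option beats session config
val pathsOption = Some("paths" -> """["/data/a","/data/b"]""")

val finalOptions = sessionOptions ++ extraOptions ++ pathsOption
println(finalOptions)
// Map(compression -> snappy, mergeSchema -> true, paths -> ["/data/a","/data/b"])
```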
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -261,10 +261,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
       val provider = cls.getConstructor().newInstance().asInstanceOf[TableProvider]
       val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
         provider, session.sessionState.conf)
-      // TODO SPARK-27113: remove this option.
-      val checkFilesExistsOption = "check_files_exist" -> "false"
-      val options = sessionOptions ++ extraOptions + checkFilesExistsOption
+      val options = sessionOptions ++ extraOptions
       val dsOptions = new CaseInsensitiveStringMap(options.asJava)
+
@rdblue (Contributor) commented on Mar 15, 2019:

@gengliangwang, @cloud-fan,

Please do not commit unnecessary or non-functional changes like this. This extra blank line caused a merge conflict with #24012 that required me to go fix it by hand.

These careless changes make it harder for all contributors to work on Spark because we have to go fix problems that shouldn't require attention. Please be more careful.

@gengliangwang (Author, Member) commented on Mar 16, 2019:

@rdblue got it, sorry about that.

       provider.getTable(dsOptions) match {
         case table: SupportsBatchWrite =>
           lazy val relation = DataSourceV2Relation.create(table, dsOptions)
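
For context, `CaseInsensitiveStringMap` is the options wrapper handed to `TableProvider.getTable` above. A brief sketch of its lookup behavior, assuming the `org.apache.spark.sql.util` package location and a made-up option key:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

val dsOptions = new CaseInsensitiveStringMap(
  Map("checkFilesExist" -> "true").asJava) // hypothetical key

// Lookups ignore key case, so providers see one canonical spelling
// no matter how the caller wrote the option name.
assert(dsOptions.get("CHECKFILESEXIST") == "true")
assert(dsOptions.get("checkfilesexist") == "true")
```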
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
@@ -36,10 +36,8 @@ abstract class FileTable(
   lazy val fileIndex: PartitioningAwareFileIndex = {
     val scalaMap = options.asScala.toMap
     val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(scalaMap)
-    // This is an internal config so must be present.
-    val checkFilesExist = options.get("check_files_exist").toBoolean
     val rootPathsSpecified = DataSource.checkAndGlobPathIfNecessary(paths, hadoopConf,
-      checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
+      checkEmptyGlobPath = true, checkFilesExist = true)
     val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
     new InMemoryFileIndex(
       sparkSession, rootPathsSpecified, scalaMap, userSpecifiedSchema, fileStatusCache)
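
To make the now-unconditional `checkFilesExist = true` concrete, here is a simplified, hypothetical sketch of what such an existence check does with Hadoop's `FileSystem` API; it is not Spark's actual `DataSource.checkAndGlobPathIfNecessary` implementation:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Qualify each input path, expand globs, and fail fast on anything missing.
def resolveAndCheckPaths(paths: Seq[String], hadoopConf: Configuration): Seq[Path] = {
  paths.flatMap { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(hadoopConf)
    val qualified = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    // globStatus returns null for a literal path that does not exist and
    // an empty array for a glob pattern that matches nothing.
    val matched = Option(fs.globStatus(qualified)).map(_.toSeq).getOrElse(Seq.empty)
    if (matched.isEmpty) {
      throw new IllegalArgumentException(s"Path does not exist: $qualified")
    }
    matched.map(_.getPath)
  }
}
```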
