[SPARK-16975][SQL] Column-partition path starting '_' should be handled correctly #14585
Conversation
cc @rxin

Test build #63556 has finished for PR 14585 at commit

Hi, @rxin.

LGTM, merging to master and branch-2.0. Thanks!

Did this make it into branch-2.0?

@rxin The test code conflicts with branch-2.0, I'm resolving it manually.
[SPARK-16975][SQL] Column-partition path starting '_' should be handled correctly

## What changes were proposed in this pull request?

Currently, Spark ignores path names starting with underscore `_` or dot `.`. This causes read failures for column-partitioned file data sources whose partition column names start with `_`, e.g. `_col`.
**Before**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
```
**After**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```
## How was this patch tested?

Passes Jenkins with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.
(cherry picked from commit abff92b)
Signed-off-by: Cheng Lian <lian@databricks.com>
OK, resolved the conflict manually and got it merged into branch-2.0.
```diff
 val jsonFiles = files.filterNot { status =>
   val name = status.getPath.getName
-  name.startsWith("_") || name.startsWith(".")
+  (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")
 }
```
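For illustration, here is a minimal, self-contained sketch (plain Scala, with made-up file names) comparing the old and new predicates; only the new one keeps the partition directory:

```scala
// Illustrative path names only; the part-r-* name is made up.
val names = Seq("_SUCCESS", "_metadata", ".part-r-00000.crc",
  "_locality_code=0", "part-r-00000-094a8efa.snappy.parquet")

// Old rule: drop everything starting with `_` or `.`.
def oldFilter(name: String): Boolean =
  name.startsWith("_") || name.startsWith(".")

// New rule: a `_`-prefixed name containing `=` is a partition directory, so keep it.
def newFilter(name: String): Boolean =
  (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")

names.filterNot(oldFilter)
// List(part-r-00000-094a8efa.snappy.parquet)        -- partition directory lost
names.filterNot(newFilter)
// List(_locality_code=0, part-r-00000-094a8efa.snappy.parquet)
```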
Hm.. @liancheng @dongjoon-hyun Do you mind if I ask a question, please?

If my understanding is correct, `name` here will be a `part-...` file whether or not the parent directory starts with `_`, so this is unnecessary extra checking. The same check happens in `ParquetFileFormat` as well.

Do you mind if I open a small follow-up to clean those up?
Oh yeah, you're right. Here `files` only contains leaf files, so this check is redundant. Please feel free to clean it up. Thanks!
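For context, a rough sketch of the listing-time filter being referred to; the actual `PartitioningAwareFileIndex.shouldFilterOut` in Spark may differ between versions:

```scala
// Rough sketch in the spirit of PartitioningAwareFileIndex.shouldFilterOut;
// the real Spark implementation may differ between versions.
def shouldFilterOut(pathName: String): Boolean =
  (pathName.startsWith("_") && !pathName.contains("=")) ||
    pathName.startsWith(".")
```

Because this runs during listing, only leaf data files such as `part-r-00000-...` ever reach `FileFormat.inferSchema`, which is why the per-format check above is redundant.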
Thank you for the review and merging, @liancheng and @rxin.
…ata sources implementing FileFormat

## What changes were proposed in this pull request?

This PR cleans up the duplicated checking of file paths in the implemented data sources and prevents the ORC data source from attempting to list files twice.

apache#14585 handles a problem with partition column names starting with `_`, and the issue itself is resolved correctly. However, the data sources implementing `FileFormat` seem to validate the paths redundantly. Judging from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I guess we don't have to check this twice.

Currently, this filtering is already done in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` will always receive leaf files. For example, running the code below:

```scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```

passes only valid data files, with no directories, to `FileFormat.inferSchema`:

```bash
/tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```

## How was this patch tested?

Unit test added in `HadoopFsRelationTest` and related existing tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14627 from HyukjinKwon/SPARK-16975.
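To make the cleanup concrete, here is a hypothetical sketch (illustrative names, not the actual Spark sources) of the duplicated filter that each `FileFormat` can drop once the file index guarantees leaf files:

```scala
import org.apache.hadoop.fs.FileStatus

// Before the follow-up: each FileFormat re-filtered the files it received.
// (Hypothetical helper name; not from the Spark codebase.)
def filterDataFilesBefore(files: Seq[FileStatus]): Seq[FileStatus] =
  files.filterNot { status =>
    val name = status.getPath.getName
    (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")
  }

// After the follow-up: `files` already contains only leaf data files,
// filtered by PartitioningAwareFileIndex during listing, so no re-check is needed.
def filterDataFilesAfter(files: Seq[FileStatus]): Seq[FileStatus] = files
```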