Conversation

@dongjoon-hyun (Member) commented Aug 10, 2016

What changes were proposed in this pull request?

Currently, Spark ignores path names starting with an underscore (`_`) or a dot (`.`). This causes read failures for column-partitioned file data sources whose partition column names start with `_`, e.g. `_col`.

**Before**

```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
```

**After**

```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```

How was this patch tested?

Passed Jenkins with a new test case.
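
For readers skimming the thread, here is a minimal sketch of the filtering heuristic this PR adopts (the helper names `oldFilterOut`/`newFilterOut` are invented for illustration; the actual one-line change appears in the diff further down): partition directories are named `column=value`, so an underscore-prefixed name that contains `=` is partition data and must be kept, while `_SUCCESS`-style metadata files and dot-files are still skipped.

```scala
// Sketch only, not the exact Spark source: compare the old and new
// predicates on typical path names produced by a partitioned write.
def oldFilterOut(name: String): Boolean =
  name.startsWith("_") || name.startsWith(".")

def newFilterOut(name: String): Boolean =
  (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")

assert(oldFilterOut("_locality_code=3"))  // old: partition dir wrongly dropped
assert(!newFilterOut("_locality_code=3")) // new: partition dir kept
assert(newFilterOut("_SUCCESS"))          // metadata file still skipped
assert(newFilterOut(".part-r-0.crc"))     // hidden file still skipped
```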

@dongjoon-hyun (Member Author)

cc @rxin

@SparkQA commented Aug 10, 2016

Test build #63556 has finished for PR 14585 at commit dd9943c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @rxin.
Could you review this PR?

@liancheng (Contributor)

LGTM, merging to master and branch-2.0. Thanks!

@asfgit asfgit closed this in abff92b Aug 12, 2016
@rxin (Contributor) commented Aug 12, 2016

Did this make it into branch-2.0?

@liancheng (Contributor)

@rxin The test code conflicts with branch-2.0; I'm resolving it manually.

asfgit pushed a commit that referenced this pull request Aug 12, 2016
…ed correctly

Currently, Spark ignores path names starting with an underscore (`_`) or a dot (`.`). This causes read failures for column-partitioned file data sources whose partition column names start with `_`, e.g. `_col`.

**Before**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
```

**After**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```

Passed Jenkins with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.

(cherry picked from commit abff92b)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng (Contributor)

OK, resolved the conflict manually and got it merged into branch-2.0.

```diff
 val jsonFiles = files.filterNot { status =>
   val name = status.getPath.getName
-  name.startsWith("_") || name.startsWith(".")
+  (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")
```
@HyukjinKwon (Member) commented Aug 12, 2016

Hm, @liancheng @dongjoon-hyun, do you mind if I ask a question?

If my understanding is correct, `name` will be a `part-..` file whether or not the parent directory name contains `_`, so this would be an unnecessary extra check. The same check happens in `ParquetFileFormat` as well.

Do you mind if I open a small follow-up to clean up those?
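
To make the question concrete, a small hypothetical illustration (the path below is invented) of why the underscore check never fires for the leaf files handed to `inferSchema`:

```scala
import org.apache.hadoop.fs.Path

// getName returns only the last path component, so for a leaf data file the
// check inspects "part-...", never the "_locality_code=0" directory name.
val leaf = new Path("/tmp/parquet/_locality_code=0/part-r-00000.snappy.parquet")
assert(leaf.getName == "part-r-00000.snappy.parquet")
assert(!leaf.getName.startsWith("_")) // the per-format check is vacuous here
```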

@HyukjinKwon (Member) commented Aug 12, 2016

It might look inconsistent because `OrcFileFormat` has similar checking here and `CSVFileFormat` has similar checking here. If it looks nicer to add the condition just in case, I can make this consistent for ORC and CSV as well.

Contributor

Oh yeah, you're right. Here `files` only contains leaf files, so this check is redundant. Please feel free to clean it up. Thanks!

@dongjoon-hyun (Member Author)

Thank you for reviewing and merging, @liancheng and @rxin.

ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 22, 2016
…ata sources implementing FileFormat

## What changes were proposed in this pull request?

This PR cleans up the duplicated file-path checks in the data source implementations and prevents listing files twice in the ORC data source.

apache#14585 handles the problem of partition column names containing `_`, and the issue itself is resolved correctly. However, the data sources implementing `FileFormat` seem to validate the paths redundantly. Judging from the `// TODO: Move filtering.` comment in `CSVFileFormat`, I believe we do not have to perform this check twice.

   Currently, this filtering seems to happen in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` will always receive leaf files. For example, running the code below:

   ``` scala
   spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
   spark.read.parquet("/tmp/parquet")
   ```

   passes only the valid data file paths below, without any directories:

   ``` bash
   /tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
   /tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
   /tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
   ...
   ```

   to `FileFormat.inferSchema`.

## How was this patch tested?

Unit test added in `HadoopFsRelationTest` and related existing tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14627 from HyukjinKwon/SPARK-16975.
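
As a rough sketch of the two index-level checks the commit message names (assumptions only; the real methods in `PartitioningAwareFileIndex` may differ in detail):

```scala
// Rough sketch, not the exact Spark source. Because listing applies these
// checks up front, FileFormat.inferSchema only ever receives leaf data files,
// which is what makes a second per-format filter redundant.
def shouldFilterOut(pathName: String): Boolean =
  pathName.startsWith(".") ||
    (pathName.startsWith("_") && !pathName.contains("="))

def isDataPath(name: String): Boolean =
  !shouldFilterOut(name)
```
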
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 24, 2016
…ata sources implementing FileFormat

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ata sources implementing FileFormat

@dongjoon-hyun deleted the SPARK-16975-PARQUET branch January 17, 2018 09:41