[SPARK-16317][SQL] Add a new interface to filter files in FileFormat #14038

maropu · 2016-07-04T02:53:44Z

What changes were proposed in this pull request?

This PR adds an interface for filtering files in FileFormat so that invalid files are not passed into FileFormat#buildReader.

How was this patch tested?

Added tests to filter files in a driver and in parallel.

SparkQA · 2016-07-04T04:35:03Z

Test build #61701 has finished for PR 14038 at commit 6770309.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2016-07-04T04:37:09Z

@liancheng Could you review this after v2.0 is released?

liancheng · 2016-07-05T07:39:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala

Shall we add either the data source options map or the Hadoop conf as an argument of this method?

For example, the Avro data source may filter out all input files whose file names don't end with ".avro" if Hadoop conf "avro.mapred.ignore.inputs.without.extension" is set to true. This is consistent with default behavior of AvroInputFormat.
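A minimal sketch of the Avro case described above, assuming the configuration is available as a plain `Map[String, String]` and that files are filtered by name only. The object name, the `avroNameFilter` helper, and the use of a `String => Boolean` function are illustrative stand-ins, not the interface proposed in this PR; only the conf key comes from the comment.

```scala
object ExtensionFilterSketch {
  // Conf key quoted in the review comment above.
  val IgnoreWithoutExtKey = "avro.mapred.ignore.inputs.without.extension"

  // Hypothetical helper: builds a file-name predicate from a configuration map.
  // When the key is set to "true", only names ending in ".avro" are accepted;
  // otherwise every name is accepted.
  def avroNameFilter(conf: Map[String, String]): String => Boolean = {
    val ignore = conf.get(IgnoreWithoutExtKey).exists(_.trim.equalsIgnoreCase("true"))
    (name: String) => !ignore || name.endsWith(".avro")
  }
}
```

The point of the sketch is only that the filter cannot be built without access to the options or the Hadoop conf, which is why the reviewer asks for one of them as a method argument.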

okay, I'll fix now

What are the semantics of this method's return value? It seems that it should never return a null filter, since it defaults to an "accept all" filter. If this is true, it's unnecessary to use Option to wrap returned filters elsewhere in this PR.

yea, my bad. I'll re-check the whole code to remove Option.

liancheng · 2016-07-05T07:56:37Z

Left some comments, the overall structure looks good. Thanks!

liancheng · 2016-07-05T07:58:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala

This can be more concise:

(filter1 ++ filter2).reduceOption { (f1, f2) => (path: Path) => f1.accept(path) && f2.accept(path) }.getOrElse { (path: Path) => true }
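A self-contained sketch of the same reduce-with-default idea, assuming filters are modelled as plain `String => Boolean` functions rather than the PR's filter class over Hadoop `Path`. The names `CombineFiltersSketch`, `PathPredicate`, and `combine` are hypothetical.

```scala
object CombineFiltersSketch {
  // Hypothetical stand-in for the PR's filter type: a predicate over a path string.
  type PathPredicate = String => Boolean

  // AND together any number of predicates. An empty sequence yields the
  // "accept all" predicate, so callers never need to handle an Option.
  def combine(filters: Seq[PathPredicate]): PathPredicate =
    filters
      .reduceOption { (f1: PathPredicate, f2: PathPredicate) =>
        (path: String) => f1(path) && f2(path)
      }
      .getOrElse((_: String) => true)
}
```

Because `&&` short-circuits, later predicates are skipped as soon as an earlier one rejects the path.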

SparkQA · 2016-07-05T10:53:21Z

Test build #61750 has finished for PR 14038 at commit f032a4e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2016-07-05T10:55:00Z

@liancheng okay, re-check please.

liancheng · 2016-07-06T10:13:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala

Maybe just rename it as PathFilter.

Also, it probably makes more sense to move this class into fileSourceInterfaces.scala since it's part of the public interface.

yea, fixed.

liancheng · 2016-07-06T10:18:33Z

cc @rxin

maropu · 2016-07-06T11:26:56Z

okay, updated.

SparkQA · 2016-07-06T13:20:51Z

Test build #61846 has finished for PR 14038 at commit 5115d26.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class PathFilter extends Serializable

maropu · 2016-07-11T01:59:13Z

ping @rxin

SparkQA · 2016-08-19T08:27:24Z

Test build #64049 has finished for PR 14038 at commit d53ad8e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class PathFilter extends Serializable

SparkQA · 2016-08-19T12:10:58Z

Test build #64062 has finished for PR 14038 at commit 60f05ad.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-19T18:15:39Z

Test build #64075 has finished for PR 14038 at commit c3e046f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2016-08-20T11:45:00Z

ping @rxin @liancheng

steveloughran · 2016-08-20T20:10:59Z

Path filtering in Hadoop FS calls on anything other than the filename is very suboptimal; in #14731 you can see where the filtering has been postponed until after the listing, when the full FileStatus entry list has been returned.

As filtering is the last operation in the various listFiles calls, there's no penalty to doing the filtering after the results come in. In FileSystem.globStatus() the filtering takes place after the glob match, but during the scan...a larger list will be built and returned, but that is all.

I think a new filter should be executed after these operations, taking the FileStatus object. This provides a superset of the filtering possible within the Hadoop calls (timestamp, file type, ...) with no performance penalty. It's more flexible than the simple accept(path), and guarantees that nobody using the API will implement a suboptimal filter.

Consider also taking a predicate FileStatus => Boolean, rather than requiring callers to implement new classes. It can be fed straight into Iterator.filter().
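A rough sketch of post-listing filtering with a predicate over status entries. To stay self-contained it uses a hypothetical `FileEntry` case class as a stand-in for Hadoop's `FileStatus`; the field names, `filterListing`, and `modifiedSince` are assumptions for illustration, not part of this PR or of Hadoop.

```scala
object StatusFilterSketch {
  // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus, carrying the
  // metadata a listing call has already fetched for each entry.
  final case class FileEntry(
      path: String,
      length: Long,
      modificationTime: Long,
      isDirectory: Boolean)

  // Apply the predicate after the listing has returned. No extra per-file
  // lookup is needed because each entry already carries its metadata.
  def filterListing(
      listing: Iterator[FileEntry],
      keep: FileEntry => Boolean): Iterator[FileEntry] =
    listing.filter(keep)

  // Example predicate that a name-only filter could not express:
  // plain files modified at or after a cutoff timestamp.
  def modifiedSince(cutoffMillis: Long): FileEntry => Boolean =
    (entry: FileEntry) => !entry.isDirectory && entry.modificationTime >= cutoffMillis
}
```

A name-only filter would need a separate status lookup per path to make the same decision, which is the cost this comment is trying to avoid.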

I note you are making extensive use of listLeafFiles; that's a potentially inefficient implementation against object stores. Keep using it; I'll patch it to use FileSystem.listFiles(path, true) for in-FS recursion and O(files/5000) listing against S3A in Hadoop 2.8, and eventually Azure and Swift.

maropu · 2016-08-22T14:14:25Z

@steveloughran Thanks for the comment and the good suggestion. It seems you'd be better off opening a new JIRA ticket to discuss this further there. By the way, do you know how much impact the recursion you pointed out has on actual performance? Do you have any performance results for that?

steveloughran · 2016-08-22T15:41:36Z

Oh, I don't want to take on any more work...I just think you should make the predicate passed in something that goes FileStatus => Boolean instead of String => Boolean, and do the filtering after the results come back.

Regarding speedup, we've seen 20x in simple test trees, but don't have real data on how representative that is: HADOOP-13208

maropu · 2016-08-22T15:54:51Z

If my understanding is correct, PathFilter is not passed into FileSystem.listFiles inside ListingFileCatalog#listLeafFiles. Even so, does the performance degradation you pointed out still occur?

steveloughran · 2016-08-22T18:51:28Z

There's no performance problem from filtering just on names. It's when people try to filter on more complex things (file type, timestamp) they need to call getFileStatus(path) and that's the performance problem.

I've been through Spark looking at where anything like that is being done, and have a patch to fix it. I don't want to have to do the same thing again in future; that's something we can avoid by having a richer filter which is passed the FileStatus generated in the listing process.

Now, you may think "why don't Hadoop's list/glob operations take a richer predicate?". That I can't answer; its history is lost in the oldest bits of the code.

maropu · 2016-08-23T08:41:20Z

Understood. This seems to be a difficult issue, though, because I'm not 100% sure how rich the filter interface needs to be (at least timestamp and file type are not used for now when filtering files in ListingFileCatalog#listLeafFiles), and FileStatus adds a Hadoop dependency. What do you think? cc: @rxin @liancheng If it's reasonable, I'll change the interface to FileStatus => Boolean.

maropu · 2016-11-18T13:46:42Z

@liancheng I'm not sure whether the original motivation for SPARK-16317 still holds, but if I need to do something, please let me know. I wrote new code based on this PR (master...maropu:SPARK-16317-2) because I found the file listing class had been refactored recently. Thanks!

steveloughran · 2016-11-18T14:31:09Z

@maropu if you create a PR for your work I'll comment on it

rxin · 2016-11-21T07:07:48Z

Do we have a strong need for this? Anything wrong with just filtering out all file names that start with _ and .?

maropu · 2016-11-21T07:27:57Z

Yes, for data files it's okay to filter out '_' and '.'. However, the file patterns of metadata depend on the file format, as suggested in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L433

steveloughran · 2016-11-28T17:55:59Z

...core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala

I'd consider adding the full set of invalid files:

p1=2/file=3 -> 1
p1=2/.temp -> 1

SparkQA · 2017-01-15T02:30:44Z

Test build #71384 has finished for PR 14038 at commit 4e3628b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

SparkQA · 2017-01-15T05:40:40Z

Test build #71388 has finished for PR 14038 at commit 85b0f61.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

SparkQA · 2017-01-15T06:42:41Z

Test build #71389 has started for PR 14038 at commit d08ff73.

maropu · 2017-01-15T09:14:54Z

Jenkins, retest this please.

SparkQA · 2017-01-15T11:33:04Z

Test build #71391 has finished for PR 14038 at commit d08ff73.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

maropu · 2017-01-15T12:18:54Z

@liancheng Could you check this?

SparkQA · 2017-01-24T04:06:11Z

Test build #71899 has finished for PR 14038 at commit 9284827.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

maropu · 2017-01-24T06:24:13Z

@liancheng ping

maropu · 2017-02-04T03:11:29Z

@liancheng ping

maropu · 2017-02-13T22:39:49Z

@liancheng ping

SparkQA · 2017-03-21T02:40:04Z

Test build #74927 has finished for PR 14038 at commit c89a876.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

SparkQA · 2017-03-21T10:13:17Z

Test build #74964 has finished for PR 14038 at commit b405861.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

maropu · 2017-03-24T01:10:01Z

Jenkins, retest this please.

SparkQA · 2017-03-24T03:13:57Z

Test build #75132 has finished for PR 14038 at commit b405861.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PathFilter extends Serializable
class MetadataLogFileIndex(sparkSession: SparkSession, path: Path, pathFilter: PathFilter)

gatorsmile · 2017-05-23T16:49:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala

+      // 2. everything that ends with `._COPYING_`, because this is a intermediate state of file. we
+      // should skip this file in case of double reading.
+      val name = path.getName
+      !((name.startsWith("_") && !name.contains("=")) || name.startsWith(".") ||


Like @rxin said, this sounds risky to me too.
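For reference, the name-based rule in the quoted diff fragment can be sketched roughly as follows. The fragment is truncated, so this is an approximation: the `._COPYING_` suffix check is taken from the quoted comment, and the exception for `_metadata` and `_common_metadata` is an assumption about format-specific metadata files, not something shown in the fragment. The object and method names are hypothetical.

```scala
object HiddenFileRuleSketch {
  // Returns true when a file name should be skipped during listing.
  def shouldFilterOut(name: String): Boolean = {
    // Names starting with "_" are skipped unless they look like a partition
    // directory (contain "="); names starting with "." and in-flight copies
    // ending in "._COPYING_" are skipped as well.
    val exclude = (name.startsWith("_") && !name.contains("=")) ||
      name.startsWith(".") || name.endsWith("._COPYING_")
    // Assumption: summary metadata files are kept even though they start with "_".
    val include = name.startsWith("_common_metadata") || name.startsWith("_metadata")
    exclude && !include
  }
}
```

The hard-coded exception list is where the format dependence raised earlier in the thread shows up: each file format may have its own metadata naming pattern.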

gatorsmile · 2017-05-23T16:49:58Z

Could we first close this PR? We can revisit it later?

maropu · 2017-05-23T22:02:38Z

@gatorsmile ok, I'll close this for now. Thanks!

maropu closed this Jul 4, 2016

maropu reopened this Jul 4, 2016

liancheng reviewed Jul 5, 2016

liancheng reviewed Jul 6, 2016

maropu force-pushed the SPARK-16317 branch from 5115d26 to d53ad8e Compare August 19, 2016 06:55

maropu force-pushed the SPARK-16317 branch from 60f05ad to c3e046f Compare August 19, 2016 16:12

steveloughran reviewed Nov 28, 2016


maropu force-pushed the SPARK-16317 branch from c3e046f to 4e3628b Compare January 15, 2017 01:04

maropu force-pushed the SPARK-16317 branch from 4e3628b to 85b0f61 Compare January 15, 2017 03:26

maropu force-pushed the SPARK-16317 branch from 85b0f61 to d08ff73 Compare January 15, 2017 06:38

maropu force-pushed the SPARK-16317 branch from d08ff73 to 9284827 Compare January 24, 2017 01:43

maropu force-pushed the SPARK-16317 branch from 9284827 to c89a876 Compare March 21, 2017 02:28

Add a new interface to filter files in FileFormat

b405861

maropu force-pushed the SPARK-16317 branch from c89a876 to b405861 Compare March 21, 2017 08:37

gatorsmile reviewed May 23, 2017


maropu closed this May 23, 2017
