[SPARK-16317][SQL] Add a new interface to filter files in FileFormat #14038
Conversation
Test build #61701 has finished for PR 14038 at commit

@liancheng Could you review this after v2.0 is released?
Shall we add either the data source options map or the Hadoop conf as an argument of this method?
For example, the Avro data source may filter out all input files whose file names don't end with ".avro" if the Hadoop conf "avro.mapred.ignore.inputs.without.extension" is set to true. This is consistent with the default behavior of AvroInputFormat.
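For a concrete picture, here is a hypothetical sketch of such a conf-driven filter; `buildPathFilter` and the default value are illustrative assumptions, not the PR's actual API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical sketch: builds an Avro-style filter from the Hadoop conf.
// `buildPathFilter` is an illustrative name, not necessarily the PR's API.
def buildPathFilter(hadoopConf: Configuration): PathFilter = new PathFilter {
  // Mirrors AvroInputFormat's "ignore inputs without extension" switch;
  // the default value used here is an assumption for illustration.
  private val ignoreWithoutExtension =
    hadoopConf.getBoolean("avro.mapred.ignore.inputs.without.extension", false)

  override def accept(path: Path): Boolean =
    !ignoreWithoutExtension || path.getName.endsWith(".avro")
}
```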
okay, I'll fix it now
What are the semantics of the return value of this method? It seems it should never return a null filter, since it defaults to an "accept all" filter. If that's true, it's unnecessary to wrap the returned filters in Option elsewhere in this PR.
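A minimal sketch of that convention, with illustrative names (not the PR's actual code): a non-null accept-all default makes Option wrapping unnecessary:

```scala
import org.apache.hadoop.fs.{Path, PathFilter}

object DefaultFilters {
  // Illustrative default: implementations that don't filter anything return
  // an accept-all filter rather than null or an Option.
  val acceptAll: PathFilter = new PathFilter {
    override def accept(path: Path): Boolean = true
  }
}
```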
yea, my bad. I'll re-check the whole code and remove the Option wrapping.
Left some comments; the overall structure looks good. Thanks!
This can be more concise:

(filter1 ++ filter2).reduceOption { (f1, f2) =>
  (path: Path) => f1.accept(path) && f2.accept(path)
}.getOrElse {
  (path: Path) => true
}
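For reference, the same suggestion spelled out with explicit anonymous classes (so it compiles without SAM conversion), assuming `filter1` and `filter2` are `Option[PathFilter]`; the object and method names are illustrative:

```scala
import org.apache.hadoop.fs.{Path, PathFilter}

object CombineFilters {
  // Combines two optional filters into one; absent filters accept everything.
  def combine(filter1: Option[PathFilter], filter2: Option[PathFilter]): PathFilter =
    (filter1 ++ filter2).reduceOption[PathFilter] { (f1, f2) =>
      new PathFilter {
        override def accept(path: Path): Boolean = f1.accept(path) && f2.accept(path)
      }
    }.getOrElse {
      new PathFilter {
        override def accept(path: Path): Boolean = true
      }
    }
}
```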
Test build #61750 has finished for PR 14038 at commit

@liancheng okay, please re-check.
Maybe just rename it as PathFilter.
okay
Also, it probably makes more sense to move this class into fileSourceInterfaces.scala since it's part of the public interface.
yea, fixed.
cc @rxin

okay, updated.
Test build #61846 has finished for PR 14038 at commit
ping @rxin
Test build #64049 has finished for PR 14038 at commit
Test build #64062 has finished for PR 14038 at commit
Test build #64075 has finished for PR 14038 at commit
ping @rxin @liancheng
Path filtering in Hadoop FS calls on anything other than the filename is very suboptimal; in #14731 you can see where the filtering has been postponed until after the listing, when the full file status is available. As filtering is the last operation in the various listFiles calls, there's no penalty to doing the filtering after the results come in. I think a new filter should be executed after these operations, taking the full file status. Consider also taking a predicate. I note you are making extensive use of …

@steveloughran Thanks for the comment and the good suggestion. It seems you'd be better off opening a new JIRA ticket to discuss this further. BTW, do you know how big an impact the recursion you pointed out has on actual performance? Do you have any performance results for that?
Oh, I don't want to take on any more work... I just think you should make the predicate passed in something that takes the full file status. Regarding speedup, we've seen 20x in simple test trees, but we don't have real data on how representative that is: HADOOP-13208
If my understanding is correct, …
There's no performance problem from filtering just on names. It's when people try to filter on more complex things (file type, timestamp) that they need to call getFileStatus(). I've been through Spark looking at where anything like that is being done, and I have a patch to fix it... I don't want to have to do the same thing again in the future; something we can avoid by having a richer filter which passes the full FileStatus. Now, you may think "why don't Hadoop's list/glob operations take a richer predicate?" That I can't answer; its history is lost in the oldest bits of the code.
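A minimal sketch of the richer filter argued for above, assuming a plain FileStatus-based predicate; the names are illustrative, not an actual Spark or Hadoop API:

```scala
import org.apache.hadoop.fs.FileStatus

object RichFileFilters {
  // A predicate over the full FileStatus: everything needed for filtering on
  // length, modification time, or file type is already in hand after listing,
  // so no extra getFileStatus() round trips are required.
  type StatusFilter = FileStatus => Boolean

  // Example: skip zero-length files and hidden files in one pass.
  val skipEmptyAndHidden: StatusFilter = { status =>
    val name = status.getPath.getName
    status.getLen > 0 && !name.startsWith("_") && !name.startsWith(".")
  }
}
```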
Understood. Though, it seems this is a difficult issue because I'm not 100% sure how rich the filter interface needs to be (at least …).
@liancheng I'm not sure the original motivation of SPARK-16317 is still alive; if I need to do something, please let me know. I made new code based on this PR (master...maropu:SPARK-16317-2) because I found the file listing class had been refactored recently. Thanks!
@maropu if you create a PR for your work, I'll comment on it.

Do we have a strong need for this? Is there anything wrong with just filtering out all file names that start with _ or .?
yea, as for data files, it's okay to filter out '_' and '.'. But the file patterns of metadata depend on the file format, as suggested in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L433
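To illustrate the format dependence (a sketch in the spirit of that linked code, not its exact text): a blanket `_`/`.` rule would also drop Parquet's summary files, so they need an explicit exception:

```scala
object MetadataAwareFilter {
  // Generic hidden-file rule, with Parquet's summary files (`_metadata`,
  // `_common_metadata`) kept as a format-specific exception.
  def shouldFilterOut(pathName: String): Boolean = {
    val hidden = pathName.startsWith("_") || pathName.startsWith(".")
    val parquetSummary = pathName == "_metadata" || pathName == "_common_metadata"
    hidden && !parquetSummary
  }
}
```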
I'd consider adding the full set of invalid files:
p1=2/file=3 -> 1
p1=2/.temp -> 1
Test build #71384 has finished for PR 14038 at commit

Test build #71388 has finished for PR 14038 at commit

Test build #71389 has started for PR 14038 at commit

Jenkins, retest this please.

Test build #71391 has finished for PR 14038 at commit

@liancheng Could you check this?

Test build #71899 has finished for PR 14038 at commit

@liancheng ping

Test build #74927 has finished for PR 14038 at commit

Test build #74964 has finished for PR 14038 at commit

Jenkins, retest this please.

Test build #75132 has finished for PR 14038 at commit
// 2. everything that ends with `._COPYING_`, because this is an intermediate
//    state of a file; we should skip it to avoid double reading.
val name = path.getName
!((name.startsWith("_") && !name.contains("=")) || name.startsWith(".") ||
  name.endsWith("._COPYING_"))
Like @rxin said, this sounds risky to me too.
Could we close this PR first? We can revisit it later.
@gatorsmile ok, I'll close this for now. Thanks!
What changes were proposed in this pull request?

This PR adds an interface for filtering files in `FileFormat`, so that invalid files are not passed into `FileFormat#buildReader`.

How was this patch tested?

Added tests that filter files both in the driver and in parallel.
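For context, a minimal sketch of the kind of hook discussed in this (ultimately closed) PR; the trait and method names are assumptions for illustration, not the actual Spark API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, PathFilter}

// Illustrative trait: a FileFormat-level hook that builds a PathFilter so
// invalid files never reach buildReader. All names here are hypothetical.
trait FileFilterSupport {
  def buildPathFilter(
      options: Map[String, String],
      hadoopConf: Configuration): PathFilter

  // Never return null; the default accepts every file, which is why the
  // review above argues that Option wrapping is unnecessary.
  protected def acceptAll: PathFilter = new PathFilter {
    override def accept(path: Path): Boolean = true
  }
}
```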