-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources #24518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
Outdated
Show resolved
Hide resolved
|
I think this option also conflicts with Avro's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, @gengliangwang . BTW, can we put matchGlobPattern(f) first like matchGlobPattern(f) && isNonEmptyFile(f) in order to avoid f.getLen more?
|
Test build #105093 has finished for PR 24518 at commit
|
|
cc: @WeichenXu123 |
|
@gengliangwang Could you update binary file user guide and API docs? |
@HyukjinKwon that's true. But I think we have to keep both options effective... |
|
Test build #105121 has finished for PR 24518 at commit
|
|
Test build #105120 has finished for PR 24518 at commit
|
|
retest this please |
|
Can we deprecate |
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
Show resolved
Hide resolved
|
Test build #105148 has finished for PR 24518 at commit
|
|
I think I have addressed all the comments, please review this again. |
| </table> | ||
|
|
||
| To read whole binary files, you need to specify the data source `format` as `binaryFile`. | ||
| For example, the following code reads all PNG files from the input directory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep the pathGlobFilter option in the example? It is actually important for the use case. Just mention pathGlobFilter is a global option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will revert this.
|
Test build #105175 has finished for PR 24518 at commit
|
|
Test build #105179 has finished for PR 24518 at commit
|
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
Show resolved
Hide resolved
|
I am okay with this one. |
|
Test build #105225 has finished for PR 24518 at commit
|
|
Merged to master. |
| * ``timeZone``: sets the string that indicates a timezone to be used to parse timestamps | ||
| in the JSON/CSV datasources or partition values. | ||
| If it isn't set, it uses the default value, session local timezone. | ||
| * ``pathGlobFilter``: an optional glob pattern to only include files with paths matching |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, actually can we move this documentation to each implementation of CSV, Parquet, ORC, text? It will only work with such internal file based sources.
…p' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC ### What changes were proposed in this pull request? This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation. - `recursiveFileLookup` at file sources: #24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627)) - `pathGlobFilter` at file sources: #24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990)) - `mergeSchema` at ORC: #24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412)) **Note that** `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete. ### Why are the changes needed? To document available options in sources properly. ### Does this PR introduce any user-facing change? In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`. ### How was this patch tested? Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only: ```bash $ ls -al tmp ... -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc ``` ```python >>> spark.read.text("tmp", pathGlobFilter="*c").show() ``` ``` +-----+ |value| +-----+ | ac| | cc| +-----+ ``` Closes #26958 from HyukjinKwon/doc-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
| * If the option is not set, the Hadoop's config `avro.mapred.ignore.inputs.without.extension` | ||
| * is taken into account. If the former one is not set too, file extensions are ignored. | ||
| */ | ||
| @deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering whom is this deprecation warning to? Spark users don't use ignoreExtension directly. I do think we should print a warning when we read & detect that AvroFileFormat.IgnoreFilesWithoutExtensionProperty and/or AvroOptions.ignoreExtensionKey are set otherwise users will never see the deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it already shows an warning at https://github.com/apache/spark/pull/24518/files/d8f8420d9d3c97f96c1e09855e008ece3f275ad3#diff-8b28467c7f7a28d7fcf208a613a373c8R61
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only in one case at schema inferring. I would remove this annotation and print warning in initialization of AvroOptions. The deprecation warning is printed only while Spark compilation which is useless for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove deprecated.
It would be great if we can put that logic into AvroOptions e.g.:
parameters
.get(AvroOptions.ignoreExtensionKey)
.map { v =>
logWarning(...)
v.toBoolean
}.getOrElse(!ignoreFilesWithoutExtension)However, can you make it doesn't show the logs too many times? If we put there, seems like it will show the same logs multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can find a better way, please go and open a PR (and some nits I picked below)
|
|
||
| parameters | ||
| .get("ignoreExtension") | ||
| .get(AvroOptions.ignoreExtensionKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignoreExtensionKey -> IGNORE_EXTENTION_KEY to be consistent with other XXXOptions
| options: Map[String, String], | ||
| files: Seq[FileStatus]): Option[StructType] = { | ||
| val conf = spark.sessionState.newHadoopConf() | ||
| if (options.contains("ignoreExtension")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ignoreExtension " -> AvroOptions.ignoreExtensionKey
…or all file sources The data source option `pathGlobFilter` is introduced for Binary file format: apache#24354 , which can be used for filtering file names, e.g. reading `.png` files only while there is `.json` files in the same directory. Make the option `pathGlobFilter` as a general option for all file sources. The path filtering should happen in the path globbing on Driver. Filtering the file path names in file scan tasks on executors is kind of ugly. 1. The splitting of file partitions will be more balanced. 2. The metrics of file scan will be more accurate. 3. Users can use the option for reading other file sources. Unit tests Closes apache#24518 from gengliangwang/globFilter. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Background:
The data source option
pathGlobFilteris introduced for Binary file format: #24354 , which can be used for filtering file names, e.g. reading.pngfiles only while there is.jsonfiles in the same directory.Proposal:
Make the option
pathGlobFilteras a general option for all file sources. The path filtering should happen in the path globbing on Driver.Motivation:
Filtering the file path names in file scan tasks on executors is kind of ugly.
Impact:
How was this patch tested?
Unit tests