[SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources #24518

gengliangwang · 2019-05-02T23:52:35Z

What changes were proposed in this pull request?

Background:

The data source option pathGlobFilter was introduced for the binary file format in #24354. It can be used to filter file names, e.g. to read only .png files when there are also .json files in the same directory.

Proposal:

Make the option pathGlobFilter a general option for all file sources. The path filtering should happen during path globbing on the driver.
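
As a rough illustration of the proposal (not code from this patch), the sketch below filters a file listing by name before any scan work is planned. The function `filter_paths` and the sample paths are hypothetical, and Python's `fnmatch` is only a stand-in for the Hadoop glob matching that Spark uses, so pattern semantics may differ in details.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def filter_paths(paths, path_glob_filter=None):
    """Keep only paths whose file name matches the glob; keep all if no filter is set."""
    if path_glob_filter is None:
        return list(paths)
    # Match against the file name only, not the full directory path.
    return [p for p in paths if fnmatchcase(PurePosixPath(p).name, path_glob_filter)]


# Hypothetical listing of a directory that mixes image and JSON files.
listing = ["/data/img/a.png", "/data/img/b.json", "/data/img/c.png"]
selected = filter_paths(listing, "*.png")
print(selected)
```

Because the filter runs on the listing itself, everything downstream (partition planning, metrics) only ever sees the selected files.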

Motivation:

Filtering the file path names in file scan tasks on executors is kind of ugly.

Impact:

1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.
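
To see why filtering early can give more balanced partitions, here is a toy model. It is an assumption-laden sketch: `plan_partitions` is a hypothetical greedy packer, far simpler than Spark's real split planning, and the file names and sizes are made up.

```python
from fnmatch import fnmatchcase


def plan_partitions(files, max_bytes):
    """Greedily pack (name, size) pairs into partitions of at most max_bytes each."""
    partitions, current, used = [], [], 0
    for name, size in files:
        if current and used + size > max_bytes:
            partitions.append(current)
            current, used = [], 0
        current.append((name, size))
        used += size
    if current:
        partitions.append(current)
    return partitions


def useful_bytes(partition, pattern):
    """Bytes in a partition that belong to files matching the pattern."""
    return sum(size for name, size in partition if fnmatchcase(name, pattern))


files = [("a.png", 40), ("b.json", 60), ("c.png", 40),
         ("d.png", 40), ("e.json", 60), ("f.png", 40)]

# Filter late: plan over every file, then skip non-matching files inside each task.
late = [useful_bytes(p, "*.png") for p in plan_partitions(files, 100)]

# Filter early: drop non-matching files first, then plan.
early_files = [f for f in files if fnmatchcase(f[0], "*.png")]
early = [useful_bytes(p, "*.png") for p in plan_partitions(early_files, 100)]

print(late, early)
```

In this toy case, late filtering plans three partitions with uneven useful work, while early filtering plans two even ones. The planner also sees four files instead of six, which is the intuition behind the more accurate scan metrics.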

How was this patch tested?

Unit tests

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

docs/sql-migration-guide-upgrade.md

HyukjinKwon · 2019-05-03T00:36:33Z

I think this option also conflicts with Avro's ignoreExtension

dongjoon-hyun · 2019-05-03T00:37:15Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

Nice, @gengliangwang. BTW, can we put matchGlobPattern(f) first, as in matchGlobPattern(f) && isNonEmptyFile(f), so that f.getLen is called less often?
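
The effect of that ordering can be sketched with short-circuit evaluation. Everything here is a hypothetical stand-in: `FakeFileStatus`, `match_glob_pattern` and `is_non_empty_file` only mirror the names in the comment above and are not Spark or Hadoop code.

```python
from fnmatch import fnmatchcase


class FakeFileStatus:
    """Hypothetical stand-in for a file status that counts length lookups."""
    len_calls = 0

    def __init__(self, name, length):
        self.name = name
        self._length = length

    def get_len(self):
        FakeFileStatus.len_calls += 1
        return self._length


def match_glob_pattern(f, pattern="*.png"):
    return fnmatchcase(f.name, pattern)


def is_non_empty_file(f):
    return f.get_len() > 0


files = [FakeFileStatus("a.png", 10), FakeFileStatus("b.json", 20), FakeFileStatus("c.png", 0)]

# Glob check first: get_len is skipped for files the pattern already rejects.
kept = [f.name for f in files if match_glob_pattern(f) and is_non_empty_file(f)]
calls_glob_first = FakeFileStatus.len_calls

# Length check first: get_len runs for every file.
FakeFileStatus.len_calls = 0
kept_other = [f.name for f in files if is_non_empty_file(f) and match_glob_pattern(f)]
calls_len_first = FakeFileStatus.len_calls

print(kept, calls_glob_first, calls_len_first)
```

Both orderings keep the same files; only the number of length lookups differs.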

SparkQA · 2019-05-03T02:47:13Z

Test build #105093 has finished for PR 24518 at commit c7bf17b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2019-05-03T03:52:27Z

cc: @WeichenXu123

mengxr · 2019-05-03T03:54:06Z

@gengliangwang Could you update the binary file user guide and API docs?

gengliangwang · 2019-05-04T05:09:42Z

> I think this option also conflicts with Avro's ignoreExtension

@HyukjinKwon that's true. But I think we have to keep both options effective...

SparkQA · 2019-05-04T07:05:01Z

Test build #105121 has finished for PR 24518 at commit 53e9b6b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-04T07:05:01Z

Test build #105120 has finished for PR 24518 at commit a005b0b.

This patch fails due to an unknown error code, -9.
This patch does not merge cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-05-06T11:24:32Z

retest this please

HyukjinKwon · 2019-05-06T11:33:21Z

Can we deprecate ignoreExtension? We can just note somewhere that the option is deprecated in favor of pathGlobFilter, and that users should set pathGlobFilter to match .avro files instead.

sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala

SparkQA · 2019-05-06T14:16:15Z

Test build #105148 has finished for PR 24518 at commit 53e9b6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-05-06T23:31:26Z

I think I have addressed all the comments. Please review this again.
@gatorsmile @HyukjinKwon @mengxr @dongjoon-hyun @WeichenXu123

mengxr · 2019-05-06T23:58:21Z

docs/sql-data-sources-binaryFile.md

-</table>
-
 To read whole binary files, you need to specify the data source `format` as `binaryFile`.
-For example, the following code reads all PNG files from the input directory:


Can we keep the pathGlobFilter option in the example? It is actually important for the use case. Just mention pathGlobFilter is a global option.

Sure, I will revert this.

SparkQA · 2019-05-07T02:42:14Z

Test build #105175 has finished for PR 24518 at commit dcab4f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-07T03:47:26Z

Test build #105179 has finished for PR 24518 at commit ddf1874.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala

python/pyspark/sql/readwriter.py

HyukjinKwon · 2019-05-07T12:37:23Z

I am okay with this one.

SparkQA · 2019-05-07T21:20:56Z

Test build #105225 has finished for PR 24518 at commit d8f8420.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-05-08T23:41:02Z

Merged to master.

HyukjinKwon · 2019-11-30T04:24:50Z

python/pyspark/sql/readwriter.py

            * ``timeZone``: sets the string that indicates a timezone to be used to parse timestamps
                in the JSON/CSV datasources or partition values.
                If it isn't set, it uses the default value, session local timezone.
+            * ``pathGlobFilter``: an optional glob pattern to only include files with paths matching


Sorry, actually can we move this documentation to each implementation of CSV, Parquet, ORC, text? It will only work with such internal file based sources.

…p' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC

### What changes were proposed in this pull request?

This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation.

- `recursiveFileLookup` at file sources: #24830 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `pathGlobFilter` at file sources: #24518 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `mergeSchema` at ORC: #24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))

**Note that** `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete.

### Why are the changes needed?

To document available options in sources properly.

### Does this PR introduce any user-facing change?

In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.

### How was this patch tested?

Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only:

```bash
$ ls -al tmp
...
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc
```

```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```

```
+-----+
|value|
+-----+
|   ac|
|   cc|
+-----+
```

Closes #26958 from HyukjinKwon/doc-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

MaxGekk · 2020-01-07T19:18:26Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala

   * If the option is not set, the Hadoop's config `avro.mapred.ignore.inputs.without.extension`
   * is taken into account. If the former one is not set too, file extensions are ignored.
   */
+  @deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")


I am wondering who this deprecation warning is for. Spark users don't use ignoreExtension directly. I do think we should print a warning when we read and detect that AvroFileFormat.IgnoreFilesWithoutExtensionProperty and/or AvroOptions.ignoreExtensionKey are set; otherwise users will never see the deprecation.

I think it already shows a warning at https://github.com/apache/spark/pull/24518/files/d8f8420d9d3c97f96c1e09855e008ece3f275ad3#diff-8b28467c7f7a28d7fcf208a613a373c8R61

Only in one case, during schema inference. I would remove this annotation and print a warning when AvroOptions is initialized. The deprecation warning is printed only when Spark itself is compiled, which is useless for users.

I think we should remove deprecated.

It would be great if we can put that logic into AvroOptions e.g.:

    parameters
      .get(AvroOptions.ignoreExtensionKey)
      .map { v =>
        logWarning(...)
        v.toBoolean
      }.getOrElse(!ignoreFilesWithoutExtension)

However, can you make sure it doesn't show the log too many times? If we put it there, it seems it will show the same log multiple times.

If you can find a better way, please go and open a PR (and some nits I picked below)

HyukjinKwon · 2020-01-08T07:23:38Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala


    parameters
-      .get("ignoreExtension")
+      .get(AvroOptions.ignoreExtensionKey)


ignoreExtensionKey -> IGNORE_EXTENTION_KEY to be consistent with other XXXOptions

HyukjinKwon · 2020-01-08T07:23:54Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala

      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    val conf = spark.sessionState.newHadoopConf()
+    if (options.contains("ignoreExtension")) {


"ignoreExtension" -> AvroOptions.ignoreExtensionKey

…or all file sources

The data source option `pathGlobFilter` is introduced for Binary file format: apache#24354 , which can be used for filtering file names, e.g. reading `.png` files only while there is `.json` files in the same directory.

Make the option `pathGlobFilter` as a general option for all file sources. The path filtering should happen in the path globbing on Driver.

Filtering the file path names in file scan tasks on executors is kind of ugly.

1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.

Unit tests

Closes apache#24518 from gengliangwang/globFilter.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

HyukjinKwon reviewed May 3, 2019

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

HyukjinKwon reviewed May 3, 2019

docs/sql-migration-guide-upgrade.md

dongjoon-hyun reviewed May 3, 2019

globFilter

53e9b6b

gengliangwang force-pushed the globFilter branch from a005b0b to 53e9b6b Compare May 4, 2019 05:31

gengliangwang changed the title ~~[SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources~~ [WIP][SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources May 4, 2019

HyukjinKwon reviewed May 6, 2019

sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala

gengliangwang added 3 commits May 6, 2019 13:41

update python comments

5aac13c

update sql-data-sources-load-save-functions.md and example code

a308f3f

deprecated Avro option: ignoreExtension

dcab4f9

gengliangwang changed the title ~~[WIP][SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources~~ [SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources May 6, 2019

remove one empty line

4d2c6e1

mengxr requested changes May 6, 2019

revise

ddf1874

mengxr approved these changes May 7, 2019

HyukjinKwon reviewed May 7, 2019

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala

HyukjinKwon reviewed May 7, 2019

python/pyspark/sql/readwriter.py

HyukjinKwon reviewed May 7, 2019

python/pyspark/sql/readwriter.py

gengliangwang added 2 commits May 7, 2019 11:01

address comment

5d678e2

address comment

d8f8420

HyukjinKwon closed this in 78a403f May 8, 2019

gengliangwang mentioned this pull request Jul 5, 2019

[SPARK-28218][SQL] Migrate Avro to File Data Source V2 #25017

Closed

HyukjinKwon reviewed Nov 30, 2019

HyukjinKwon mentioned this pull request Dec 20, 2019

[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC #26958

Closed

MaxGekk reviewed Jan 7, 2020

HyukjinKwon reviewed Jan 8, 2020

octo-sts bot mentioned this pull request Dec 14, 2024

Updated spark with scala and python wolfi-dev/os#36997

Closed

1 task
