@gengliangwang
Member

What changes were proposed in this pull request?

Background:

The data source option pathGlobFilter was introduced for the binary file format in #24354. It filters files by name, e.g. reading only .png files when the same directory also contains .json files.

Proposal:

Make pathGlobFilter a general option for all file sources. The path filtering should happen during path globbing on the driver.

Motivation:

Filtering file paths inside file scan tasks on executors is awkward: non-matching files are still listed and assigned to partitions before being dropped.

Impact:

  1. The splitting of file partitions will be more balanced.
  2. The metrics of file scan will be more accurate.
  3. Users can use the option for reading other file sources.
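
The first point can be illustrated with a minimal sketch in plain Python rather than Spark code. Everything here is an assumption made for illustration: `FileEntry` and `split_into_partitions` are hypothetical names, the greedy size-based packing is not Spark's actual planner, and `fnmatchcase` only stands in for Hadoop's glob matching.

```python
from fnmatch import fnmatchcase
from typing import List, NamedTuple


class FileEntry(NamedTuple):
    name: str
    size: int


def split_into_partitions(files: List[FileEntry], max_bytes: int) -> List[List[FileEntry]]:
    """Greedily pack files into partitions of at most max_bytes (hypothetical planner)."""
    partitions, current, current_size = [], [], 0
    for f in sorted(files, key=lambda f: f.size, reverse=True):
        if current and current_size + f.size > max_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size
    if current:
        partitions.append(current)
    return partitions


files = [
    FileEntry("a.png", 40), FileEntry("b.json", 100),
    FileEntry("c.png", 40), FileEntry("d.json", 100),
    FileEntry("e.png", 40),
]
pattern = "*.png"

# Filter first (as on the driver), then plan partitions over matching files only.
filtered_first = split_into_partitions(
    [f for f in files if fnmatchcase(f.name, pattern)], max_bytes=100)

# Plan over every file, then drop non-matching files inside each task.
filtered_late = [
    [f for f in part if fnmatchcase(f.name, pattern)]
    for part in split_into_partitions(files, max_bytes=100)
]

print([sum(f.size for f in p) for p in filtered_first])
print([sum(f.size for f in p) for p in filtered_late])
```

With these made-up sizes, filtering first produces two partitions of 80 and 40 bytes, while filtering late produces four tasks, two of which end up with nothing to read.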

How was this patch tested?

Unit tests

@HyukjinKwon
Member

I think this option also conflicts with Avro's ignoreExtension

@dongjoon-hyun
Member

dongjoon-hyun commented May 3, 2019

Nice, @gengliangwang. BTW, can we put matchGlobPattern(f) first, as in matchGlobPattern(f) && isNonEmptyFile(f), to avoid extra f.getLen calls?
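
A small Python sketch of the ordering point, under stated assumptions: `FakeFileStatus` is a hypothetical stand-in that counts length lookups, and `match_glob_pattern` / `is_non_empty_file` only mimic the names used in the comment above. It is not the Spark code under review.

```python
from fnmatch import fnmatchcase


class FakeFileStatus:
    """Hypothetical stand-in for a file status object; counts length lookups."""
    get_len_calls = 0

    def __init__(self, name: str, length: int):
        self.name = name
        self._length = length

    def get_len(self) -> int:
        FakeFileStatus.get_len_calls += 1
        return self._length


def match_glob_pattern(f: FakeFileStatus, pattern: str = "*.png") -> bool:
    return fnmatchcase(f.name, pattern)


def is_non_empty_file(f: FakeFileStatus) -> bool:
    return f.get_len() > 0


files = [
    FakeFileStatus("a.png", 10),
    FakeFileStatus("b.json", 20),
    FakeFileStatus("c.json", 0),
]

# Glob check first: `and` short-circuits, so get_len runs only for matching names.
selected = [f.name for f in files if match_glob_pattern(f) and is_non_empty_file(f)]
calls_glob_first = FakeFileStatus.get_len_calls

# Length check first: get_len runs for every file, matching or not.
FakeFileStatus.get_len_calls = 0
selected_alt = [f.name for f in files if is_non_empty_file(f) and match_glob_pattern(f)]
calls_len_first = FakeFileStatus.get_len_calls

print(selected, calls_glob_first, calls_len_first)
```

Both orderings select the same files; only the number of length lookups differs (one versus three in this toy input).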

@SparkQA

SparkQA commented May 3, 2019

Test build #105093 has finished for PR 24518 at commit c7bf17b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented May 3, 2019

cc: @WeichenXu123

@mengxr
Contributor

mengxr commented May 3, 2019

@gengliangwang Could you update the binary file user guide and API docs?

@gengliangwang
Member Author

> I think this option also conflicts with Avro's ignoreExtension

@HyukjinKwon that's true. But I think we have to keep both options effective...

@gengliangwang gengliangwang changed the title [SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources [WIP][SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources May 4, 2019
@SparkQA

SparkQA commented May 4, 2019

Test build #105121 has finished for PR 24518 at commit 53e9b6b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2019

Test build #105120 has finished for PR 24518 at commit a005b0b.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

Can we deprecate ignoreExtension? We can just note somewhere that the option is deprecated in favor of pathGlobFilter, and that users should set pathGlobFilter to *.avro instead.

@SparkQA

SparkQA commented May 6, 2019

Test build #105148 has finished for PR 24518 at commit 53e9b6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [WIP][SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources [SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources May 6, 2019
@gengliangwang
Member Author

I think I have addressed all the comments, please review this again.
@gatorsmile @HyukjinKwon @mengxr @dongjoon-hyun @WeichenXu123

</table>

To read whole binary files, you need to specify the data source `format` as `binaryFile`.
For example, the following code reads all PNG files from the input directory:
Contributor

Can we keep the pathGlobFilter option in the example? It is actually important for the use case. Just mention pathGlobFilter is a global option.

Member Author

Sure, I will revert this.

@SparkQA

SparkQA commented May 7, 2019

Test build #105175 has finished for PR 24518 at commit dcab4f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 7, 2019

Test build #105179 has finished for PR 24518 at commit ddf1874.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

I am okay with this one.

@SparkQA

SparkQA commented May 7, 2019

Test build #105225 has finished for PR 24518 at commit d8f8420.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

* ``timeZone``: sets the string that indicates a timezone to be used to parse timestamps
in the JSON/CSV datasources or partition values.
If it isn't set, it uses the default value, session local timezone.
* ``pathGlobFilter``: an optional glob pattern to only include files with paths matching
Member

Sorry, actually, can we move this documentation to each implementation (CSV, Parquet, ORC, text)? It only works with those built-in file-based sources.

HyukjinKwon added a commit that referenced this pull request Dec 23, 2019
…p' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC

### What changes were proposed in this pull request?

This PR documents and exposes the options `recursiveFileLookup` and `pathGlobFilter` for file sources, and `mergeSchema` for ORC.

- `recursiveFileLookup` at file sources: #24830 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `pathGlobFilter` at file sources: #24518 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `mergeSchema` at ORC: #24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))

**Note that** the `timeZone` option was not moved from `DataFrameReader.options`, as I assume it will likely affect other datasources as well once DSv2 is complete.

### Why are the changes needed?

To document available options in sources properly.

### Does this PR introduce any user-facing change?

In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.

### How was this patch tested?

Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only:

```bash
$ ls -al tmp
...
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 aa
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 ab
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 ac
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 cc
```

```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```

```
+-----+
|value|
+-----+
|   ac|
|   cc|
+-----+
```
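
The same selection can be sanity-checked without Spark. This is only a sketch: it assumes Python's `fnmatchcase` behaves like the Hadoop glob matcher for a simple pattern such as `*c`, which is not guaranteed for more complex patterns.

```python
from fnmatch import fnmatchcase

# File names from the `ls` listing in the manual test.
names = ["aa", "ab", "ac", "cc"]

# Keep only names matching the glob used in the manual test.
matched = [n for n in names if fnmatchcase(n, "*c")]
print(matched)
```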

Closes #26958 from HyukjinKwon/doc-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* If the option is not set, the Hadoop's config `avro.mapred.ignore.inputs.without.extension`
* is taken into account. If the former one is not set too, file extensions are ignored.
*/
@deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")
Member

I am wondering who this deprecation warning is for. Spark users don't use ignoreExtension directly. I think we should log a warning when we read and detect that AvroFileFormat.IgnoreFilesWithoutExtensionProperty and/or AvroOptions.ignoreExtensionKey are set; otherwise users will never see the deprecation.

Member

Only in one case, at schema inference. I would remove this annotation and log a warning when AvroOptions is initialized. The deprecation warning is printed only while compiling Spark, which is useless for users.

Member

I think we should remove the @deprecated annotation.

It would be great if we can put that logic into AvroOptions e.g.:

    parameters
      .get(AvroOptions.ignoreExtensionKey)
      .map { v =>
        logWarning(...)
        v.toBoolean
      }.getOrElse(!ignoreFilesWithoutExtension)

However, can you make sure it doesn't log the warning too many times? If we put it there, it seems the same warning will be logged multiple times.

Member

If you can find a better way, please go ahead and open a PR (I also picked some nits below).


      parameters
    -   .get("ignoreExtension")
    +   .get(AvroOptions.ignoreExtensionKey)
Member

ignoreExtensionKey -> IGNORE_EXTENTION_KEY to be consistent with other XXXOptions

options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
val conf = spark.sessionState.newHadoopConf()
if (options.contains("ignoreExtension")) {
Member

"ignoreExtension" -> AvroOptions.ignoreExtensionKey

lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
…or all file sources

Closes apache#24518 from gengliangwang/globFilter.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>