
Conversation

@guykhazma
Contributor

What changes were proposed in this pull request?

Follow-up on SPARK-30428, which added support for partition pruning in File source V2.
This PR implements the necessary changes to pass the `dataFilters` to `listFiles`. This enables `FileIndex` implementations which use the `dataFilters` for further pruning the file listing (see the discussion here).

Why are the changes needed?

Datasources such as `csv` and `json` do not implement the `SupportsPushDownFilters` trait. In order to support data skipping uniformly for all file-based data sources, one can override the `listFiles` method in a `FileIndex` implementation, which consults external metadata and prunes the list of files.
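For illustration, such an override might look like the following minimal sketch. It assumes Spark's internal `FileIndex` API; `SkippingFileIndex` and its `mayContainMatches` metadata lookup are hypothetical names for this example, not part of this PR:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory}
import org.apache.spark.sql.types.StructType

// Hypothetical FileIndex wrapper: delegates the listing, then uses the
// dataFilters plus external metadata to drop files that cannot match.
class SkippingFileIndex(
    delegate: FileIndex,
    // External metadata lookup: may this file contain rows matching the filters?
    mayContainMatches: (Path, Seq[Expression]) => Boolean)
  extends FileIndex {

  override def listFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
    val partitions = delegate.listFiles(partitionFilters, dataFilters)
    if (dataFilters.isEmpty) partitions
    else partitions.map { pd =>
      // Keep only files whose external metadata says they may contain matches.
      PartitionDirectory(
        pd.values,
        pd.files.filter(f => mayContainMatches(f.getPath, dataFilters)))
    }
  }

  // Everything else simply delegates to the wrapped index.
  override def rootPaths: Seq[Path] = delegate.rootPaths
  override def inputFiles: Array[String] = delegate.inputFiles
  override def refresh(): Unit = delegate.refresh()
  override def sizeInBytes: Long = delegate.sizeInBytes
  override def partitionSchema: StructType = delegate.partitionSchema
}
```

Because `PartitioningAwareFileIndex` ignores the `dataFilters` argument, a wrapper like this is only useful once `listFiles` actually receives the data filters, which is what this PR wires up.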

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified the unit tests for v2 file sources to verify the `dataFilters` are passed.

@gengliangwang
Member

Jenkins, test this please.

@gengliangwang gengliangwang left a comment

@guykhazma Thanks for working on it.
Two suggestions:

  1. Please create another JIRA. SPARK-30428 is for partition pruning.
  2. Please add more test cases.

@guykhazma guykhazma changed the title [SPARK-30428][SQL][FOLLOWUP] File source V2: Push data filters for file listing [SPARK-30475][SQL] File source V2: Push data filters for file listing Jan 9, 2020
@guykhazma
Contributor Author

guykhazma commented Jan 10, 2020

@gengliangwang as for the tests, I have added to the existing tests a check that the `dataFilters` are indeed passed to the `FileScan`.
In addition, I have added a test which doesn't have `partitionFilters`, so only the `dataFilters` should be non-empty.
Since the current `FileIndex` (`PartitioningAwareFileIndex`) is not affected by the `dataFilters`, there is no test that checks any pruning besides the pruning done by the `partitionFilters`.

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116427 has finished for PR 27157 at commit 1a65933.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@guykhazma
Contributor Author

retest this please

@guykhazma
Contributor Author

@gengliangwang I have fixed the tests and also added a test for Avro scan without `partitionFilters`.

getPartitionKeyFiltersAndDataFilters(scan.sparkSession, v2Relation,
scan.readPartitionSchema, filters, output)
// The dataFilters are pushed down only once
if (partitionKeyFilters.nonEmpty || (dataFilters.nonEmpty && scan.dataFilters.isEmpty)) {
Contributor Author

The reason for the condition

`(dataFilters.nonEmpty && scan.dataFilters.isEmpty)`

is that, unlike the `partitionFilters`, which are pushed down and don't need to be re-evaluated (which makes `partitionKeyFilters.nonEmpty` false in the next iteration), the `dataFilters` remain non-empty. The `scan.dataFilters.isEmpty` check is therefore needed to make sure the rule doesn't keep firing and cause a stack overflow.
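The termination concern can be sketched in isolation. This is plain illustrative Scala, not the actual rule: `MiniScan` stands in for `FileScan`, and the names are made up for this example. The rule-like function rewrites the scan only while the condition holds, so it reaches a no-op (a fixed point) on the second pass:

```scala
// Simplified model of the rule's termination condition.
case class MiniScan(dataFilters: Seq[String], partitionFilters: Seq[String])

def pushFilters(scan: MiniScan,
                partitionKeyFilters: Seq[String],
                dataFilters: Seq[String]): Option[MiniScan] =
  // Partition filters are consumed when pushed, so partitionKeyFilters becomes
  // empty on the next pass. Data filters stay in the plan, so without the
  // `scan.dataFilters.isEmpty` guard the condition would hold on every pass
  // and the rule would rewrite the scan forever.
  if (partitionKeyFilters.nonEmpty || (dataFilters.nonEmpty && scan.dataFilters.isEmpty))
    Some(scan.copy(
      dataFilters = dataFilters,
      partitionFilters = scan.partitionFilters ++ partitionKeyFilters))
  else
    None // fixed point: no change, the optimizer stops applying the rule

// First pass rewrites the scan; second pass is a no-op.
val once = pushFilters(MiniScan(Nil, Nil), Nil, Seq("a > 1")).get
assert(pushFilters(once, Nil, Seq("a > 1")).isEmpty)
```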

@gengliangwang
Member

@guykhazma sorry, but I still have concerns about this PR.
Could you give an example of "data skipping uniformly for all file based data sources" in the comments? #27112 (comment)

@gengliangwang
Member

retest this please.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116575 has finished for PR 27157 at commit 8ab97db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@guykhazma
Contributor Author

guykhazma commented Jan 13, 2020

@gengliangwang by "data skipping uniformly for all file based data sources" I mean that the above approach works uniformly for all formats, whether they support pushdown or not.
(It also benefits formats which support pushdown, such as Parquet, by avoiding the need to read the footer of each file.) See for example this Spark Summit talk.

Note that in Data Source V1 the `dataFilters` are also passed to the `listFiles` method in the `FileSourceScanExec` case class, which is used by all of the file-based datasources.

@guykhazma
Contributor Author

@gengliangwang see also this PR which originally added the dataFilters to the list files method.

@guykhazma
Contributor Author

@gengliangwang @cloud-fan can you please review this PR.

@gengliangwang
Member

@guykhazma Sorry for the late reply.
I was thinking about another approach, but I can't come up with a better one yet.

My major concern is that the filters are supposed to be pushed down in the `FileScanBuilder`; it is weird to push them down again in the `FileScan`. Technically, the partition filters should be pushed down in `FileScanBuilder` as well.
However, the current DSV2 API exposes the filters only as `Filter` instead of `Expression`, and the coverage of `Filter` is limited. That's why I pushed the partition filters into `FileScan` in #27112.

Keeping the existing behavior in V2 is also important, so I will merge this one. We can improve the approach in the future.

@guykhazma
Contributor Author

guykhazma commented Jan 18, 2020

@gengliangwang thanks for reviewing.
I agree with your concern; this can be improved in subsequent PRs, which will require a broader change in the V2 file-based data sources and the V2 API. I'll be glad to help with that.

@gengliangwang
Member

retest this please.

@SparkQA

SparkQA commented Jan 21, 2020

Test build #117138 has finished for PR 27157 at commit d181e38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Thanks, merging to master.

@guykhazma
Contributor Author

@gengliangwang thanks for reviewing and merging!

cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
### What changes were proposed in this pull request?

This bug was introduced by SPARK-30428 at Apache Spark 3.0.0.
This PR fixes `FileScan.equals()`.

### Why are the changes needed?
- Without this fix, `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- The partition filters and data filters added to `FileScan` (in #27112 and #27157) caused the canonicalized forms of some `BatchScanExec` nodes not to match, which prevents some reuse opportunities.

### Does this PR introduce _any_ user-facing change?
Yes, before this fix incorrect reuse of `FileScan`, and therefore of `BatchScanExec`, could have happened, causing correctness issues.

### How was this patch tested?
Added new UTs.

Closes #31848 from peter-toth/SPARK-34756-fix-filescan-equality-check.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
peter-toth added a commit to peter-toth/spark that referenced this pull request Mar 24, 2021
cloud-fan pushed a commit that referenced this pull request Mar 24, 2021
Closes #31952 from peter-toth/SPARK-34756-fix-filescan-equality-check-3.0.
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Feb 3, 2023
Closes apache#27157 from guykhazma/PushdataFiltersInFileListing.

Authored-by: Guy Khazma <guykhag@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>

# Conflicts:
#	external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala
#	external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextScan.scala
#	sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
leejaywei pushed a commit to Kyligence/spark that referenced this pull request May 15, 2023
leejaywei pushed a commit to Kyligence/spark that referenced this pull request May 16, 2023
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Jun 16, 2023