[SPARK-30475][SQL] File source V2: Push data filters for file listing #27157
Conversation
Jenkins, test this please.
gengliangwang left a comment
@guykhazma Thanks for working on it.
Two suggestions:
- Please create another JIRA. SPARK-30428 is for partition pruning.
- Please add more test cases
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
@gengliangwang as for the tests, I have added a check to the existing tests that the
Test build #116427 has finished for PR 27157 at commit

retest this please

@gengliangwang I have fixed the tests and also added a test for Avro scan without
getPartitionKeyFiltersAndDataFilters(scan.sparkSession, v2Relation,
  scan.readPartitionSchema, filters, output)
// The dataFilters are pushed down only once
if (partitionKeyFilters.nonEmpty || (dataFilters.nonEmpty && scan.dataFilters.isEmpty)) {
The reason for the condition
`(dataFilters.nonEmpty && scan.dataFilters.isEmpty)`
is that, unlike the partition filters (which are pushed down once and don't need to be re-evaluated, so `partitionKeyFilters.nonEmpty` becomes false in the next iteration), the `dataFilters` remain non-empty. The `scan.dataFilters.isEmpty` check therefore ensures the rule stops matching after the first push-down, so we don't get a stack overflow.
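The convergence idea behind this guard can be sketched in plain Scala (hypothetical, simplified stand-in types; not Spark's actual rule or plan classes): the optimizer keeps re-applying a rule until the plan stops changing, so the rule must return the scan unchanged once the filters are already pushed.

```scala
// Hypothetical, simplified model of an optimizer rule applied to fixpoint.
case class Scan(dataFilters: Seq[String])

// Push the data filters into the scan only if they were not pushed already;
// the isEmpty guard is what makes the rule a no-op on the second pass.
def applyRule(scan: Scan, dataFilters: Seq[String]): Scan =
  if (dataFilters.nonEmpty && scan.dataFilters.isEmpty) Scan(dataFilters)
  else scan

// Re-apply the rule until the plan no longer changes, as a rule executor does.
def toFixpoint(scan: Scan, filters: Seq[String]): Scan = {
  val next = applyRule(scan, filters)
  if (next == scan) scan else toFixpoint(next, filters)
}
```

This is only a sketch of the idempotence requirement; the real rule operates on `LogicalPlan` nodes and rebuilds a `FileScan`, which is why the guard on `scan.dataFilters` is needed there.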
...re/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
@guykhazma sorry but I still have concerns about this PR.

retest this please.

Test build #116575 has finished for PR 27157 at commit

@gengliangwang by Note that in datasource v1 the

@gengliangwang see also this PR which originally added the

@gengliangwang @cloud-fan can you please review this PR.
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
@guykhazma Sorry to reply late. My major concern is that the filters are supposed to be pushed down in the

Keeping the behavior in V2 is also important. I will merge this one. We can improve the approach in the future.

@gengliangwang thanks for reviewing.

retest this please.

Test build #117138 has finished for PR 27157 at commit

Thanks, merging to master.

@gengliangwang thanks for reviewing and merging!
### What changes were proposed in this pull request?
This bug was introduced by SPARK-30428 at Apache Spark 3.0.0. This PR fixes `FileScan.equals()`.
### Why are the changes needed?
- Without this fix `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- Partition filters and data filters added to `FileScan` (in #27112 and #27157) caused the canonicalized forms of some `BatchScanExec` nodes not to match, which prevented some reuse possibilities.
### Does this PR introduce _any_ user-facing change?
Yes, before this fix incorrect reuse of `FileScan`, and so of `BatchScanExec`, could have happened, causing correctness issues.
### How was this patch tested?
Added new UTs.
Closes #31848 from peter-toth/SPARK-34756-fix-filescan-equality-check.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This bug was introduced by SPARK-30428 at Apache Spark 3.0.0. This PR fixes `FileScan.equals()`.
### Why are the changes needed?
- Without this fix `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- Partition filters and data filters added to `FileScan` (in #27112 and #27157) caused the canonicalized forms of some `BatchScanExec` nodes not to match, which prevented some reuse possibilities.
### Does this PR introduce _any_ user-facing change?
Yes, before this fix incorrect reuse of `FileScan`, and so of `BatchScanExec`, could have happened, causing correctness issues.
### How was this patch tested?
Added new UTs.
Closes #31952 from peter-toth/SPARK-34756-fix-filescan-equality-check-3.0.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Follow up on [SPARK-30428](apache#27112) which added support for partition pruning in File source V2. This PR implements the necessary changes in order to pass the `dataFilters` to the `listFiles`. This enables having `FileIndex` implementations which use the `dataFilters` for further pruning the file listing (see the discussion [here](apache#27112 (comment))).
### Why are the changes needed?
Datasources such as `csv` and `json` do not implement the `SupportsPushDownFilters` trait. In order to support data skipping uniformly for all file based data sources, one can override the `listFiles` method in a `FileIndex` implementation, which consults external metadata and prunes the list of files.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Modifying the unit tests for v2 file sources to verify the `dataFilters` are passed
Closes apache#27157 from guykhazma/PushdataFiltersInFileListing.
Authored-by: Guy Khazma <guykhag@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
# Conflicts:
#	external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala
#	external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextScan.scala
#	sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
### What changes were proposed in this pull request?
Follow up on SPARK-30428 which added support for partition pruning in File source V2.
This PR implements the necessary changes in order to pass the `dataFilters` to the `listFiles`. This enables having `FileIndex` implementations which use the `dataFilters` for further pruning the file listing (see the discussion here).
### Why are the changes needed?
Datasources such as `csv` and `json` do not implement the `SupportsPushDownFilters` trait. In order to support data skipping uniformly for all file based data sources, one can override the `listFiles` method in a `FileIndex` implementation, which consults external metadata and prunes the list of files.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Modifying the unit tests for v2 file sources to verify the `dataFilters` are passed
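As a rough, self-contained illustration of the data-skipping idea described above, here is a plain-Scala sketch (hypothetical, simplified types; not Spark's real `FileIndex` trait or filter classes) of a listing that consults per-file min/max metadata and prunes files that cannot match the data filters:

```scala
// Per-file metadata that an external index might store (hypothetical).
case class FileStats(min: Int, max: Int)

// Stand-in for a pushed data filter of the form "column > value".
case class GreaterThan(value: Int)

// A FileIndex-like component: listFiles uses the data filters to prune
// the file listing before any file is actually read.
class SkippingFileIndex(files: Map[String, FileStats]) {
  def listFiles(dataFilters: Seq[GreaterThan]): Seq[String] =
    files.collect {
      // Keep a file only if its max value could satisfy every filter.
      case (path, stats) if dataFilters.forall(f => stats.max > f.value) => path
    }.toSeq
}
```

For example, with files `f1` (values 0..5) and `f2` (values 10..20), a pushed filter `> 8` drops `f1` from the listing entirely, which is exactly the kind of pruning this PR enables for formats like `csv` and `json` that cannot push filters into the reader.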