[SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering #36918
Conversation
What about using match here?
Changed. Thanks
cc @cloud-fan Could you please take a look when you have a moment? Thanks!
    scan match {
      case _: SupportsRuntimeFiltering =>
        DataSourceStrategy.translateRuntimeFilter(e)
      case _: SupportsRuntimeV2Filtering =>
Shall we make SupportsRuntimeV2Filtering take priority over SupportsRuntimeFiltering? We also need to document the behavior if a source implements both of them.
It doesn't seem likely to me that a data source would implement both SupportsRuntimeV2Filtering and SupportsRuntimeFiltering.
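For reference, a minimal sketch of what that prioritization could look like in the match above, should a source ever implement both. The V2 helper name here mirrors the V1 one and is an assumption, not confirmed against the merged patch:

    // Scala tries match cases top to bottom, so listing the V2 case first
    // gives SupportsRuntimeV2Filtering priority when a scan mixes in both.
    scan match {
      case _: SupportsRuntimeV2Filtering =>
        DataSourceV2Strategy.translateRuntimeFilterV2(e)  // assumed helper name
      case _: SupportsRuntimeFiltering =>
        DataSourceStrategy.translateRuntimeFilter(e)
      case _ => None
    }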
    }
    val literals = values.map { value =>
      val literal = Literal(value)
      LiteralValue(literal.value, literal.dataType)
We don't need to infer the data type by creating a catalyst Literal; the type is already available as in.child.dataType.
Fixed. Thanks
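A sketch of the change being discussed: in this code path the runtime filter is an IN expression whose element type is already known from its child, so no catalyst Literal needs to be constructed just to recover it. Variable names follow the snippet above; this is a simplified illustration, not the exact merged patch:

    // Before: build a catalyst Literal only to read back its value and type.
    val literalsBefore = values.map { value =>
      val literal = Literal(value)
      LiteralValue(literal.value, literal.dataType)
    }

    // After: reuse the data type of the IN expression's child directly.
    val literalsAfter = values.map { value =>
      LiteralValue(value, in.child.dataType)
    }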
(Resolved review thread on sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala.)
    if (partitioning.length == 1 && partitioning.head.references().length == 1) {
      val ref = partitioning.head.references().head
      filters.foreach {
        case p: Predicate if p.name().equals("IN") =>
It feels like an unapply method to extract what you want would be preferable.
Predicate is a Java class; I don't think unapply can be used.
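For context, Scala pattern matching does not require the matched class itself to define unapply; a standalone extractor object can wrap a Java class such as Predicate. A hypothetical sketch of what the reviewer may have had in mind (InPredicate is an invented name, not part of the PR):

    import org.apache.spark.sql.connector.expressions.Expression
    import org.apache.spark.sql.connector.expressions.filter.Predicate

    // Invented extractor: matches any V2 Predicate named "IN" and exposes
    // its children, so callers can pattern-match instead of testing name().
    object InPredicate {
      def unapply(p: Predicate): Option[Array[Expression]] =
        if (p.name() == "IN") Some(p.children()) else None
    }

    // Usage inside the foreach from the diff above:
    //   filters.foreach {
    //     case InPredicate(children) => // handle the IN predicate
    //     case _ =>
    //   }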
The test failure is unrelated.
(Further review threads, now resolved or outdated, on sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsRuntimeFiltering.java, sql/catalyst/src/main/scala/org/apache/spark/sql/util/PredicateUtils.scala, and sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/PredicateUtils.scala.)
The GA failure is unrelated. Merging to master, thanks!
Thanks @cloud-fan @zinking |
      with EnableAdaptiveExecutionSuite

    abstract class DynamicPartitionPruningV2FilterSuite
      extends DynamicPartitionPruningDataSourceSuiteBase {
Shall we extend DynamicPartitionPruningV2Suite here? Then we can save the override protected def runAnalyzeColumnCommands: Boolean = false, and the catalog configs will be overwritten.
Sounds good. I have a follow-up here.
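A minimal sketch of that follow-up, assuming DynamicPartitionPruningV2Suite already defines the override and the catalog configs (body elided; not the exact follow-up patch):

    // Extending the V2 suite inherits
    //   override protected def runAnalyzeColumnCommands: Boolean = false
    // and its catalog configs, so they need not be repeated here.
    abstract class DynamicPartitionPruningV2FilterSuite
      extends DynamicPartitionPruningV2Suite {
      // only V2-Filter-specific setup goes here
    }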
Hi @huaxingao. We are trying to use Spark DataSource V2 and noticed that the built-in V2 data sources (e.g. the Parquet one, looking at …) do not implement runtime filtering. Is there a plan to have them support this? It would be really beneficial for the file scans, and given that they already benefit from some pushdowns, we were wondering why runtime filtering is not implemented. Or maybe I am missing something? In that case it would be great to understand how to have Spark file sources take advantage of DPP. Thanks!
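For readers with the same question: independent of the built-in file sources, a custom V2 source opts into runtime filtering by mixing the new interface into its Scan. A hedged sketch (class and column names are invented; method signatures follow the SupportsRuntimeV2Filtering interface):

    import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
    import org.apache.spark.sql.connector.expressions.filter.Predicate
    import org.apache.spark.sql.connector.read.{Scan, SupportsRuntimeV2Filtering}
    import org.apache.spark.sql.types.StructType

    // Hypothetical scan over a source partitioned by "part_col".
    class MyPartitionedScan(schema: StructType)
      extends Scan with SupportsRuntimeV2Filtering {

      // Attributes Spark may build runtime filters (e.g. DPP) on.
      override def filterAttributes(): Array[NamedReference] =
        Array(Expressions.column("part_col"))

      // Called with the runtime predicates (e.g. an IN list of join key
      // values) before execution; a real implementation would use them
      // to prune partitions.
      override def filter(predicates: Array[Predicate]): Unit =
        predicates.foreach(p => println(s"runtime predicate: ${p.describe()}"))

      override def readSchema(): StructType = schema
    }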
What changes were proposed in this pull request?
Use V2 Filter in runtime filtering for V2 tables.
Why are the changes needed?
We should use V2 Filter in DS V2.
#32921 (comment)
Does this PR introduce any user-facing change?
Yes, a new interface: SupportsRuntimeV2Filtering
How was this patch tested?
New test suite.