[SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT#37195

Closed

cloud-fan wants to merge 3 commits intoapache:masterfrom

Contributor

cloud-fan commented Jul 14, 2022 •

edited

Loading

What changes were proposed in this pull request?

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the Scan immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build Scan too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of ScanBuilderHolder, and then rewrite the query plan. Later on, when we build the Scan and replace ScanBuilderHolder with DataSourceV2ScanRelation, we check the actual data schema and add a Project to do type cast if necessary.

Why are the changes needed?

support pushing down LIMIT/OFFSET after agg.

Does this PR introduce any user-facing change?

no

How was this patch tested?

updated tests

github-actions bot added the SQL label


          DS V2 aggregate push down can work with OFFSET or LIMIT

ec5e380

cloud-fan force-pushed the agg branch from c749f4b to ec5e380 Compare

July 15, 2022 14:15

cloud-fan changed the title ~~WIP~~ [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

beliefer reviewed

View reviewed changes

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala Outdated

+                }
+                private def rewriteAggregate(agg: Aggregate): LogicalPlan = agg.child match {
+                  case ScanOperation(project, Nil, holder@ScanBuilderHolder(_, _, r: SupportsPushDownAggregates))

Contributor

beliefer Jul 18, 2022

Suggested change

      
                case ScanOperation(project, Nil, holder@ScanBuilderHolder(_, _, r: SupportsPushDownAggregates))
          
                case ScanOperation(project, Nil, holder @ ScanBuilderHolder(_, _, r: SupportsPushDownAggregates))

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala Outdated

Comment on lines 190 to 196

+                    val newOutput = normalizedGroupingExpr.zipWithIndex.map { case (e, i) =>
+                      AttributeReference(s"group_col_$i", e.dataType)()
+                    } ++ finalAggExprs.zipWithIndex.map { case (e, i) =>
+                      AttributeReference(s"agg_func_$i", e.dataType)()
+                    }
+                    val groupOutput = newOutput.take(normalizedGroupingExpr.length)
+                    val aggOutput = newOutput.drop(normalizedGroupingExpr.length)

Contributor

beliefer Jul 18, 2022

Suggested change

      
                  val newOutput = normalizedGroupingExpr.zipWithIndex.map { case (e, i) =>
          
                    AttributeReference(s"group_col_$i", e.dataType)()
          
                  } ++ finalAggExprs.zipWithIndex.map { case (e, i) =>
          
                    AttributeReference(s"agg_func_$i", e.dataType)()
          
                  }
          
                  val groupOutput = newOutput.take(normalizedGroupingExpr.length)
          
                  val aggOutput = newOutput.drop(normalizedGroupingExpr.length)
          
                  val groupOutput = normalizedGroupingExpr.zipWithIndex.map { case (e, i) =>
          
                    AttributeReference(s"group_col_$i", e.dataType)()
          
                  }
          
                  val aggOutput = finalAggExprs.zipWithIndex.map { case (e, i) =>
          
                    AttributeReference(s"agg_func_$i", e.dataType)()
          
                  }
          
                  val newOutput = groupOutput ++ aggOutput

Contributor

beliefer Jul 18, 2022

Good decoupling !

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+                         |Pushed Aggregate Functions:
+                         | ${translatedAgg.aggregateExpressions().mkString(", ")}
+                         |Pushed Group by:
+                         | ${translatedAgg.groupByExpressions.mkString(", ")}

Contributor

beliefer Jul 18, 2022

It seems we not display Output information here now.

Contributor Author

cloud-fan Jul 18, 2022

Now this is completed inferred by Spark, according to the pushed agg exprs and group by exprs.

Contributor

beliefer Jul 18, 2022

Got it.

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+                      // The data source may return columns with arbitrary data types and it's safer to cast them
+                      // to the expected data type.
+                      assert(Cast.canCast(a1.dataType, a2.dataType))
+                      Alias(addCastIfNeeded(a1, a2.dataType), a2.name)(a2.exprId)

Contributor

beliefer Jul 18, 2022

Good decoupling !


          address comment

e6b41fa

Contributor Author

cloud-fan commented Jul 18, 2022

cc @huaxingao @viirya

viirya reviewed

View reviewed changes

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+                      if (r.supportCompletePushDown(translatedAggOpt.get)) {
+                        (actualResultExprs, normalizedAggExprs, translatedAggOpt.get, true)
+                      } else if (!translatedAggOpt.get.aggregateExpressions().exists(_.isInstanceOf[Avg])) {
+                        (actualResultExprs, normalizedAggExprs, translatedAggOpt.get, false)

Member

viirya Jul 18, 2022

Why for this case canCompletePushDown is false? Previously it is treated like supportCompletePushDown, no?

Contributor Author

cloud-fan Jul 18, 2022

It was false before this PR as well. The previous code will invoke r.supportCompletePushDown one more time which returns false.

The logic is also quite clear:

the Aggregate can't be completely pushed
the Aggregate can't be rewritten (avg -> sum/count)

The above two means this Aggregate can't be completely pushed.

Member

viirya Jul 19, 2022

Got it. canCompletePushDown is false but it's still possible to be partial pushdown. It will be checked later.

viirya reviewed

View reviewed changes

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

Comment on lines +413 to +414

		if sHolder.pushedAggregate.isEmpty && filter.isEmpty &&
		CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) =>

Member

viirya Jul 18, 2022

Hmm, is this regression? I.e. previously we can pushdown, but now we cannot?

Contributor Author

cloud-fan Jul 18, 2022

Previously we can't push down either. After agg pushdown, Scan is already built, so limit/offset/topN pushdown won't apply anyway.

Member

viirya Jul 19, 2022

I see. I was confused by the added comment and thought it was supported before.

huaxingao reviewed

View reviewed changes

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+                    // SELECT min(c1), max(c1) FROM t GROUP BY c2;
+                    // Use c2, min(c1), max(c1) as output for DataSourceV2ScanRelation
+                    // We want to have the following logical plan:
+                    // == Optimized Logical Plan ==

Contributor

huaxingao Jul 19, 2022

I think this example is not accurate any more. Do we need to update this?

viirya approved these changes

View reviewed changes


          Update V2ScanRelationPushDown.scala

66ed3f7

huaxingao approved these changes

View reviewed changes

Contributor Author

cloud-fan commented Jul 20, 2022

thanks for review, merging to master!

cloud-fan closed this in

aec7953

beliefer mentioned this pull request

[SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT #37001

Closed

chenzhx pushed a commit to chenzhx/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

fc0051f

…LIMIT

### What changes were proposed in this pull request?

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

### Why are the changes needed?

support pushing down LIMIT/OFFSET after agg.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

chenzhx added a commit to Kyligence/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

### What changes were proposed in this pull request?
Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

### Why are the changes needed?

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

### What changes were proposed in this pull request?
apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

### Why are the changes needed?
Let `Rand` in filter is more meaningful.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update test case.

### How was this patch tested?
Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

### What changes were proposed in this pull request?
apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Why are the changes needed?
Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Bug will be fix.

### How was this patch tested?
New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

### What changes were proposed in this pull request?
Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

### Why are the changes needed?
Improve ease of use.

### Does this PR introduce _any_ user-facing change?
'No'.
The two API `compileAggregate` call `compileExpression` not changed.

### How was this patch tested?
N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

### What changes were proposed in this pull request?
Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

### Why are the changes needed?
Make build-in JDBC dialect support compile linear regression aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

### What changes were proposed in this pull request?

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

### Why are the changes needed?

support pushing down LIMIT/OFFSET after agg.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

yhcast0 pushed a commit to yhcast0/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

4c5d98a

…LIMIT (Kyligence#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

### What changes were proposed in this pull request?
Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

### Why are the changes needed?

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

### What changes were proposed in this pull request?
apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

### Why are the changes needed?
Let `Rand` in filter is more meaningful.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update test case.

### How was this patch tested?
Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

### What changes were proposed in this pull request?
apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Why are the changes needed?
Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Bug will be fix.

### How was this patch tested?
New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

### What changes were proposed in this pull request?
Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

### Why are the changes needed?
Improve ease of use.

### Does this PR introduce _any_ user-facing change?
'No'.
The two API `compileAggregate` call `compileExpression` not changed.

### How was this patch tested?
N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

### What changes were proposed in this pull request?
Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

### Why are the changes needed?
Make build-in JDBC dialect support compile linear regression aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

### What changes were proposed in this pull request?

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

### Why are the changes needed?

support pushing down LIMIT/OFFSET after agg.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

yhcast0 pushed a commit to Kyligence/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

708f64d

…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

### What changes were proposed in this pull request?
Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

### Why are the changes needed?

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

### What changes were proposed in this pull request?
apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

### Why are the changes needed?
Let `Rand` in filter is more meaningful.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update test case.

### How was this patch tested?
Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

### What changes were proposed in this pull request?
apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Why are the changes needed?
Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Bug will be fix.

### How was this patch tested?
New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

### What changes were proposed in this pull request?
Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

### Why are the changes needed?
Improve ease of use.

### Does this PR introduce _any_ user-facing change?
'No'.
The two API `compileAggregate` call `compileExpression` not changed.

### How was this patch tested?
N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

### What changes were proposed in this pull request?
Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

### Why are the changes needed?
Make build-in JDBC dialect support compile linear regression aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

### What changes were proposed in this pull request?

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

### Why are the changes needed?

support pushing down LIMIT/OFFSET after agg.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

zheniantoushipashi pushed a commit to Kyligence/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

b895ea9

…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

### What changes were proposed in this pull request?
Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

### Why are the changes needed?

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

### What changes were proposed in this pull request?
apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

### Why are the changes needed?
Let `Rand` in filter is more meaningful.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update test case.

### How was this patch tested?
Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

### What changes were proposed in this pull request?
apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Why are the changes needed?
Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Bug will be fix.

### How was this patch tested?
New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

### What changes were proposed in this pull request?
Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

### Why are the changes needed?
Improve ease of use.

### Does this PR introduce _any_ user-facing change?
'No'.
The two API `compileAggregate` call `compileExpression` not changed.

### How was this patch tested?
N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

### What changes were proposed in this pull request?
Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

### Why are the changes needed?
Make build-in JDBC dialect support compile linear regression aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

### What changes were proposed in this pull request?

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

### Why are the changes needed?

support pushing down LIMIT/OFFSET after agg.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

leejaywei pushed a commit to Kyligence/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

7722a5f

…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

'No'.
New feature.

New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

Let `Rand` in filter is more meaningful.

'No'.
Just update test case.

Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

'Yes'.
Bug will be fix.

New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

Improve ease of use.

'No'.
The two API `compileAggregate` call `compileExpression` not changed.

N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

Make build-in JDBC dialect support compile linear regression aggregate push-down.

'No'.
New feature.

New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

support pushing down LIMIT/OFFSET after agg.

no

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

leejaywei pushed a commit to Kyligence/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

3866d28

…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

'No'.
New feature.

New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

Let `Rand` in filter is more meaningful.

'No'.
Just update test case.

Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

'Yes'.
Bug will be fix.

New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

Improve ease of use.

'No'.
The two API `compileAggregate` call `compileExpression` not changed.

N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

Make build-in JDBC dialect support compile linear regression aggregate push-down.

'No'.
New feature.

New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

support pushing down LIMIT/OFFSET after agg.

no

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

RolatZhang pushed a commit to Kyligence/spark that referenced this pull request


          [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or …

05d989b

…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

Currently, Spark DS V2 push-down framework supports push down SQL to data sources.
But the DS V2 push-down framework only support push down the built-in functions to data sources.
Each database have a lot very useful functions which not supported by Spark.
If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases.

1. Spark can leverage the functions supported by databases
2. Improve the query performance.

'No'.
New feature.

New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful

apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI).
But he `Rand` in test case looks no meaningful.

Let `Rand` in filter is more meaningful.

'No'.
Just update test case.

Just update test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT.
So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT.

'Yes'.
Bug will be fix.

New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them.

Improve ease of use.

'No'.
The two API `compileAggregate` call `compileExpression` not changed.

N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions.
Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect.

Make build-in JDBC dialect support compile linear regression aggregate push-down.

'No'.
New feature.

New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

This PR refactors the v2 agg pushdown code. The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg.

The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary.

support pushing down LIMIT/OFFSET after agg.

no

updated tests

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

SQL