
[SPARK-35794][SQL] Allow custom plugin for AQE cost evaluator #32944

Closed
wants to merge 5 commits into master from c21:aqe-cost

Conversation

@c21 (Contributor) commented Jun 17, 2021

What changes were proposed in this pull request?

AQE currently uses a cost evaluator to decide whether to adopt the new plan after replanning. The evaluator used today is SimpleCostEvaluator, which makes the decision based on the number of shuffles in the query plan. This is not a perfect cost model, and different production environments may want different custom evaluators. E.g., sometimes we may still want to handle a skew join even though it introduces an extra shuffle (trading resources for better latency), and sometimes we may want to take sorts into account as well. Taking our own setup as an example, we use a custom remote shuffle service (Cosco), and its cost model is more complicated. So we want to make the cost evaluator pluggable, so that developers can implement their own CostEvaluator subclass and plug it in dynamically based on configuration.

The approach is to introduce a new config, spark.sql.adaptive.customCostEvaluatorClass, that defines the CostEvaluator subclass to use, and to add CostEvaluator.instantiate to instantiate the configured class in AdaptiveSparkPlanExec.costEvaluator.
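As a standalone sketch of that mechanism (the trait and object names mirror the PR's, but this toy version scores a plan given as a string, whereas Spark's real evaluator operates on a SparkPlan):

```scala
// Sketch of the pluggable-evaluator idea. Names mirror the PR's
// CostEvaluator / SimpleCostEvaluator, but the types are simplified.
trait CostEvaluator {
  def evaluateCost(plan: String): Long
}

// Default evaluator: counts "Shuffle" nodes in the toy plan string.
case class SimpleCostEvaluator() extends CostEvaluator {
  def evaluateCost(plan: String): Long =
    "Shuffle".r.findAllIn(plan).length.toLong
}

object CostEvaluator {
  // Mirrors the PR's CostEvaluator.instantiate: reflectively load the class
  // named by spark.sql.adaptive.customCostEvaluatorClass; fall back to the
  // builtin evaluator when the config is unset.
  def instantiate(className: Option[String]): CostEvaluator =
    className match {
      case Some(name) =>
        Class.forName(name).getDeclaredConstructor()
          .newInstance().asInstanceOf[CostEvaluator]
      case None => SimpleCostEvaluator()
    }
}

val evaluator =
  CostEvaluator.instantiate(Some(classOf[SimpleCostEvaluator].getName))
println(evaluator.evaluateCost("Join(Shuffle(Scan), Shuffle(Scan))")) // 2
```

The Option-based fallback matches the review outcome below: the conf is optional, and the builtin evaluator is used when it is not set.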

Why are the changes needed?

Make AQE cost evaluation more flexible.

Does this PR introduce any user-facing change?

No, but an internal config, spark.sql.adaptive.customCostEvaluatorClass, is introduced to allow a custom implementation of CostEvaluator.

How was this patch tested?

Added unit test in AdaptiveQueryExecSuite.scala.

@github-actions github-actions bot added the SQL label Jun 17, 2021
@c21 (Contributor, Author) commented Jun 17, 2021

cc @cloud-fan could you help take a look when you have time? Thanks.

@cloud-fan (Contributor)

does it work well with #32816 ?

@c21 (Contributor, Author) commented Jun 17, 2021

does it work well with #32816 ?

@cloud-fan - I think so. If we decide to merge this first, then in #32816 we won't need the extra config spark.sql.adaptive.forceEnableSkewJoin. Developers/users can set spark.sql.adaptive.costEvaluatorClass to SkewJoinAwareCostEvaluator and it should work. cc @ulysses-you FYI, thanks.

@SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44440/

@SparkQA commented Jun 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44440/

@ulysses-you (Contributor)

@c21 thank you for pinging me.

Not sure it's worth making the cost evaluator a plugin. You mentioned sort (I think it's a local sort, isn't it?); can you provide a real use case for it?

@SparkQA commented Jun 17, 2021

Test build #139911 has finished for PR 32944 at commit 6670938.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Developers/users can set spark.sql.adaptive.costEvaluatorClass to SkewJoinAwareCostEvaluator and it should work

I don't think it's that simple. If force-skew-join-handling is enabled, Spark must use SkewJoinAwareCostEvaluator, not a user-specified one.

@c21 (Contributor, Author) commented Jun 17, 2021

You mentioned sort (I think it's local sort, isn't it ?), and can you provide a real use case about it ?

@ulysses-you - e.g.

SortAggregate
- SortMergeJoin
  - Sort(Shuffle(Scan))
  - Sort(Shuffle(Scan))       

AQE might change it to

SortAggregate
- Sort
  - ShuffledHashJoin
    - Shuffle(Scan)
    - Shuffle(Scan)

With our Cosco remote shuffle service, we have already implemented sorted shuffle (Sort(Shuffle), where the sort and the shuffle are done together on the shuffle service side), and it is more efficient than doing the Sort separately in Spark. So a Sort(Shuffle) is cheaper than a separate Shuffle plus Sort in our case. This influences the AQE decision, and we have to use a custom cost evaluator. As we can see, a separate cost evaluator is needed for forcing skew join, and we may have more in the future. Another aspect is that Cosco has not been open sourced yet, so we want a clean interface for a custom cost evaluator instead of always maintaining a fork on our side.
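The tradeoff above can be sketched with a toy cost model (hypothetical types, not Spark's API) in which a Sort sitting directly on a Shuffle is as cheap as the shuffle alone, because the shuffle service sorts for free:

```scala
// Toy plan tree and cost model for the sorted-shuffle scenario above.
// A Sort directly on top of a Shuffle is costed as the shuffle alone
// (the shuffle service sorts for free); a standalone Sort adds one cost
// unit, as does each Shuffle.
sealed trait Node
case object Scan extends Node
case class Shuffle(child: Node) extends Node
case class Sort(child: Node) extends Node
case class Join(left: Node, right: Node) extends Node

def cost(plan: Node): Long = plan match {
  case Sort(Shuffle(child)) => 1 + cost(child) // sorted shuffle: one unit total
  case Sort(child)          => 1 + cost(child) // standalone sort: one unit
  case Shuffle(child)       => 1 + cost(child)
  case Join(l, r)           => cost(l) + cost(r)
  case Scan                 => 0
}

// Sort-merge join keeps each sort directly on top of a shuffle...
val smj = Join(Sort(Shuffle(Scan)), Sort(Shuffle(Scan))) // cost 2
// ...while switching to shuffled hash join strands a standalone Sort.
val shj = Sort(Join(Shuffle(Scan), Shuffle(Scan)))       // cost 3
```

Under this model the sort-merge-join plan is cheaper, so a shuffle-count-only evaluator like SimpleCostEvaluator (which scores both plans as 2 shuffles) would miss the difference.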

@c21 (Contributor, Author) commented Jun 17, 2021

I don't think it's that simple. If force-skew-join-handling is enabled, Spark must use SkewJoinAwareCostEvaluator, not a user-specified one.

@cloud-fan - from my checking of #32816, it looks like the only logic controlled by the new config spark.sql.adaptive.forceEnableSkewJoin is choosing a different cost evaluator, SkewJoinAwareCostEvaluator. My idea is to not introduce the new config; we can just set spark.sql.adaptive.costEvaluatorClass to SkewJoinAwareCostEvaluator to enable force skew join.

@ulysses-you (Contributor)

@c21 thanks for the explanation; the example of SortAggregate(SMJ) becoming SortAggregate(SHJ) seems useful. But about the usage, I agree with @cloud-fan: the boolean config forceEnableSkewJoin is necessary and easier for users. A class name is a bit hacky for users who just want to optimize skew joins anyway.

@c21 (Contributor, Author) commented Jun 18, 2021

the boolean config forceEnableSkewJoin is necessary and easier for users. A class name is a bit hacky for users who just want to optimize skew joins anyway.

@ulysses-you - sure, I agree that the boolean config is more intuitive and easier to use. If we do need the boolean config, we can add special logic in AdaptiveSparkPlanExec.costEvaluator to use SkewJoinAwareCostEvaluator when spark.sql.adaptive.forceEnableSkewJoin is true, regardless of what the user sets for spark.sql.adaptive.costEvaluatorClass. It's just a matter of priority between the different configs.
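A minimal sketch of that priority order (the forceEnableSkewJoin flag and evaluator names come from the linked discussion, not a final Spark API):

```scala
// Sketch of the config-priority logic discussed above: when the (proposed)
// spark.sql.adaptive.forceEnableSkewJoin flag is on, the skew-join-aware
// evaluator wins regardless of the user-configured evaluator class;
// otherwise the user's class applies, falling back to the builtin default.
def chooseEvaluatorClass(
    forceEnableSkewJoin: Boolean,
    customEvaluatorClass: Option[String]): String =
  if (forceEnableSkewJoin) "SkewJoinAwareCostEvaluator"
  else customEvaluatorClass.getOrElse("SimpleCostEvaluator")
```

This keeps both configs usable at once while making the boolean flag the higher-priority switch, which addresses @cloud-fan's concern above.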

.version("3.2.0")
.internal()
.stringConf
.createWithDefault("org.apache.spark.sql.execution.adaptive.SimpleCostEvaluator")
@cloud-fan (Contributor)

We can make it an optional conf: spark.sql.adaptive.customCostEvaluatorClass. If not set, we use the builtin impl.

@c21 (Contributor, Author)

@cloud-fan - sure, updated.

*/
def instantiate(className: String): CostEvaluator = {
logDebug(s"Creating CostEvaluator $className")
val clazz = Utils.classForName[CostEvaluator](className)
@cloud-fan (Contributor) commented Jul 1, 2021

We can use the standard API in Spark: Utils.loadExtensions

@c21 (Contributor, Author)

@cloud-fan - good call, updated.

@@ -38,7 +38,7 @@ case class SimpleCost(value: Long) extends Cost {
* A simple implementation of [[CostEvaluator]], which counts the number of
* [[ShuffleExchangeLike]] nodes in the plan.
*/
-object SimpleCostEvaluator extends CostEvaluator {
+case class SimpleCostEvaluator() extends CostEvaluator {
@cloud-fan (Contributor)

@c21 (Contributor, Author)

@cloud-fan - yeah, updated.

@SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45060/

@SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45060/

@@ -3555,6 +3564,9 @@ class SQLConf extends Serializable with Logging {

def coalesceShufflePartitionsEnabled: Boolean = getConf(COALESCE_PARTITIONS_ENABLED)

def adaptiveCustomCostEvaluatorClass: Option[String] =
@cloud-fan (Contributor)

nit: we don't have to create a method here if it's only called once

@c21 (Contributor, Author)

@cloud-fan - sure, updated.

val query = "SELECT * FROM testData join testData2 ON key = a where value = '1'"

withSQLConf(SQLConf.ADAPTIVE_CUSTOM_COST_EVALUATOR_CLASS.key ->
"org.apache.spark.sql.execution.adaptive.SimpleShuffleSortCostEvaluator") {
@cloud-fan (Contributor)

does this custom cost evaluator change the query plan? It seems to be the same as the builtin cost evaluator.

@c21 (Contributor, Author)

@cloud-fan - this evaluator does not change the plan, and behaves the same as the builtin evaluator for this query. Do we want to come up with a different one here? I think this just validates that the custom evaluator works.

@cloud-fan (Contributor) commented Jul 2, 2021

SGTM, let's leave it then

@SparkQA commented Jul 2, 2021

Test build #140547 has finished for PR 32944 at commit 404fe35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • .doc("The custom cost evaluator class to be used for adaptive execution. If not being set," +

@cloud-fan (Contributor)

@c21 can you fix the code conflicts?

@c21 (Contributor, Author) commented Jul 2, 2021

@cloud-fan - thanks, just rebased to latest master.

@SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45079/

@HyukjinKwon (Member)

@c21, can you at least mark CostEvaluator with the @Unstable API tag? Also please add a note that it is subject to being moved or changed in the near future.

@SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45079/

@SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45082/

@SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45082/

@SparkQA commented Jul 2, 2021

Test build #140567 has finished for PR 32944 at commit e202aa8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 2, 2021

Test build #140570 has finished for PR 32944 at commit c5ed8e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor, Author) commented Jul 4, 2021
c21 commented Jul 4, 2021

@HyukjinKwon - updated per discussion, and this is ready for review again, thanks.

@SparkQA commented Jul 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45131/

@SparkQA commented Jul 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45131/

@SparkQA commented Jul 4, 2021

Test build #140618 has finished for PR 32944 at commit ac5c121.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

buildConf("spark.sql.adaptive.customCostEvaluatorClass")
.doc("The custom cost evaluator class to be used for adaptive execution. If not being set," +
" Spark will use its own SimpleCostEvaluator by default.")
.version("3.2.0")
@HyukjinKwon (Member) commented Jul 5, 2021

the only thing is that the version has to be 3.3.0 since we cut the branch now. Since this PR likely won't affect anything in the main code, I am okay with merging to 3.2.0 either way, though. I will leave it to @cloud-fan and you.

@cloud-fan (Contributor)

3.2 is the first version that enables AQE by default, and this seems to be a useful extension. Let's include it in 3.2.

@cloud-fan (Contributor)

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in 044dddf Jul 5, 2021
cloud-fan pushed a commit that referenced this pull request Jul 5, 2021

Closes #32944 from c21/aqe-cost.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 044dddf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@c21 (Contributor, Author) commented Jul 6, 2021

Thank you @cloud-fan and @HyukjinKwon for review!

@c21 c21 deleted the aqe-cost branch July 6, 2021 04:55