[SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string #45282

ulysses-you · 2024-02-27T09:44:30Z

What changes were proposed in this pull request?

This PR adds a lock to ExplainUtils.processPlan to avoid a race condition on tree node tags.

Why are the changes needed?

To fix the issue SPARK-47177

Does this PR introduce any user-facing change?

Yes, it affects the plan explain output.

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

no

yaooqinn · 2024-02-27T11:05:46Z

Do we have golden files for ensuring plan stability in this scenario?

ulysses-you · 2024-02-27T11:13:52Z

It seems we do not have golden files for cache-related queries.

ulysses-you · 2024-02-28T01:25:05Z

cc @cloud-fan as well

cloud-fan · 2024-03-04T05:35:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala

   *   2. Generates the explain output for each subquery referenced in the plan.
   */
-  def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = {
+  def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = synchronized {


We should add more comments to explain it. Ideally this is a no-op as different explain actions operate on different plan instances, but cached plan is an exception.

added comment
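
To make the point about shared instances concrete, here is a minimal sketch of the pattern under discussion. It is not Spark's code: `Node`, `ExplainSketch`, and the `operatorId` key are hypothetical stand-ins for a plan node with mutable tags and for the assign/render/clear cycle inside `processPlan`. When every caller has its own plan instance the lock is uncontended; it only matters when two callers reach the same cached instance.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a plan node that carries mutable tags.
final class Node(val name: String, val children: Seq[Node]) {
  private val tags = mutable.Map.empty[String, Int]
  def setTag(key: String, value: Int): Unit = tags.synchronized { tags(key) = value }
  def getTag(key: String): Option[Int] = tags.synchronized { tags.get(key) }
  def unsetTag(key: String): Unit = tags.synchronized { tags.remove(key) }
  // Pre-order traversal over this node and its descendants.
  def foreach(f: Node => Unit): Unit = { f(this); children.foreach(_.foreach(f)) }
}

object ExplainSketch {
  private val OpId = "operatorId" // hypothetical tag key

  // The whole assign / render / clear cycle runs under one lock, so a second
  // caller sharing the same Node instances cannot overwrite or remove the
  // tags while the first caller is still rendering.
  def processPlan(plan: Node, append: String => Unit): Unit = synchronized {
    var next = 0
    plan.foreach { n => next += 1; n.setTag(OpId, next) }
    plan.foreach { n => append(s"(${n.getTag(OpId).getOrElse(-1)}) ${n.name}\n") }
    plan.foreach(_.unsetTag(OpId))
  }

  def explain(plan: Node): String = {
    val sb = new StringBuilder
    processPlan(plan, s => { sb.append(s); () })
    sb.toString
  }
}
```

Without the `synchronized`, one thread's `unsetTag` pass could run between another thread's `setTag` and render passes on the shared nodes, which is the kind of interleaving the lock rules out.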

cloud-fan · 2024-03-04T05:36:31Z

cc @robreeves @liuzqt

cloud-fan · 2024-03-04T05:39:50Z

I think we can't remove the mutable state (TreeNodeTag) any time soon; we must live with it, and call sites should be careful when setting it. For EXPLAIN, my preference is to have a string formatter produce the EXPLAIN result, with the formatter implementation using the visitor pattern and maintaining state by itself instead of using TreeNodeTag. But that's going to be a big change, and I'm fine with this short-term fix using a lock.

cloud-fan · 2024-03-04T05:40:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

    append("\n")

-    if (innerChildren.nonEmpty) {
+    val innerChildrenLocal = innerChildren


why do we need a new local variable?

It is possible that innerChildren is not a variable; it is defined as def innerChildren.

children is def too, but I don't see we create local variables...

Anyway, this is safer, I'm fine with it
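
A small sketch of why reading a `def` once into a local can be safer than calling it twice. This is an illustration only: `Relation`, `cachedPlan`, and `render` are hypothetical names, and the assumption that the underlying value may change between two calls is mine, not something stated in the thread.

```scala
object InnerChildrenSketch {
  // Hypothetical node whose innerChildren is recomputed on every call,
  // as any `def` member is.
  final class Relation {
    @volatile var cachedPlan: Seq[String] = Seq("AdaptiveSparkPlan isFinalPlan=false")
    def innerChildren: Seq[String] = cachedPlan
  }

  // Reads innerChildren exactly once; the emptiness check and the rendering
  // both use the same snapshot, whatever happens in between.
  def render(r: Relation, betweenReads: () => Unit = () => ()): String = {
    val innerChildrenLocal = r.innerChildren
    betweenReads() // stands in for a concurrent update of the relation
    if (innerChildrenLocal.nonEmpty) innerChildrenLocal.mkString("\n") else ""
  }
}
```

With two separate calls to `r.innerChildren`, the check and the rendering could observe different values; with the local they cannot.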

liuzqt · 2024-03-04T18:46:26Z

sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryRelationSuite.scala

+    assert(findIMRInnerChild(df.queryExecution.executedPlan).treeString
+      .contains("AdaptiveSparkPlan isFinalPlan=false"))
+    df.collect()
+    assert(findIMRInnerChild(df.queryExecution.executedPlan).treeString


Better to assert that the tree does not contain any AdaptiveSparkPlan isFinalPlan=false. See the problematic treeString in https://issues.apache.org/jira/browse/SPARK-47177, which also has an isFinalPlan=true in the outer AQE plan and an isFinalPlan=false in the inner AQE cached plan.

it would not contain outer plan, the tree string is from InMemoryRelation.innerChildren

I see, thanks!

ulysses-you · 2024-03-05T02:13:10Z

thanks for review, merging to master/branch-3.5

…xplain string ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? yes, affect plan explain ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #45282 from ulysses-you/SPARK-47177. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> (cherry picked from commit 6e62a56) Signed-off-by: youxiduo <youxiduo@corp.netease.com>

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2024-03-05T02:22:30Z

BTW, #40812 landed in Apache Spark 3.4.1, didn't it? If so, it seems that we need to backport this to branch-3.4, @ulysses-you.

ulysses-you · 2024-03-05T02:32:09Z

@dongjoon-hyun there are some conflicts, I created a new pr #45381 for branch-3.4

dongjoon-hyun · 2024-03-05T02:32:56Z

Thank you! That's better and safe.

… in explain string This pr backport #45282 to branch-3.4 ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? yes, affect plan explain ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #45381 from ulysses-you/SPARK-47177-3.4. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com>

What changes were proposed in this pull request?

Refactor: in `ExplainUtils.processPlan`, use an auxiliary idMap instead of `OP_ID_TAG`.

Why are the changes needed?

#45282 added `synchronized` to `ExplainUtils.processPlan` to avoid a race condition when multiple queries refer to the same cached plan. The granularity of that lock is too coarse. We can fix the root cause of this concurrency issue by refactoring the use of the mutable `OP_ID_TAG`, which is not good practice given the immutable nature of SparkPlan. Instead, we can use an auxiliary id map keyed by object identity. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, so this is safe to do, with a thread local making the map available to the other classes involved.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
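
The identity-keyed map with a thread local can be sketched roughly as follows. This is an assumption-laden illustration, not the code from #46965: `Op`, `IdMapSketch`, and `localIdMap` are hypothetical names, and the traversal is simplified to a pre-order walk.

```scala
import java.util.IdentityHashMap

// Hypothetical immutable plan node. Case-class equality is structural, so two
// distinct but equal nodes would collide in an ordinary HashMap.
final case class Op(name: String, children: Seq[Op] = Nil)

object IdMapSketch {
  // Per-thread map from node identity to operator id; nothing is written
  // onto the (possibly shared) nodes themselves.
  private val localIdMap = new ThreadLocal[IdentityHashMap[Op, Integer]]

  def processPlan(plan: Op): String = {
    localIdMap.set(new IdentityHashMap[Op, Integer]())
    try {
      assignIds(plan)
      render(plan)
    } finally {
      localIdMap.remove() // do not leak the map past this call
    }
  }

  private def assignIds(op: Op): Unit = {
    val ids = localIdMap.get()
    if (!ids.containsKey(op)) ids.put(op, Integer.valueOf(ids.size() + 1))
    op.children.foreach(assignIds)
  }

  private def render(op: Op): String =
    s"(${localIdMap.get().get(op)}) ${op.name}\n" + op.children.map(render).mkString
}
```

Because each call builds its own map on its own thread, concurrent explains over the same cached plan no longer share mutable state, so no lock is needed in this sketch. Keying by identity keeps equal-but-distinct operators apart while giving a reused instance a single id.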

github-actions bot added the SQL label Feb 27, 2024

Cached SQL plan do not display final AQE plan in explain string

0f28449

ulysses-you force-pushed the SPARK-47177 branch from 6820a24 to 0f28449 Compare March 1, 2024 03:17

cloud-fan reviewed Mar 4, 2024

cloud-fan approved these changes Mar 4, 2024

cloud-fan reviewed Mar 4, 2024

add comment

1de8299

liuzqt reviewed Mar 4, 2024

liuzqt approved these changes Mar 5, 2024

ulysses-you closed this in 6e62a56 Mar 5, 2024

ulysses-you deleted the SPARK-47177 branch March 5, 2024 02:15

dongjoon-hyun reviewed Mar 5, 2024

ulysses-you mentioned this pull request Mar 5, 2024

[SPARK-47177][SQL][3.4] Cached SQL plan do not display final AQE plan in explain string #45381

Closed

ulysses-you mentioned this pull request Mar 7, 2024

[CORE] Add synchronized for ExplainUtils processPlan apache/incubator-gluten#4876

Merged

liuzqt mentioned this pull request Jun 13, 2024

[SPARK-48610][SQL] refactor: use auxiliary idMap instead of OP_ID_TAG #46965

Closed
