Skip to content

Conversation

@ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Feb 27, 2024

What changes were proposed in this pull request?

This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.

Why are the changes needed?

To fix the issue SPARK-47177

Does this PR introduce any user-facing change?

yes, affect plan explain

How was this patch tested?

add test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Feb 27, 2024
@yaooqinn
Copy link
Member

Do we have golden files for ensuring the plan stability in this senario?

@ulysses-you
Copy link
Contributor Author

It seems we did not have golden files for cache related query..

@ulysses-you
Copy link
Contributor Author

cc @cloud-fan as well

* 2. Generates the explain output for each subquery referenced in the plan.
*/
def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = {
def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = synchronized {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add more comments to explain it. Ideally this is a no-op as different explain actions operate on different plan instances, but cached plan is an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment

@cloud-fan
Copy link
Contributor

cc @robreeves @liuzqt

@cloud-fan
Copy link
Contributor

I think we can't remove the mutable states (TreeNodeTag) any time soon, we must live with it and the call sites should be careful when setting it. For EXPLAIN, my preference is to have a string formatter to produce EXPLAIN result, and the formatter implementation uses the visitor pattern and maintains states by itself, instead of using TreeNodeTag. But it's going to be a big change and I'm find with this short term fix by using lock.

append("\n")

if (innerChildren.nonEmpty) {
val innerChildrenLocal = innerChildren
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a new local variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possbile innerChildren is not a variable, it is defined as def innerChildren

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

children is def too, but I don't see we create local variables...

Anyway, this is safer, I'm fine with it

assert(findIMRInnerChild(df.queryExecution.executedPlan).treeString
.contains("AdaptiveSparkPlan isFinalPlan=false"))
df.collect()
assert(findIMRInnerChild(df.queryExecution.executedPlan).treeString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to assert tree doesn't not contains any AdaptiveSparkPlan isFinalPlan=false. See the problematic treeString in https://issues.apache.org/jira/browse/SPARK-47177 also has a isFinalPlan=true in outer AQE plan, and a isFinalPlan=false in the inner AQE cached plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would not contain outer plan, the tree string is from InMemoryRelation.innerChildren

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks!

@ulysses-you
Copy link
Contributor Author

ulysses-you commented Mar 5, 2024

thanks for review, merging to master/branch-3.5

ulysses-you added a commit that referenced this pull request Mar 5, 2024
…xplain string

### What changes were proposed in this pull request?

This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.

### Why are the changes needed?

To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177)

### Does this PR introduce _any_ user-facing change?

yes, affect plan explain

### How was this patch tested?

add test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45282 from ulysses-you/SPARK-47177.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
(cherry picked from commit 6e62a56)
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
@ulysses-you ulysses-you deleted the SPARK-47177 branch March 5, 2024 02:15
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

BTW, #40812 landed at Apache Spark 3.4.1, doesn't it? If then, it seems that we need to backport this to branch-3.4, @ulysses-you .

ulysses-you added a commit to ulysses-you/spark that referenced this pull request Mar 5, 2024
…xplain string

This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.

To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177)

yes, affect plan explain

add test

no

Closes apache#45282 from ulysses-you/SPARK-47177.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
@ulysses-you
Copy link
Contributor Author

@dongjoon-hyun there are some conflicts, I created a new pr #45381 for branch-3.4

@dongjoon-hyun
Copy link
Member

Thank you! That's better and safe.

ulysses-you added a commit that referenced this pull request Mar 5, 2024
… in explain string

This pr backport #45282 to branch-3.4

### What changes were proposed in this pull request?

This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.

### Why are the changes needed?

To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177)

### Does this PR introduce _any_ user-facing change?

yes, affect plan explain

### How was this patch tested?

add test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45381 from ulysses-you/SPARK-47177-3.4.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024
… in explain string

This pr backport apache#45282 to branch-3.4

### What changes were proposed in this pull request?

This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.

### Why are the changes needed?

To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177)

### Does this PR introduce _any_ user-facing change?

yes, affect plan explain

### How was this patch tested?

add test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45381 from ulysses-you/SPARK-47177-3.4.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
cloud-fan pushed a commit that referenced this pull request Jun 17, 2024
### What changes were proposed in this pull request?

refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG

### Why are the changes needed?

#45282 introduced synchronize to `ExplainUtils.processPlan`  to avoid race condition when multiple queries refers to same cached plan.

The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan.

Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes.

### Does this PR introduce _any_ user-facing change?
  NO

### How was this patch tested?
existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #46965 from liuzqt/SPARK-48610.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 17, 2024
### What changes were proposed in this pull request?

refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG

### Why are the changes needed?

#45282 introduced synchronize to `ExplainUtils.processPlan`  to avoid race condition when multiple queries refers to same cached plan.

The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan.

Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes.

### Does this PR introduce _any_ user-facing change?
  NO

### How was this patch tested?
existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #46965 from liuzqt/SPARK-48610.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d3da240)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
senthh pushed a commit to acceldata-io/spark3 that referenced this pull request Aug 23, 2025
…P_ID_TAG

### What changes were proposed in this pull request?

refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG

### Why are the changes needed?

apache#45282 introduced synchronize to `ExplainUtils.processPlan`  to avoid race condition when multiple queries refers to same cached plan.

The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan.

Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes.

### Does this PR introduce _any_ user-facing change?
  NO

### How was this patch tested?
existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#46965 from liuzqt/SPARK-48610.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d3da240)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
senthh pushed a commit to acceldata-io/spark3 that referenced this pull request Oct 31, 2025
…P_ID_TAG

### What changes were proposed in this pull request?

refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG

### Why are the changes needed?

apache#45282 introduced synchronize to `ExplainUtils.processPlan`  to avoid race condition when multiple queries refers to same cached plan.

The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan.

Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes.

### Does this PR introduce _any_ user-facing change?
  NO

### How was this patch tested?
existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#46965 from liuzqt/SPARK-48610.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d3da240)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
senthh pushed a commit to acceldata-io/spark3 that referenced this pull request Nov 3, 2025
…P_ID_TAG

### What changes were proposed in this pull request?

refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG

### Why are the changes needed?

apache#45282 introduced synchronize to `ExplainUtils.processPlan`  to avoid race condition when multiple queries refers to same cached plan.

The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan.

Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes.

### Does this PR introduce _any_ user-facing change?
  NO

### How was this patch tested?
existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#46965 from liuzqt/SPARK-48610.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d3da240)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…xplain string

### What changes were proposed in this pull request?

This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.

### Why are the changes needed?

To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177)

### Does this PR introduce _any_ user-facing change?

yes, affect plan explain

### How was this patch tested?

add test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45282 from ulysses-you/SPARK-47177.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
(cherry picked from commit 6e62a56)
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…apache#626)

### What changes were proposed in this pull request?

refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG

### Why are the changes needed?

apache#45282 introduced synchronize to `ExplainUtils.processPlan`  to avoid race condition when multiple queries refers to same cached plan.

The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan.

Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes.

### Does this PR introduce _any_ user-facing change?
  NO

### How was this patch tested?
existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#46965 from liuzqt/SPARK-48610.

Authored-by: Ziqi Liu <ziqi.liu@databricks.com>

(cherry picked from commit d3da240)

Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ziqi Liu <ziqi.liu@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants