-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-47177][SQL] Cached SQL plan do not display final AQE plan in explain string #45282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Do we have golden files for ensuring the plan stability in this senario? |
|
It seems we did not have golden files for cache related query.. |
|
cc @cloud-fan as well |
| * 2. Generates the explain output for each subquery referenced in the plan. | ||
| */ | ||
| def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = { | ||
| def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = synchronized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add more comments to explain it. Ideally this is a no-op as different explain actions operate on different plan instances, but cached plan is an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment
|
I think we can't remove the mutable states (TreeNodeTag) any time soon, we must live with it and the call sites should be careful when setting it. For EXPLAIN, my preference is to have a string formatter to produce EXPLAIN result, and the formatter implementation uses the visitor pattern and maintains states by itself, instead of using TreeNodeTag. But it's going to be a big change and I'm find with this short term fix by using lock. |
| append("\n") | ||
|
|
||
| if (innerChildren.nonEmpty) { | ||
| val innerChildrenLocal = innerChildren |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a new local variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is possbile innerChildren is not a variable, it is defined as def innerChildren
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
children is def too, but I don't see we create local variables...
Anyway, this is safer, I'm fine with it
| assert(findIMRInnerChild(df.queryExecution.executedPlan).treeString | ||
| .contains("AdaptiveSparkPlan isFinalPlan=false")) | ||
| df.collect() | ||
| assert(findIMRInnerChild(df.queryExecution.executedPlan).treeString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to assert tree doesn't not contains any AdaptiveSparkPlan isFinalPlan=false. See the problematic treeString in https://issues.apache.org/jira/browse/SPARK-47177 also has a isFinalPlan=true in outer AQE plan, and a isFinalPlan=false in the inner AQE cached plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would not contain outer plan, the tree string is from InMemoryRelation.innerChildren
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks!
|
thanks for review, merging to master/branch-3.5 |
…xplain string ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? yes, affect plan explain ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #45282 from ulysses-you/SPARK-47177. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> (cherry picked from commit 6e62a56) Signed-off-by: youxiduo <youxiduo@corp.netease.com>
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
BTW, #40812 landed at Apache Spark 3.4.1, doesn't it? If then, it seems that we need to backport this to branch-3.4, @ulysses-you . |
…xplain string This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) yes, affect plan explain add test no Closes apache#45282 from ulysses-you/SPARK-47177. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com>
|
@dongjoon-hyun there are some conflicts, I created a new pr #45381 for branch-3.4 |
|
Thank you! That's better and safe. |
… in explain string This pr backport #45282 to branch-3.4 ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? yes, affect plan explain ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #45381 from ulysses-you/SPARK-47177-3.4. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com>
… in explain string This pr backport apache#45282 to branch-3.4 ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? yes, affect plan explain ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#45381 from ulysses-you/SPARK-47177-3.4. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com>
### What changes were proposed in this pull request? refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG ### Why are the changes needed? #45282 introduced synchronize to `ExplainUtils.processPlan` to avoid race condition when multiple queries refers to same cached plan. The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan. Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UTs. ### Was this patch authored or co-authored using generative AI tooling? NO Closes #46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG ### Why are the changes needed? #45282 introduced synchronize to `ExplainUtils.processPlan` to avoid race condition when multiple queries refers to same cached plan. The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan. Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UTs. ### Was this patch authored or co-authored using generative AI tooling? NO Closes #46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d3da240) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…P_ID_TAG ### What changes were proposed in this pull request? refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG ### Why are the changes needed? apache#45282 introduced synchronize to `ExplainUtils.processPlan` to avoid race condition when multiple queries refers to same cached plan. The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan. Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UTs. ### Was this patch authored or co-authored using generative AI tooling? NO Closes apache#46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d3da240) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…P_ID_TAG ### What changes were proposed in this pull request? refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG ### Why are the changes needed? apache#45282 introduced synchronize to `ExplainUtils.processPlan` to avoid race condition when multiple queries refers to same cached plan. The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan. Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UTs. ### Was this patch authored or co-authored using generative AI tooling? NO Closes apache#46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d3da240) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…P_ID_TAG ### What changes were proposed in this pull request? refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG ### Why are the changes needed? apache#45282 introduced synchronize to `ExplainUtils.processPlan` to avoid race condition when multiple queries refers to same cached plan. The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan. Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UTs. ### Was this patch authored or co-authored using generative AI tooling? NO Closes apache#46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d3da240) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…xplain string ### What changes were proposed in this pull request? This pr adds lock for ExplainUtils.processPlan to avoid tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? yes, affect plan explain ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#45282 from ulysses-you/SPARK-47177. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> (cherry picked from commit 6e62a56) Signed-off-by: youxiduo <youxiduo@corp.netease.com>
…apache#626) ### What changes were proposed in this pull request? refactor: In `ExplainUtils.processPlan`, use auxiliary idMap instead of OP_ID_TAG ### Why are the changes needed? apache#45282 introduced synchronize to `ExplainUtils.processPlan` to avoid race condition when multiple queries refers to same cached plan. The granularity of lock is too large. We can try to fix the root cause of this concurrency issue by refactoring the usage of mutable `OP_ID_TAG`, which is not a good practice in terms of immutable nature of SparkPlan. Instead, we can use an auxiliary id map, with object identity as the key. The entire scope of `OP_ID_TAG` usage is within `ExplainUtils.processPlan`, therefore it's safe to do so, with thread local to make it available in other involved classes. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UTs. ### Was this patch authored or co-authored using generative AI tooling? NO Closes apache#46965 from liuzqt/SPARK-48610. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> (cherry picked from commit d3da240) Signed-off-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Ziqi Liu <ziqi.liu@databricks.com>
What changes were proposed in this pull request?
This pr adds lock for ExplainUtils.processPlan to avoid tag race condition.
Why are the changes needed?
To fix the issue SPARK-47177
Does this PR introduce any user-facing change?
yes, affect plan explain
How was this patch tested?
add test
Was this patch authored or co-authored using generative AI tooling?
no