-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-46760][SQL][DOCS] Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer #44787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion:
| "regression when enabling adaptive query execution. If you respect the configured " + | |
| "target size, please set this config to false.") | |
| "regressions when enabling adaptive query execution. To respect the configured " + | |
| "target size, please set this config to false.") |
|
Note that until we adopt some approach to automate the generation of config tables in our docs (e.g. like in #44755 or #44756), you will need to manually update the HTML here so it stays in sync with the source: spark/docs/sql-performance-tuning.md Line 270 in 977f64f
|
…escePartitions.parallelismFirst clearer
|
cc @cloud-fan |
| "shuffle partitions, but adaptively calculate the target size according to the default " + | ||
| "parallelism of the Spark cluster. The calculated size is usually smaller than the " + | ||
| "configured target size. This is to maximize the parallelism and avoid performance " + | ||
| "regression when enabling adaptive query execution. It's recommended to set this config " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maryannxue is it really recommended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just say It's recommended to set this config to true on a busy cluster to make resource utilization more efficient (not many small tasks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion contains a mistake right? It should be set to false in a busy cluster? #45437
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I'm not sure why it says "it's recommended to be set to false" when that's not the default and 'true' is the safer default.
But this text still really doesn't help you decide anything. What does 'parallelism first' even mean? we can't change that name, but we can explain it.
Instead of all of the text starting at "This is to avoid ...", which isn't even that accurate (you can have a perf problem either way), let's continue:
"This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false"
This is very well. |
|
This is true by default because in benchmarks we run one query at a time and it can use all the resources of the entire cluster. So parallelism is more important, as if we pick 64mb as the target size and save some CPU slots, no other tasks will use these free slots. |
|
(Re run the tests) |
|
Merged to master |
|
@srowen @cloud-fan Thank you! |
…First This changes the second `true` to `false` to make the doc comment correct. A setting of false will mean to not prioritize parallism and that will lead to less small tasks. Seems like it incorrectness was introduced here: #44787 ### What changes were proposed in this pull request? Documentation fix. ### Why are the changes needed? Current doc is wrong. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45437 from eejbyfeldt/fix-minor-doc-comment-on-parallism-first. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR propose to make the document of
spark.sql.adaptive.coalescePartitions.parallelismFirstclearer.Why are the changes needed?
The default value of
spark.sql.adaptive.coalescePartitions.parallelismFirstis true, but the document contains the wordrecommended to set this config to false and respect the configured target size. It's very confused.Does this PR introduce any user-facing change?
'Yes'.
The document is more clear.
How was this patch tested?
N/A
Was this patch authored or co-authored using generative AI tooling?
'No'.