[SPARK-46760][SQL][DOCS] Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer #44787

beliefer · 2024-01-18T13:01:43Z

What changes were proposed in this pull request?

This PR proposes to make the documentation of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer.

Why are the changes needed?

The default value of spark.sql.adaptive.coalescePartitions.parallelismFirst is true, but the documentation says it is recommended to set this config to false and respect the configured target size. This is very confusing.

Does this PR introduce any user-facing change?

'Yes'.
The documentation is clearer.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

'No'.

nchammas · 2024-01-18T14:17:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Minor suggestion:

Suggested change

-        "regression when enabling adaptive query execution. If you respect the configured " +
-        "target size, please set this config to false.")
+        "regressions when enabling adaptive query execution. To respect the configured " +
+        "target size, please set this config to false.")

nchammas · 2024-01-18T14:22:09Z

Note that until we adopt some approach to automate the generation of config tables in our docs (e.g. like in #44755 or #44756), you will need to manually update the HTML here so it stays in sync with the source:

spark/docs/sql-performance-tuning.md

Line 270 in 977f64f

    
              When true, Spark ignores the target size specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by <code>spark.sql.adaptive.coalescePartitions.minPartitionSize</code> (default 1MB), to maximize the parallelism. This is to avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the target size specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.
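To make the described behavior concrete, here is a simplified model of how the coalescing target size could be chosen under each setting. It is a sketch based only on the descriptions in this thread (ignore the advisory size, respect the minimum size, and derive the target from the default parallelism), not Spark's actual implementation. The function name `effective_target_size` and its parameters are hypothetical.

```python
import math

MB = 1024 * 1024


def effective_target_size(total_shuffle_bytes, default_parallelism,
                          parallelism_first=True,
                          advisory_size=64 * MB,
                          min_partition_size=1 * MB):
    """Simplified model of the coalescing target size (hypothetical helper).

    Assumption: with parallelism_first=True the data is spread over
    `default_parallelism` partitions; with False only the advisory size
    limits how large a coalesced partition may grow.
    """
    min_partitions = default_parallelism if parallelism_first else 1
    # Size each partition would have if the data were split evenly.
    per_partition = math.ceil(total_shuffle_bytes / min_partitions)
    # Never exceed the advisory size, never drop below the minimum size.
    return max(min(per_partition, advisory_size), min_partition_size)
```

Under this model, 1 GiB of shuffle data on a cluster with a default parallelism of 200 yields a target of roughly 5 MB when the flag is true and the full 64 MB when it is false.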

beliefer · 2024-01-21T09:01:33Z

cc @cloud-fan

beliefer · 2024-01-24T11:14:52Z

cc @MaxGekk @gengliangwang

cloud-fan · 2024-01-24T13:42:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

        "shuffle partitions, but adaptively calculate the target size according to the default " +
        "parallelism of the Spark cluster. The calculated size is usually smaller than the " +
        "configured target size. This is to maximize the parallelism and avoid performance " +
-        "regression when enabling adaptive query execution. It's recommended to set this config " +


@maryannxue is it really recommended?

Maybe just say: "It's recommended to set this config to true on a busy cluster to make resource utilization more efficient (not many small tasks)."

This suggestion contains a mistake, right? It should be set to false on a busy cluster? #45437

srowen

Agree. I'm not sure why it says "it's recommended to be set to false" when that's not the default and 'true' is the safer default.

But this text still really doesn't help you decide anything. What does 'parallelism first' even mean? We can't change that name, but we can explain it.

Instead of all of the text starting at "This is to avoid ...", which isn't even that accurate (you can have a perf problem either way), let's continue:

"This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false"

beliefer · 2024-01-31T01:16:07Z

"This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false"

This reads very well.
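As a rough illustration of the choice in the proposed wording, the following sketch builds the overrides one might pass for the "typical" case, where coalescing toward the advisory size is preferred. The config keys are the ones named in this thread. The helper names `coalesce_conf` and `as_submit_args`, and the idea of rendering the settings as `--conf` arguments for `spark-submit`, are illustrative assumptions.

```python
def coalesce_conf(parallelism_first, advisory_size="64MB"):
    """Return AQE coalescing settings as a dict (hypothetical helper)."""
    return {
        "spark.sql.adaptive.coalescePartitions.parallelismFirst":
            str(parallelism_first).lower(),
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": advisory_size,
    }


def as_submit_args(conf):
    """Render settings as `--conf key=value` arguments, sorted by key."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Typical case from the proposed text: prefer fewer, larger tasks.
print(" ".join(as_submit_args(coalesce_conf(parallelism_first=False))))
```

For workloads where even small partitions are expensive to compute, the same helper can be called with `parallelism_first=True`, which matches the default.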

cloud-fan · 2024-02-01T03:44:12Z

This is true by default because in benchmarks we run one query at a time, and it can use all the resources of the entire cluster. Parallelism is therefore more important: if we pick 64 MB as the target size and save some CPU slots, no other tasks will use those free slots.
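The arithmetic behind this argument can be sketched as follows. The numbers (1 GiB of shuffle data, 200 CPU slots) are made up for illustration, and `tasks_and_idle_slots` is a hypothetical helper. It only counts tasks for a given target size and does not model how Spark schedules work.

```python
import math

MB = 1024 * 1024


def tasks_and_idle_slots(total_shuffle_bytes, target_size, cpu_slots):
    """Count post-coalesce tasks and the slots left idle in a single wave."""
    tasks = math.ceil(total_shuffle_bytes / target_size)
    idle = max(cpu_slots - tasks, 0)
    return tasks, idle


total = 1024 * MB   # illustrative shuffle size
slots = 200         # illustrative cluster size

# Respecting the 64 MB advisory size: few tasks, most slots sit idle.
print(tasks_and_idle_slots(total, 64 * MB, slots))
# Sizing partitions so that every slot gets one task: no idle slots.
print(tasks_and_idle_slots(total, math.ceil(total / slots), slots))
```

When a single query owns the whole cluster, the idle slots in the first case are simply wasted. On a shared cluster, other queries could use them, which is the situation where setting the config to false makes sense.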

srowen · 2024-02-01T04:58:27Z

(Re run the tests)
Is the text OK @cloud-fan or are you suggesting a modification?

srowen · 2024-02-03T15:07:17Z

Merged to master

beliefer · 2024-02-04T01:55:41Z

@srowen @cloud-fan Thank you!

…First

This changes the second `true` to `false` to make the doc comment correct. A setting of false means parallelism is not prioritized, which leads to fewer small tasks.

It seems the incorrectness was introduced here: #44787

### What changes were proposed in this pull request?
Documentation fix.

### Why are the changes needed?
The current doc is wrong.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45437 from eejbyfeldt/fix-minor-doc-comment-on-parallism-first.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added the SQL label Jan 18, 2024

beliefer force-pushed the SPARK-46760 branch from 0724d71 to 517ba2f Compare January 18, 2024 13:11

nchammas reviewed Jan 18, 2024

[SPARK-46760][SQL][DOCS] Make the document of spark.sql.adaptive.coal…

7e5b76a

…escePartitions.parallelismFirst clearer

beliefer force-pushed the SPARK-46760 branch from 517ba2f to 7e5b76a Compare January 19, 2024 08:03

github-actions bot added the DOCS label Jan 19, 2024

nchammas approved these changes Jan 19, 2024

cloud-fan reviewed Jan 24, 2024

srowen requested changes Jan 30, 2024

Update code

a6b49cd

Update code

09f1b1d

beliefer force-pushed the SPARK-46760 branch from 90ddc5b to 09f1b1d Compare February 3, 2024 03:11

srowen closed this in 9d4d41c Feb 3, 2024

eejbyfeldt mentioned this pull request Mar 8, 2024

[MINOR][DOCS][SQL] Fix doc comment for coalescePartitions.parallelismFirst #45437

Closed
