Skip to content

Conversation

@beliefer
Copy link
Contributor

What changes were proposed in this pull request?

This PR propose to make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer.

Why are the changes needed?

The default value of spark.sql.adaptive.coalescePartitions.parallelismFirst is true, but the document contains the word recommended to set this config to false and respect the configured target size. It's very confused.

Does this PR introduce any user-facing change?

'Yes'.
The document is more clear.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

'No'.

Comment on lines 724 to 725
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion:

Suggested change
"regression when enabling adaptive query execution. If you respect the configured " +
"target size, please set this config to false.")
"regressions when enabling adaptive query execution. To respect the configured " +
"target size, please set this config to false.")

@nchammas
Copy link
Contributor

Note that until we adopt some approach to automate the generation of config tables in our docs (e.g. like in #44755 or #44756), you will need to manually update the HTML here so it stays in sync with the source:

When true, Spark ignores the target size specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by <code>spark.sql.adaptive.coalescePartitions.minPartitionSize</code> (default 1MB), to maximize the parallelism. This is to avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the target size specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>.

@beliefer
Copy link
Contributor Author

cc @cloud-fan

@beliefer
Copy link
Contributor Author

cc @MaxGekk @gengliangwang

"shuffle partitions, but adaptively calculate the target size according to the default " +
"parallelism of the Spark cluster. The calculated size is usually smaller than the " +
"configured target size. This is to maximize the parallelism and avoid performance " +
"regression when enabling adaptive query execution. It's recommended to set this config " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maryannxue is it really recommended?

Copy link
Contributor

@cloud-fan cloud-fan Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just say It's recommended to set this config to true on a busy cluster to make resource utilization more efficient (not many small tasks).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion contains a mistake right? It should be set to false in a busy cluster? #45437

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I'm not sure why it says "it's recommended to be set to false" when that's not the default and 'true' is the safer default.

But this text still really doesn't help you decide anything. What does 'parallelism first' even mean? we can't change that name, but we can explain it.

Instead of all of the text starting at "This is to avoid ...", which isn't even that accurate (you can have a perf problem either way), let's continue:

"This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false"

@beliefer
Copy link
Contributor Author

"This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false"

This is very well.

@cloud-fan
Copy link
Contributor

This is true by default because in benchmarks we run one query at a time and it can use all the resources of the entire cluster. So parallelism is more important, as if we pick 64mb as the target size and save some CPU slots, no other tasks will use these free slots.

@srowen
Copy link
Member

srowen commented Feb 1, 2024

(Re run the tests)
Is the text OK @cloud-fan or are you suggesting a modification?

@srowen
Copy link
Member

srowen commented Feb 3, 2024

Merged to master

@beliefer
Copy link
Contributor Author

beliefer commented Feb 4, 2024

@srowen @cloud-fan Thank you!

HyukjinKwon pushed a commit that referenced this pull request Mar 11, 2024
…First

This changes the second `true` to `false` to make the doc comment correct. A setting of false will mean to not prioritize parallism and that will lead to less small tasks. Seems like it incorrectness was introduced here: #44787

### What changes were proposed in this pull request?

Documentation fix.

### Why are the changes needed?

Current doc is wrong.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45437 from eejbyfeldt/fix-minor-doc-comment-on-parallism-first.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants