Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Enable shuffle by default #881

Merged
merged 12 commits into from
Aug 29, 2024
Merged

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Aug 28, 2024

Which issue does this PR close?

N/A

Rationale for this change

We would like to enalbe shuffle by default.

What changes are included in this PR?

  • Change default value of spark.comet.exec.shuffle.enabled from false to true
  • Explicitly enable shuffle in CometTestBase
  • Update some aggregate tests to disable shuffle (the tests will need updating if we enable shuffle)
  • Add an ORDER BY clause to some queries to make the tests deterministic

How are these changes tested?

@andygrove andygrove marked this pull request as draft August 28, 2024 23:38
Comment on lines +78 to 79
// TODO we should no longer be disabling COALESCE_PARTITIONS_ENABLED
conf.set(SQLConf.COALESCE_PARTITIONS_ENABLED.key, "false")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but we should fix this in a separate PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it in https://github.com/apache/datafusion-comet/pull/553/files#r1730694991, in order to trigger a test case failed in current code.

@andygrove andygrove marked this pull request as ready for review August 29, 2024 04:56
@@ -50,8 +50,7 @@ Comet provides the following configuration settings.
| spark.comet.exec.memoryFraction | The fraction of memory from Comet memory overhead that the native memory manager can use for execution. The purpose of this config is to set aside memory for untracked data structures, as well as imprecise size estimation during memory acquisition. Default value is 0.7. | 0.7 |
| spark.comet.exec.project.enabled | Whether to enable project by default. | true |
| spark.comet.exec.shuffle.codec | The codec of Comet native shuffle used to compress shuffle data. Only zstd is supported. | zstd |
| spark.comet.exec.shuffle.enabled | Whether to enable Comet native shuffle. By default, this config is false. Note that this requires setting 'spark.shuffle.manager' to 'org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager'. 'spark.shuffle.manager' must be set before starting the Spark application and cannot be changed during the application. | false |
| spark.comet.exec.shuffle.mode | The mode of Comet shuffle. This config is only effective if Comet shuffle is enabled. Available modes are 'native', 'jvm', and 'auto'. 'native' is for native shuffle which has best performance in general. 'jvm' is for jvm-based columnar shuffle which has higher coverage than native shuffle. 'auto' is for Comet to choose the best shuffle mode based on the query plan. By default, this config is 'auto'. | auto |
| spark.comet.exec.shuffle.enabled | Whether to enable Comet native shuffle. By default, this config is false. Note that this requires setting 'spark.shuffle.manager' to 'org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager'. 'spark.shuffle.manager' must be set before starting the Spark application and cannot be changed during the application. | true |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| spark.comet.exec.shuffle.enabled | Whether to enable Comet native shuffle. By default, this config is false. Note that this requires setting 'spark.shuffle.manager' to 'org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager'. 'spark.shuffle.manager' must be set before starting the Spark application and cannot be changed during the application. | true |
| spark.comet.exec.shuffle.enabled | Whether to enable Comet native shuffle. By default, this config is true. Note that this requires setting 'spark.shuffle.manager' to 'org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager'. 'spark.shuffle.manager' must be set before starting the Spark application and cannot be changed during the application. | true |

@@ -189,7 +189,7 @@ object CometConf extends ShimCometConf {
"'spark.shuffle.manager' must be set before starting the Spark application and " +
"cannot be changed during the application.")
.booleanConf
.createWithDefault(false)
.createWithDefault(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to fix default value in description

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably automate adding the text stating the default value. Thanks for catching that.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @andygrove some minors for defaults

@andygrove andygrove merged commit be10fee into apache:main Aug 29, 2024
74 checks passed
@andygrove andygrove deleted the enable-shuffle branch August 29, 2024 18:49
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* enable shuffle by default

* disable shuffle in CometTestBase

* format

* fix regressions

* fix

* fix more

* fix more

* fix regression

* fix regressions

* Revert refactor

* format

* update docs

(cherry picked from commit be10fee)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants