Skip to content

Conversation

@DennisJLi
Copy link

@DennisJLi DennisJLi commented Aug 28, 2024

What changes were proposed in this pull request?

Fixing the documentation.

Why are the changes needed?

Migration guide for 3.5.0 said a default was enabled, but upcoming changes for 4.0.0 will disable it but there are no documentation updates indicating this.

Does this PR introduce any user-facing change?

Yes, this fixes the documentation to align with actual Spark behavior introduced in becc04a.

How was this patch tested?

Documentation only change.

Was this patch authored or co-authored using generative AI tooling?

NO

@github-actions github-actions bot added the DOCS label Aug 28, 2024
@DennisJLi DennisJLi changed the title Update sql-migration-guide.md documentation to include new spark.sql.optimizer.canChangeCachedPlanOutputPartitioning in 3.5.1 Update sql-migration-guide.md documentation to include disablement of spark.sql.optimizer.canChangeCachedPlanOutputPartitioning in 3.5.1 Aug 28, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, it's true in branch-3.5. Do you mean it's enabled at 3.5.0 and disabled at 3.5.1 and re-enabled at 3.5.2?

val CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING =
buildConf("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning")
.internal()
.doc("Whether to forcibly enable some optimization rules that can change the output " +
"partitioning of a cached query when executing it for caching. If it is set to true, " +
"queries may need an extra shuffle to read the cached data. This configuration is " +
"enabled by default. The optimization rules enabled by this configuration " +
s"are ${ADAPTIVE_EXECUTION_ENABLED.key} and ${AUTO_BUCKETED_SCAN_ENABLED.key}.")
.version("3.2.0")
.booleanConf
.createWithDefault(true)

@DennisJLi
Copy link
Author

Thanks for the high quality bar Dongjoon, you're right, all of the commits on branch-3.5 have it set to true and inspecting all of the zip sources from the tags shows that's the case too.

After further investigation, the issue I was experiencing seems to be due to JARs published by AWS on EMR differing from that in the Spark source.

Running on EMR 7.2.0, AWS states that they vend Spark 3.5.1. So I pulled down /usr/lib/spark/jars/spark-streaming_2.12-3.5.1-amzn-0.jar and this is what the Scala decompiled down into.

        this.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING = this.buildConf("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning").internal().doc((new StringBuilder(326)).append("Whether to forcibly enable some optimization rules that can change the output partitioning of a cached query when executing it for caching. If it is set to true, queries may need an extra shuffle to read the cached data. This configuration is enabled by default. The optimization rules enabled by this configuration ").append("are ").append(this.ADAPTIVE_EXECUTION_ENABLED().key()).append(" and ").append(this.AUTO_BUCKETED_SCAN_ENABLED().key()).append(".").toString()).version("3.2.0").booleanConf().createWithDefault(BoxesRunTime.boxToBoolean(false));

I'm not sure why they changed it, but it's my bad for not checking to make sure the AWS vended JAR was the same first; I got mislead by the value on the master branch being false. Thankfully this does explain the behavior I was seeing.

I will change in PR to instead to include a note in the 3.5 to 4.0.0 documentation that this flag has been redisabled since I don't see that present yet.

@DennisJLi DennisJLi force-pushed the update-documentation branch from d1001fb to 8a7797d Compare August 28, 2024 18:25
@DennisJLi
Copy link
Author

Commit has been updated to include 4.0.0 documentation.

@DennisJLi DennisJLi changed the title Update sql-migration-guide.md documentation to include disablement of spark.sql.optimizer.canChangeCachedPlanOutputPartitioning in 3.5.1 Update sql-migration-guide.md documentation to include disablement of spark.sql.optimizer.canChangeCachedPlanOutputPartitioning in 4.0.0 Aug 28, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun dongjoon-hyun changed the title Update sql-migration-guide.md documentation to include disablement of spark.sql.optimizer.canChangeCachedPlanOutputPartitioning in 4.0.0 [SPARK-46995][DOCS][FOLLOWUP] Update sql-migration-guide.md documentation Aug 28, 2024
dongjoon-hyun
dongjoon-hyun previously approved these changes Aug 28, 2024
@yaooqinn
Copy link
Member

yaooqinn commented Aug 29, 2024

spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is an internal config. We rarely add migration guides for private things. The only case I know is that we add legacy/internal configs to restore public behaviors.

Also, documenting whether it is true or false does not capture the underlying changes made by #45054. Those changes are unlikely to be noticed by users.

@dongjoon-hyun
Copy link
Member

Oh, right. I missed that this is an internal conf. In that case, ya, we can ignore this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants