[SPARK-46995][DOCS][FOLLOWUP] Update `sql-migration-guide.md` documentation #47915

DennisJLi · 2024-08-28T17:32:56Z

What changes were proposed in this pull request?

Fixing the documentation.

Why are the changes needed?

The migration guide for 3.5.0 says this default is enabled. Upcoming changes for 4.0.0 will disable it, but no documentation update indicates this.

Does this PR introduce any user-facing change?

Yes, this fixes the documentation to align with actual Spark behavior introduced in becc04a.

How was this patch tested?

Documentation only change.

Was this patch authored or co-authored using generative AI tooling?

NO

dongjoon-hyun

As of now, it's true in branch-3.5. Do you mean it's enabled at 3.5.0 and disabled at 3.5.1 and re-enabled at 3.5.2?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Lines 1535 to 1545 in dcfefd0

    
    val CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING =
      buildConf("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning")
        .internal()
        .doc("Whether to forcibly enable some optimization rules that can change the output " +
          "partitioning of a cached query when executing it for caching. If it is set to true, " +
          "queries may need an extra shuffle to read the cached data. This configuration is " +
          "enabled by default. The optimization rules enabled by this configuration " +
          s"are ${ADAPTIVE_EXECUTION_ENABLED.key} and ${AUTO_BUCKETED_SCAN_ENABLED.key}.")
        .version("3.2.0")
        .booleanConf
        .createWithDefault(true)

DennisJLi · 2024-08-28T18:17:04Z

Thanks for the high quality bar, Dongjoon. You're right: every commit on branch-3.5 has it set to true, and inspecting the zip sources for each tag shows the same.

After further investigation, the issue I was experiencing seems to come from the JARs that AWS publishes on EMR differing from what is in the Spark source.

I'm running on EMR 7.2.0, where AWS states that they vend Spark 3.5.1. I pulled down /usr/lib/spark/jars/spark-streaming_2.12-3.5.1-amzn-0.jar, and this is what the Scala decompiles to:

        this.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING = this
            .buildConf("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning")
            .internal()
            .doc((new StringBuilder(326))
                .append("Whether to forcibly enable some optimization rules that can change the output partitioning of a cached query when executing it for caching. If it is set to true, queries may need an extra shuffle to read the cached data. This configuration is enabled by default. The optimization rules enabled by this configuration ")
                .append("are ")
                .append(this.ADAPTIVE_EXECUTION_ENABLED().key())
                .append(" and ")
                .append(this.AUTO_BUCKETED_SCAN_ENABLED().key())
                .append(".")
                .toString())
            .version("3.2.0")
            .booleanConf()
            .createWithDefault(BoxesRunTime.boxToBoolean(false));

I'm not sure why they changed it, but it's my bad for not first checking that the AWS-vended JAR matched the source; I was misled by the value on the master branch being false. Thankfully, this does explain the behavior I was seeing.
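One way to catch this kind of mismatch early is to ask the running JVM which JAR a class was loaded from, and what value the session actually reports. The following is a minimal sketch and not part of this PR: `ConfOrigin` and `codeSourceOf` are hypothetical names, and the commented lines assume a `spark-shell` session where `spark` is the usual `SparkSession` value.

```scala
import java.net.URL

object ConfOrigin {
  /** Location (JAR or directory) a class was loaded from.
    * Returns None when the JVM reports no code source, as it does for JDK bootstrap classes.
    */
  def codeSourceOf(cls: Class[_]): Option[URL] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url
}

// Assumed usage inside spark-shell on the cluster:
//   ConfOrigin.codeSourceOf(classOf[org.apache.spark.sql.internal.SQLConf])
//   spark.version
//   spark.conf.get("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning")
```

On a vendor distribution, the reported location should show whether `SQLConf` comes from a vendor-suffixed JAR, and `spark.conf.get` should report the value that build actually uses when the key has not been set explicitly.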

I will change this PR to instead add a note to the 3.5 to 4.0.0 migration documentation that this flag is disabled again, since I don't see that documented yet.

DennisJLi · 2024-08-28T18:26:21Z

Commit has been updated to include 4.0.0 documentation.

dongjoon-hyun

cc @liuzqt , @cloud-fan , @yaooqinn from #45054

yaooqinn · 2024-08-29T03:37:34Z

spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is an internal config. We rarely add migration guides for private things. The only case I know is that we add legacy/internal configs to restore public behaviors.

Also, documenting whether it is true or false does not capture the underlying changes made by #45054. Those changes are unlikely to be noticed by users.

dongjoon-hyun · 2024-08-29T05:30:36Z

Oh, right. I missed that this is an internal conf. In that case, ya, we can ignore this.

Stale

github-actions bot added the DOCS label Aug 28, 2024

DennisJLi mentioned this pull request Aug 28, 2024

[SPARK-41262][SQL] Fix spark.sql.optimizer.canChangeCachedPlanOutputPartitioning Configuration Default to Match Documentation #47914

Closed

dongjoon-hyun requested changes Aug 28, 2024

Update sql-migration-guide.md

8a7797d

DennisJLi force-pushed the update-documentation branch from d1001fb to 8a7797d Compare August 28, 2024 18:25

dongjoon-hyun reviewed Aug 28, 2024

dongjoon-hyun changed the title ~~Update sql-migration-guide.md documentation to include disablement of spark.sql.optimizer.canChangeCachedPlanOutputPartitioning in 4.0.0~~ [SPARK-46995][DOCS][FOLLOWUP] Update sql-migration-guide.md documentation Aug 28, 2024

dongjoon-hyun previously approved these changes Aug 28, 2024

DennisJLi closed this Aug 29, 2024
