Conversation

@DennisJLi
What changes were proposed in this pull request?

https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-34-to-35 states that `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` is enabled by default, but that is currently not true, which causes confusion and lost debugging time.

Notably, when working with cached DataFrames in Spark 3.5.0, I saw AQE not kick in until I set the flag manually.
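For context, here is a minimal sketch of the workaround I used on 3.5.0 (the app name and workload are hypothetical, purely for illustration): setting the flag explicitly so AQE can re-plan the cached plan's output partitioning.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-cached-plan-repro")
  // On Spark 3.5.0 this had to be set manually, despite the migration
  // guide describing it as the default.
  .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
  .getOrCreate()
import spark.implicits._

val df = spark.range(1000000)
  .groupBy(($"id" % 10).as("bucket"))
  .count()

df.cache()
df.count() // materializes the cache; with the flag on, AQE may adjust the cached plan's output partitioning
```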

Why are the changes needed?

This aligns the code with the documentation, which prevents confusion.

Does this PR introduce any user-facing change?

Yes, but nothing the documentation doesn't already state.

How was this patch tested?

Simple config change.

Was this patch authored or co-authored using generative AI tooling?

NO

Fix spark.sql.optimizer.canChangeCachedPlanOutputPartitioning Configuration Default to Match Documentation

https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-34-to-35 states that `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` is set by default, but that's currently not true, and causes confusion along with lost debugging time.
@github-actions github-actions bot added the SQL label Aug 28, 2024
@DennisJLi DennisJLi changed the title Fix spark.sql.optimizer.canChangeCachedPlanOutputPartitioning Configuration Default to Match Documentation [SPARK-41262][SQL] Fix spark.sql.optimizer.canChangeCachedPlanOutputPartitioning Configuration Default to Match Documentation Aug 28, 2024
@dongjoon-hyun (Member)

Do you know what causes this mismatch, @DennisJLi ?

BTW, for mismatches like this, we usually fix the documentation rather than the code, because we cannot introduce a breaking behavior change in the released Apache Spark 3.5.x line. However, if this is a meaningful change, we can still make it for Apache Spark 4.0.0.

@DennisJLi (Author)

Hi @dongjoon-hyun, good question. I figured it got dropped on the floor, but after some git blame investigation (sketched below the list), it looks like:

  1. The change setting this configuration to true was made in commit 1569ab5 (committed 2023-08-22).
  2. That change was then undone in https://github.com/apache/spark/commit/becc04a, presumably by a bad rebase (committed 2024-02-07).
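For anyone wanting to retrace this, here is a sketch of the investigation; the SQLConf.scala path is my assumption about where the config default is defined.

```bash
# Find commits that added or removed the config key (git pickaxe search).
git log --oneline -S canChangeCachedPlanOutputPartitioning -- \
  sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

git show 1569ab5   # flips the default to true (2023-08-22)
git show becc04a   # undoes it, presumably via a bad rebase (2024-02-07)
```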

Ideally a patch version wouldn't have reverted this, and there was no accompanying documentation change, so I think if we just fix this value we'd be back to the expected state.

What do you think, Dongjoon? Also @ulysses-you, since you made the original change, do you have any input here?

@DennisJLi (Author)

Sorry, my bad. Reading that other PR more closely, it intentionally disabled the value. Let me update the documentation instead; apologies for the confusion, I rushed ahead on that.

@DennisJLi DennisJLi changed the base branch from master to branch-3.5 August 28, 2024 17:29
@DennisJLi DennisJLi changed the base branch from branch-3.5 to master August 28, 2024 17:29
@DennisJLi (Author)

The documentation change has been published as a separate PR: #47915.

@DennisJLi DennisJLi closed this Aug 28, 2024
@DennisJLi DennisJLi deleted the patch-1 branch August 28, 2024 17:33