Skip to content

Conversation

@zhengchenyu
Copy link
Contributor

What changes were proposed in this pull request?

The key modifications are as follows:

  • Force to use staging directory in SQLHadoopMapReduceCommitProtocol and perform a rename from the source directory to the destination directory during the commit job phase. Dynamic partition overwrite and custom partition path have also been integrated into this process.

Note: SQLHadoopMapReduceCommitProtocol performs directory-based renames, because table operations are generally directory-based. For HadoopMapReduceCommitProtocol, I believe the logic for renaming directories should be removed, while the logic for renaming files should be retained (although it is not used).

  • Avoid deleting partitions before task runs and implement dynamic overwrite in SQLHadoopMapReduceCommitProtocol. To maintain compatibility with static mode, the corresponding partition files need to be deleted during refreshUpdatedPartitions.

  • Handle paths according to SaveMode in SQLHadoopMapReduceCommitProtocol.

Note: For ease of review, some code in HadoopMapReduceCommitProtocol has been retained. In fact, I think the parameter dynamicPartitionOverwrite and the code for renaming partition directories during the commit job phase are no longer meaningful and should be removed.

Why are the changes needed?

SparkSQL uses the partition location or table location as the commit path (except in dynamic partition overwrite mode and custom partition path mode). This has at least the following issues:

  • As described in SPARK-37210, conflicts can occur when multiple partitions job of the same table are run concurrently. Using a staging directory can avoid this issue.
  • As described in SPARK-53937, using a staging directory allows for near-atomic operations.

Dynamic partition overwrite mode and custom partition path mode already use the staging directory. And dynamic partition overwrite mode and custom partition path are implemented differently, which can be simplified into a unified process. And in #29000, reset the staging directory as the output directory of FileOutputCommitter. This way is more safer. It should be modified to this way.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests and newly added unit tests

@zhengchenyu zhengchenyu marked this pull request as draft October 24, 2025 07:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant