Spark: Custom snapshot property from session configuration #12999

Closed
cccs-jory wants to merge 8 commits into apache:main from cccs-jory:spark-session-config-custom-snapshot-property

Conversation

@cccs-jory

Adds support for specifying custom snapshot properties in the Spark session configuration. This allows users to add custom snapshot properties even when running Spark SQL DML such as DELETE or MERGE INTO, which was not previously supported.
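
To make the idea concrete, here is a minimal sketch of how prefixed session configuration entries could be mapped to snapshot properties. It is not the code from this PR: the class name `SnapshotPropertySketch` and the method `extractSnapshotProperties` are hypothetical, the prefix is inferred from the test key quoted later in this thread, and a plain `Map` stands in for the Spark session configuration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not code from this PR: shows one way prefixed session
// configuration entries could be turned into snapshot properties.
class SnapshotPropertySketch {

  // Assumption: prefix inferred from the test key quoted in this thread.
  static final String PREFIX = "spark.sql.iceberg.snapshot-property.";

  // Keeps only entries whose key starts with PREFIX and strips the prefix
  // from each retained key. All other session settings are ignored.
  static Map<String, String> extractSnapshotProperties(Map<String, String> sessionConf) {
    Map<String, String> result = new TreeMap<>();
    for (Map.Entry<String, String> entry : sessionConf.entrySet()) {
      String key = entry.getKey();
      if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
        result.put(key.substring(PREFIX.length()), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A plain map stands in for the Spark session configuration here.
    Map<String, String> sessionConf = new TreeMap<>();
    sessionConf.put("spark.sql.iceberg.snapshot-property.test-key", "session-value");
    sessionConf.put("spark.sql.shuffle.partitions", "200");
    System.out.println(extractSnapshotProperties(sessionConf));
  }
}
```

Under these assumptions, every commit made through a session carrying such keys would get the same extra properties, which matches the session-wide use case described in this PR rather than per-query control.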

@guykhazma
Contributor

guykhazma commented May 7, 2025

Considering that the goal is to be able to set it dynamically for each query, I think it would be better if there were a way to include it in the query itself instead of having it as a session configuration.
I don't know of a way to pass properties using the SQL syntax, but I wonder whether the hinting mechanism in Spark or something similar (if one exists) could be used to communicate these kinds of properties dynamically as part of the query.

@cccs-jory
Author

> Considering that the goal is to be able to set it dynamically for each query

I don't think that's necessarily the goal. For example, an org may create a Spark session that executes DML on many tables, thereby creating many commits. That Spark session could set this configuration and have its custom snapshot property applied to each table's commits. Setting it dynamically in Spark SQL could be a different task?

Contributor

@singhpk234 singhpk234 left a comment

Similar requirements did come up in the past: #4956. Is this not sufficient, or do you want this specifically with the SQLConfs?

Can you please elaborate on your use case:
is the concern that this is not supported for DELETE / MERGE and only for INSERT? If yes, we should fix that.

@cccs-jory
Author

> Similar requirements did come up in the past: #4956. Is this not sufficient, or do you want this specifically with the SQLConfs?
>
> Can you please elaborate on your use case: is the concern that this is not supported for DELETE / MERGE and only for INSERT? If yes, we should fix that.

withCommitProperties is only accessible via Java. If teams run PySpark, there is no way to access it (other than through py4j).

@cccs-jory cccs-jory requested a review from singhpk234 May 12, 2025 17:25
@RussellSpitzer
Member

Please target only one Spark version in this PR for easier reviewing; we can do a fast follow-up that changes earlier versions after we merge this.

  @Test
  public void testExtraSnapshotMetadataReflectsSessionConfig() {
    withSQLConf(
        ImmutableMap.of("spark.sql.iceberg.snapshot-property.test-key", "test-value"),
Member

Trivial nit: "test-value" -> "session-value" so it mimics the test below. This is the tiniest of nits so feel free to ignore.

Member

@RussellSpitzer RussellSpitzer left a comment

I have a tiny nit, and since these changes are so small I don't think we need to split apart the PR into different versions.

@RussellSpitzer
Member

Looks like there is a build issue @cccs-jory, please update and let me know when to run tests again :)

@cccs-jory
Author

cccs-jory commented May 14, 2025

@RussellSpitzer I removed the feature on 3.4 and addressed the nitpick; however, I may need some more eyes to fix the build issue.

Essentially, gradlew is failing when compiling on 3.5 because it can't find Spark's RuntimeConfig.getAll(), even though the Javadocs clearly show that it exists. Even if I switch out sessionConf.getAll() for spark.conf().getAll(), it still fails. This is my first contribution, so I'm not super familiar with the intricacies of Spark's core.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 13, 2025
@cccs-jory cccs-jory requested a review from RussellSpitzer July 14, 2025 12:11
@github-actions github-actions bot removed the stale label Jul 15, 2025
@anuragmantri
Contributor

I think this is still relevant and not stale. The default version of Spark changed to 4.0, so we may want to move the changes to that version. @cccs-jory, could you address the comments and update the PR?

Member

@RussellSpitzer RussellSpitzer left a comment

Any follow-up on this? Looks like we are just waiting on the change from @singhpk234?

@cccs-jory
Author

Yep, my bad, time is in short supply around here. I'll get that in soon.

@RussellSpitzer
Member

:) We are all oversubscribed for sure

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 11, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 19, 2025
@owen6314
Contributor

owen6314 commented Oct 16, 2025

Any update on this PR? We've found this support for custom snapshot properties from session configuration helpful for our use cases, and we're wondering whether anything is blocking the merge.
@cccs-jory I'm also happy to help move this forward if you'd like.

@cccs-jory
Author

@owen6314 It would be great if someone could pick this up and finalize it. I have not had a chance to get back around to it.
