Spark 3.3: support version travel by reference name#6575
Spark 3.3: support version travel by reference name#6575jackye1995 merged 5 commits intoapache:masterfrom jackye1995:tag-version-travel
Conversation
| ValidationException.check( | ||
| ref != null, | ||
| "Cannot find matching snapshot ID or reference name for version " + version); | ||
| return sparkTable.copyWithSnapshotId(ref.snapshotId()); |
There was a problem hiding this comment.
It always use the latest commit from the reference.
I think we also need to provide a way to time travel to a snapshot within a branch/tag?
So along with existing of FOR SYSTEM_VERSION AS OF snapshotId
we should have FOR SYSTEM_VERSION AS OF snapshotId@refName
But whether to use '@' or some other syntax is an open point for a long time which @rdblue wanted to conclude.
Nessie SQL syntax for reference:
https://projectnessie.org/tools/sql/#grammar
There was a problem hiding this comment.
Never mind.
After thinking a bit more about it and reviewing #6573, As the snapshot log contains all the snapshots from all the branches/tags. If we want to use any particular snapshot, we can directly use snapshot-id without specifying branch/tag information. So, no need of snapshotId@refName syntax
There was a problem hiding this comment.
I think one thing that might be useful is to time travel in a branch, something like FOR SYSTEM_VERSION AS OF branchName@123456789012. But that feels very hacky, I' rather have some syntax as we have been suggesting like SELECT * FROM table BRANCH branch FOR SYSTEM_TIME AS OF xxx. So I am leaving that part out of the implementation for now. At least I think most people can agree that a tag/branch head can be viewed as a version to travel to.
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java
Outdated
Show resolved
Hide resolved
|
@aokolnychyi @RussellSpitzer @rdblue any opinions about this support? |
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java
Outdated
Show resolved
Hide resolved
| "Cannot do time-travel based on both table identifier and AS OF"); | ||
|
|
||
| return sparkTable.copyWithSnapshotId(Long.parseLong(version)); | ||
| try { |
There was a problem hiding this comment.
I'm not sure this can come up, but do we allow version tags to be SnapshotIds?
Like can I tag snapshot 2 to be known as 1?
Weird edge case so I don't think we really need to handle it, just thinking if this is a potential issue with the lookup code here
There was a problem hiding this comment.
Currently there's no restrictions on what references can be named. For the lookup code, I think we should always be able to differentiate between snapshot ID and ref since for refs it will be in a quoted identifier, and should always fail the Long.parseLong() with a NumberParseException. So the current implementation seems good to me.
But that's just me reading the code :), I think it's worth having a unit test just for this case to give us that confidence that it works as expected in this scenario. cc @jackye1995 let me know your thoughts
There was a problem hiding this comment.
I added a test case specifically for this. Unlike Trino, Spark directly ignores the type of the VERSION AS OF, so if a tag name matches exactly the snapshot ID, then snapshot ID is always chosen.
I think this is a okay limitation, because people can work around it by adding some text like snapshot-123456890 as the tag name. But we should make it very clear in documentation.
There was a problem hiding this comment.
Yeah I don't want this to be a blocker, just something to take note of.
|
we also need to update this SQL syntax in |
I think @amogh-jahagirdar is working on a more detailed doc for branching and tagging, I will leave this part to him 😝 |
|
Thanks for the review everyone, I think there are enough votes and all concerns are addressed, I will go ahead merging this PR! |
Similar to the Trino PR I am trying to push trinodb/trino#15646, support using reference name for the
VERSION AS OFclause.We have a related PR #5294 by @hililiwei that tries to directly add the reference info in table name, while we are consolidating that experience, I think we can first have this feature added in parallel.
@amogh-jahagirdar