Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python): make table version always latest before doing merge #1924

Closed

Conversation

ion-elgreco
Copy link
Collaborator

Description

Small trivial change, but enforces table is always latest version before executing MERGE.

@github-actions github-actions bot added the binding/python Issues for the Python package label Nov 30, 2023
@ion-elgreco ion-elgreco enabled auto-merge (squash) November 30, 2023 09:41
@roeap
Copy link
Collaborator

roeap commented Nov 30, 2023

is this something we want to do?

I would assume in actual user code, there may be some previous operations that took the current table state and may have derived some operation from it. At least I would expect that once I loaded the table, it remains at that state util I tell it otherwise :).

@ion-elgreco
Copy link
Collaborator Author

@roeap if it's not the latest version while executing merge, you will get very strange errors regarding the schema which don't make sense. Also we are doing this in the writer as well, so we update incremental before writing.

Also, I am not sure if it even would make sense to load an older version and then merge on those files, what would the expected result be there? If you want to achieve this I think you need to restore first to that version and then merge.

@roeap
Copy link
Collaborator

roeap commented Nov 30, 2023

hmm .. generally speaking all query engines that I am aware of treat the table state as a snapshot. Our DeltaTable abstraction right now is somewhere between an actual snapshot and a snapshot factory.

I would have to look at the actual code paths again, to see what we are doing but generally speaking doing "hidden" updates can be quite dangerous, since the conflict resolution makes certain assumptions - which may not be relevant here. In this case I guess its not too bad, since it is before planning the actual merge.

However since its literally one line, could users not just do that call before themselves?

Maybe @wjones127 has an opinion on that?

@rtyler
Copy link
Member

rtyler commented Dec 2, 2023

This reminds me a lot of the issue with #1863 that I hit. I agree with @roeap's assessment here that this could be dangerous. Would it it be possible to peek at the latest table state instead and determine if loaded_version != peeked_current_version and throw an error in that case?

While I agree this is a user responsibility, the fact that @ion-elgreco hit a problem here tells me the API is still not as safe as it could be 😄

@ion-elgreco
Copy link
Collaborator Author

ion-elgreco commented Dec 2, 2023

@rtyler that makes sense. I don't think we have exposed something like that, but I guess this is the correct function we want to use:

pub async fn peek_next_commit(

I'll make the change tomorrow :)

@wjones127
Copy link
Collaborator

Would it it be possible to peek at the latest table state instead and determine if loaded_version != peeked_current_version and throw an error in that case?

Doing that can be really annoying if there are concurrent writers, though.

@ion-elgreco
Copy link
Collaborator Author

@wjones127 what shall we do here? Because in the writes we do update the table incrementally and then write

@emcake
Copy link
Contributor

emcake commented Dec 20, 2023

@wjones127 what shall we do here? Because in the writes we do update the table incrementally and then write

I actually came across this in another setting and was going to propose making the update in write_deltalake optional.

Part of the problem here is if you want to make serializable commits, you need a way to tie the data you're about to write back to its 'source' version for the serializability check. But by updating the table to latest, you're almost always asserting that the source of your read was the latest table.

Imagine you have a table with one column with a series of integers in them. In the presence of two concurrent writers A and B, you might have A updates the table to negate all the values, and then B updates the table to double all the values. The rust version will correctly throw if these writes happen concurrently, telling you that a new write has happened which invalidates your old one. The python version by updating to latest, allows whichever gets there the second one to win.

@ion-elgreco ion-elgreco disabled auto-merge December 20, 2023 22:04
@ion-elgreco
Copy link
Collaborator Author

@emcake that doesn't sound right indeed. I'll put it back in draft, maybe it's good that we align this and then make the necessary changes also in the writer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants