fix(python): make table version always latest before doing merge #1924
Conversation
is this something we want to do? I would assume that in actual user code there may be previous operations that took the current table state and derived some operation from it. At least I would expect that once I have loaded the table, it remains at that state until I tell it otherwise :)
@roeap if the table is not at the latest version while executing merge, you get very strange schema-related errors that don't make sense. We also do this in the writer: we update incrementally before writing. I am also not sure it would even make sense to load an older version and then merge on those files; what would the expected result be there? If you want to achieve that, I think you need to first restore to that version and then merge.
hmm .. generally speaking, all query engines that I am aware of treat the table state as a snapshot. Our DeltaTable abstraction right now is somewhere between an actual snapshot and a snapshot factory. I would have to look at the actual code paths again to see what we are doing, but generally speaking, doing "hidden" updates can be quite dangerous, since conflict resolution makes certain assumptions, which may not be relevant here. In this case I guess it's not too bad, since it happens before planning the actual merge. However, since it's literally one line, could users not just make that call themselves beforehand? Maybe @wjones127 has an opinion on that?
This reminds me a lot of the issue with #1863 that I hit. I agree with @roeap's assessment here that this could be dangerous. Would it be possible to peek at the latest table state instead, determine if loaded_version != peeked_current_version, and throw an error in that case? While I agree this is a user responsibility, the fact that @ion-elgreco hit a problem here tells me the API is still not as safe as it could be 😄
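A minimal sketch of the guard being proposed here: instead of silently refreshing, compare the loaded snapshot's version against a freshly peeked latest version and raise. The function and exception names below are illustrative, not the real deltalake API; in the library the two integers would come from the loaded table state and a peek at the log.

```python
class StaleTableError(RuntimeError):
    """Raised when the loaded snapshot is behind the latest commit."""


def ensure_latest(loaded_version: int, current_version: int) -> None:
    # Fail loudly instead of silently updating the table state, so the
    # user decides whether to reload and retry their operation.
    if loaded_version != current_version:
        raise StaleTableError(
            f"table loaded at version {loaded_version}, but latest is "
            f"{current_version}; reload the table before merging"
        )
```

With this shape, a merge would call `ensure_latest(...)` up front and surface a clear error rather than the confusing schema mismatches described above.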
@rtyler that makes sense. I don't think we have exposed something like that, but I guess this is the correct function we want to use:
I'll make the change tomorrow :)
Doing that can be really annoying if there are concurrent writers, though.
@wjones127 what shall we do here? Because for writes we do update the table incrementally and then write.
I actually came across this in another setting and was going to propose making the update in

Part of the problem here is that if you want to make serializable commits, you need a way to tie the data you're about to write back to its 'source' version for the serializability check. But by updating the table to latest, you're almost always asserting that the source of your read was the latest table.

Imagine you have a table with one column containing a series of integers. In the presence of two concurrent writers
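The concern above can be sketched with a toy optimistic-concurrency check: a commit should only succeed if nothing was written since the version it actually read from. The names and log structure here are purely illustrative, not Delta Lake's real protocol, but they show how silently "updating to latest" before committing defeats the conflict check and loses a concurrent write.

```python
def try_commit(log, read_version, new_value):
    """Optimistic commit: succeed only if no one wrote since read_version."""
    current = len(log) - 1
    if read_version != current:
        return False  # conflict: someone committed in between
    log.append(new_value)
    return True


log = [10]  # version 0 holds the value 10

# Writers A and B both read version 0 and derive a write from value 10.
a_read, b_read = 0, 0
a_value = log[a_read] + 1  # 11
b_value = log[b_read] + 1  # 11

assert try_commit(log, a_read, a_value)      # A commits -> version 1
assert not try_commit(log, b_read, b_value)  # B correctly conflicts

# But if B silently "updates to latest" first, it asserts a source
# version it never actually read from, and the stale write lands:
b_read = len(log) - 1
assert try_commit(log, b_read, b_value)
assert log[-1] == 11  # A's increment was lost; it should have been 12
```

This is why the source version for the serializability check has to be the version the data was actually derived from, not whatever happens to be latest at commit time.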
@emcake that doesn't sound right indeed. I'll put it back in draft; maybe it's good that we align on this first and then also make the necessary changes in the writer.
Description
Small, trivial change, but it enforces that the table is at the latest version before executing MERGE.
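In user-visible terms, the behavior this PR enforces can be sketched with a toy table handle: the loaded version is pinned at load time, and the merge path refreshes it before planning. `ToyTable` and its methods are illustrative stand-ins, not the real deltalake API.

```python
class ToyTable:
    """Illustrative stand-in for a Delta table handle (not the real API)."""

    def __init__(self, log):
        self._log = log                  # shared list of committed snapshots
        self.version = len(log) - 1      # version pinned at load time

    def update_incremental(self):
        """Advance the loaded state to the latest committed version."""
        self.version = len(self._log) - 1

    def merge(self):
        # The change in this PR: refresh to latest before planning the
        # merge, so stale schemas/file lists can't surface as odd errors.
        self.update_incremental()
        return self.version


log = ["v0-snapshot"]
table = ToyTable(log)
log.append("v1-snapshot")   # a concurrent writer commits version 1
assert table.version == 0   # our handle is still pinned at load time
assert table.merge() == 1   # merge now always runs against latest
```

As the discussion above notes, whether this refresh should be implicit (as here) or an explicit user call with a staleness error is the open question.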