fix(python): make table version always latest before doing merge #1924

ion-elgreco · 2023-11-30T08:46:23Z

Description

Small, trivial change, but it enforces that the table is always at the latest version before executing MERGE.

roeap · 2023-11-30T12:22:10Z

is this something we want to do?

I would assume that in actual user code, there may be some previous operations that took the current table state and derived some operation from it. At least I would expect that once I have loaded the table, it remains at that state until I tell it otherwise :).

ion-elgreco · 2023-11-30T12:26:05Z

@roeap if it's not the latest version while executing merge, you will get very strange errors regarding the schema which don't make sense. Also, we are doing this in the writer as well: we update incrementally before writing.

Also, I am not sure it would even make sense to load an older version and then merge on those files; what would the expected result be there? If you want to achieve this, I think you need to restore to that version first and then merge.

roeap · 2023-11-30T12:33:56Z

hmm .. generally speaking all query engines that I am aware of treat the table state as a snapshot. Our DeltaTable abstraction right now is somewhere between an actual snapshot and a snapshot factory.

I would have to look at the actual code paths again to see what we are doing, but generally speaking, doing "hidden" updates can be quite dangerous, since the conflict resolution makes certain assumptions, which may not be relevant here. In this case I guess it's not too bad, since it happens before planning the actual merge.

However, since it's literally one line, could users not just make that call themselves beforehand?

Maybe @wjones127 has an opinion on that?

rtyler · 2023-12-02T17:57:13Z

This reminds me a lot of the issue with #1863 that I hit. I agree with @roeap's assessment here that this could be dangerous. Would it be possible to peek at the latest table state instead, determine whether loaded_version != peeked_current_version, and throw an error in that case?

While I agree this is a user responsibility, the fact that @ion-elgreco hit a problem here tells me the API is still not as safe as it could be 😄
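The "peek and throw" idea above can be sketched with a toy in-memory model. Everything in it (`ToyLog`, `ToyTable`, `StaleTableError`, `peek_latest_version`, `merge`) is hypothetical and is not the deltalake API; it only illustrates failing loudly when the loaded version lags the log, rather than silently updating the table.

```python
from dataclasses import dataclass, field


class StaleTableError(RuntimeError):
    """Raised when the loaded version is behind the latest committed version."""


@dataclass
class ToyLog:
    """Hypothetical stand-in for a commit log; each entry is one commit."""
    commits: list = field(default_factory=lambda: ["create table"])

    def latest_version(self) -> int:
        return len(self.commits) - 1


@dataclass
class ToyTable:
    """Hypothetical table handle pinned to the version it was loaded at."""
    log: ToyLog
    loaded_version: int

    def peek_latest_version(self) -> int:
        # Look at the log without changing the loaded state.
        return self.log.latest_version()

    def merge(self, description: str) -> int:
        latest = self.peek_latest_version()
        if self.loaded_version != latest:
            # Refuse to plan the merge against a stale snapshot.
            raise StaleTableError(
                f"loaded version {self.loaded_version} != latest version {latest}; "
                "update the table explicitly before merging"
            )
        self.log.commits.append(description)
        self.loaded_version = self.log.latest_version()
        return self.loaded_version


log = ToyLog()
table = ToyTable(log, loaded_version=log.latest_version())

# Another writer commits after we loaded the table, so our handle is stale.
log.commits.append("append by another writer")

try:
    table.merge("merge from stale handle")
    outcome = "merged"
except StaleTableError:
    outcome = "stale"

# After an explicit, user-visible refresh the merge goes through.
table.loaded_version = table.peek_latest_version()
merged_version = table.merge("merge after explicit update")
```

In this sketch the refresh is a decision the caller makes in plain sight, which keeps the "loaded table stays at its state" expectation intact while still surfacing the stale-version problem early.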

ion-elgreco · 2023-12-02T18:09:54Z

@rtyler that makes sense. I don't think we have exposed something like that, but I guess this is the correct function we want to use:

delta-rs/crates/deltalake-core/src/table/mod.rs, line 456 in 18c4834: pub async fn peek_next_commit(

I'll make the change tomorrow :)

wjones127 · 2023-12-05T03:24:27Z

> Would it be possible to peek at the latest table state instead and determine if loaded_version != peeked_current_version and throw an error in that case?

Doing that can be really annoying if there are concurrent writers, though.

ion-elgreco · 2023-12-17T15:09:21Z

@wjones127 what shall we do here? Because in the writes we do update the table incrementally and then write

emcake · 2023-12-20T16:49:17Z

> @wjones127 what shall we do here? Because in the writes we do update the table incrementally and then write

I actually came across this in another setting and was going to propose making the update in write_deltalake optional.

Part of the problem here is if you want to make serializable commits, you need a way to tie the data you're about to write back to its 'source' version for the serializability check. But by updating the table to latest, you're almost always asserting that the source of your read was the latest table.

Imagine you have a table with one column containing a series of integers. In the presence of two concurrent writers A and B, A might update the table to negate all the values while B updates the table to double all the values. The Rust version will correctly throw if these writes happen concurrently, telling you that a new write has happened which invalidates your old one. The Python version, by updating to latest, allows whichever writer gets there second to win.
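The negate/double scenario can be modelled with a small toy, assuming a deliberately simplified conflict rule (any commit made against an older version is rejected). `ToyTable`, `CommitConflict`, `commit`, and `commit_after_update` are hypothetical names for illustration; this is not the real conflict checker or the deltalake API.

```python
class CommitConflict(RuntimeError):
    """Raised when a commit is based on a version that is no longer the latest."""


class ToyTable:
    """Hypothetical single-column table with a version counter."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 0

    def read(self):
        # Return the version together with the data so a later commit
        # can be tied back to the snapshot it was derived from.
        return self.version, list(self.values)

    def commit(self, new_values, read_version):
        # Simplified serializability check: the write must be based on the
        # latest version of the table.
        if read_version != self.version:
            raise CommitConflict(
                f"write derived from version {read_version}, table is at {self.version}"
            )
        self.values = list(new_values)
        self.version += 1
        return self.version

    def commit_after_update(self, new_values):
        # Models "update to latest, then write": the writer claims its data
        # came from the latest version, so the check can never fail.
        return self.commit(new_values, read_version=self.version)


# Strict behaviour: both writers read version 0, then commit in turn.
strict = ToyTable([1, 2, 3])
a_version, a_data = strict.read()
b_version, b_data = strict.read()
strict.commit([-x for x in a_data], read_version=a_version)      # A negates
try:
    strict.commit([2 * x for x in b_data], read_version=b_version)  # B doubles
    strict_outcome = "committed"
except CommitConflict:
    strict_outcome = "conflict"

# Lenient behaviour: same reads, but each writer updates to latest before writing.
lenient = ToyTable([1, 2, 3])
_, a_data = lenient.read()
_, b_data = lenient.read()
lenient.commit_after_update([-x for x in a_data])
lenient.commit_after_update([2 * x for x in b_data])  # silently overwrites A's work
```

In the lenient run the table ends up at `[2, 4, 6]` and A's negation is lost; applying the two updates one after the other would have produced `[-2, -4, -6]`. The strict run keeps A's result and tells B that its write was derived from outdated data.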

ion-elgreco · 2023-12-26T09:44:08Z

@emcake that doesn't sound right indeed. I'll put it back in draft; maybe it's good that we align on this and then make the necessary changes in the writer as well.

make tbl version always latest before doing merge

5f31c6c

ion-elgreco requested review from wjones127, fvaleye and roeap as code owners November 30, 2023 08:46

github-actions bot added the binding/python Issues for the Python package label Nov 30, 2023

ion-elgreco enabled auto-merge (squash) November 30, 2023 09:41

Merge branch 'main' into fix/merge_update_incremental

5dfbcf8

Merge branch 'main' into fix/merge_update_incremental

714e887

ion-elgreco disabled auto-merge December 20, 2023 22:04

Merge branch 'main' into fix/merge_update_incremental

901a069

ion-elgreco closed this Dec 26, 2023

ion-elgreco reopened this Dec 26, 2023

ion-elgreco closed this Dec 28, 2023
