-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Protocol Spec for Deletion Vectors #1372
Conversation
PROTOCOL.md
Outdated
The path of a file acts as the primary key for the entry in the set of files. | ||
When an `add` action is encountered for a path that is already present in the table, statistics and other information from the latest version should replace that from any previous version. | ||
As such, additional statistics can be added for a path already present in the table by adding it again. | ||
Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with a Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. Deletion Vectors are an optional feature. In tables that do not have this feature enabled, all Deletion Vectors are considered to be empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with a Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. Deletion Vectors are an optional feature. In tables that do not have this feature enabled, all Deletion Vectors are considered to be empty. | |
Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with an optional Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe its also better to define as "physical data files" as a the combination of multiple types of files that represent a single "logical data file". Then henceforth we can use "physical data files" instead of "data file + DV" all over the place. This will make the doc easier to update later if we add more physical file type in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here was to introduce that "empty DV" is a legal DV, so we don't constantly have to spell out "optional DV" in all the places, but just write DV and it's implied that it may be empty (or null).
As for "physical data files", I'm not sure what this is supposed to refer to. I don't think we should treat the DV a "physical data file". It's not even always a "physical file", since it can be inline. Data files are e.g. parquet files, or anything that actually carries table data, not just protocol-level annotations like the DV (file or inline) does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "empty DV" is a "legal DV" is hardly intuitive. For a casual reader (even for me) this was not obvious. So i strongly suggest writing this better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I went through everything again and tried to be very explicit about usage of "logical file" or a specific primary key tuple where appropriate/necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.
About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. Like, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there's isn't something we could do here to reduce the protocol support burden. |
We currently have a floating idea to solve this issue: instead of bundling multiple features (used and unused) into a single protocol version number, we can store only used features in the log, so readers can can read the table when all features that are actually used are supported. A rough example:
IMO the example above is not a good one as Column Mapping support is much easier to implement than reading DVs. Nevertheless, the idea is that, by doing so we can avoid forcing clients to support all features up to a certain protocol version, and instead give them the freedom to choose supporting features they are interested in the most. What do you think of this idea? |
That sounds great. |
Yes! We're currently drafting a doc to describe it in detail. The idea is to make it play nice with existing protocol behaviors, thus fewer bugs and faster implementation. When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀 |
Over the next month or so would be good. |
PROTOCOL.md
Outdated
Field Name | Data Type | Description | ||
-|-|- | ||
storageType | String | A single character to indicate how to access the DV. (See below.) | ||
pathOrInlineDv | String | Three format options are currently proposed:<ul><li>If `storageType = 'u'` then `<random prefix - optional><base85 encoded uuid>`: The deletion vector is stored in a file with a path relative to the data directory of this Delta table, and the file name can be reconstructed from the UUID. See Derived Fields for how to reconstruct the file name. The random prefix is recovered as the extra characters before the (20 characters fixed length) uuid.</li><li>If `storageType = 'i'` then `<base85 encoded bytes>`: The deletion vector is stored inline in the log. The format used is the `RoaringBitmapArray` format also used when the DV is stored on disk and described in [Deletion Vector Format](#Deletion-Vector-Format).</li><li>If `storageType = 'p'` then `<absolute path>`: The DV is stored in a file with an absolute path given by this path, which has the same format as the `path` field in the `add`/`remove` actions.</li></ul> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnt the convention [<random prefix>]<base85 encoded uuid>
to show that random prefix is optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be. I didn't want anyone to think that the [...]
are literals, so this seemed safer. (Of course, the same could be said for <...>
...I had to pick some syntax :D)
I added a reader version now, so we have something there, but my intention is still to hold this until @xupefei introduces the feature thingy discussed above. |
@@ -95,6 +95,7 @@ Here is an example of a Delta table with three entries in the commit log, stored | |||
/mytable/_delta_log/_last_checkpoint | |||
/mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet | |||
/mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet | |||
/mytable/deletion_vector-0c6cbaaf-5e04-4c9d-8959-1088814f58ef.bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are deletion vectors are always in the root directory? they are not in some subdirectory like _change_data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the proposal, yes. They are essentially required to read the data files, so imo it makes sense to store them alongside the data files. Except not taking partition hierarchy into account due to the "many DVs per file"-thingy.
Description
How was this patch tested?
N/A (document-only).
Does this PR introduce any user-facing changes?
No.