Update Protocol Spec for Deletion Vectors #1372

larsk-db · 2022-09-07T14:34:10Z

Description

This PR makes the concrete changes to the Delta protocol specification that were proposed in [Feature Request] Deletion Vectors to speed up DML operations #1367. For details of what this proposal entails, see that issue.
In addition, this PR clarifies the wording of the spec in various places. Many of these clarifications were necessary to correctly reflect concepts introduced by the proposal (e.g., logical files, exact column stat semantics).

How was this patch tested?

N/A (document-only).

Does this PR introduce any user-facing changes?

No.

tdas · 2022-09-12T17:11:10Z

PROTOCOL.md

-The path of a file acts as the primary key for the entry in the set of files.
-When an `add` action is encountered for a path that is already present in the table, statistics and other information from the latest version should replace that from any previous version.
-As such, additional statistics can be added for a path already present in the table by adding it again.
+Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with a Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. Deletion Vectors are an optional feature. In tables that do not have this feature enabled, all Deletion Vectors are considered to be empty.


Suggested change

Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with a Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. Deletion Vectors are an optional feature. In tables that do not have this feature enabled, all Deletion Vectors are considered to be empty.

Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with an optional Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files.

Maybe it's also better to define "physical data files" as the combination of multiple types of files that represent a single "logical data file". Then henceforth we can use "physical data files" instead of "data file + DV" all over the place. This will make the doc easier to update later if we add more physical file types in the future.

The idea here was to introduce that "empty DV" is a legal DV, so we don't constantly have to spell out "optional DV" in all the places, but just write DV and it's implied that it may be empty (or null).

As for "physical data files", I'm not sure what this is supposed to refer to. I don't think we should treat the DV as a "physical data file". It's not even always a "physical file", since it can be inline. Data files are e.g. Parquet files, or anything that actually carries table data, not just protocol-level annotations like the DV (file or inline) does.

The idea that an "empty DV" is a "legal DV" is hardly intuitive. For a casual reader (even for me) this was not obvious. So I strongly suggest writing this better.

Ok, I went through everything again and tried to be very explicit about usage of "logical file" or a specific primary key tuple where appropriate/necessary.
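
For readers following the thread, here is a minimal Python sketch of the "logical file" idea under discussion: a data file path combined with a DV, where a missing DV behaves like an empty one. The class and field names are hypothetical and not taken from the spec, and the DV is modelled as a plain set of row indexes rather than the actual on-disk format.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class LogicalFile:
    """Hypothetical model: a data file path plus the rows deleted from it."""
    path: str
    # An absent DV is modelled as an empty set, i.e. no rows are deleted.
    deleted_rows: FrozenSet[int] = frozenset()

    def live_rows(self, rows: List[str]) -> List[str]:
        """Return only the rows of the data file that are still in the table."""
        return [row for index, row in enumerate(rows) if index not in self.deleted_rows]


# The same data file, once without a DV and once with rows 1 and 3 deleted.
rows = ["a", "b", "c", "d"]
without_dv = LogicalFile("part-00000.parquet")
with_dv = LogicalFile("part-00000.parquet", frozenset({1, 3}))
```

In this sketch the two objects share a path but are different logical files, which is why the wording around "primary key" needs to be explicit about whether the DV is part of the key.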

PROTOCOL.md

tdas

Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

larsk-db · 2022-09-13T10:57:15Z

> Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. For example, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there isn't something we could do here to reduce the protocol support burden.

xupefei · 2022-09-14T09:06:06Z

> > Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.
>
> About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. For example, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there isn't something we could do here to reduce the protocol support burden.

We currently have a floating idea to solve this issue: instead of bundling multiple features (used and unused) into a single protocol version number, we can store only the features that are actually used in the log, so a reader can read the table as long as it supports all of those features.

A rough example:

Now: the table is on reader protocol 3, which includes Column Mapping and DV
- A reader must support both features to read the table
Future: the table uses only one feature, DELETION_VECTOR
- A reader can read the table if it supports reading DVs, even if it does not support Column Mapping

IMO the example above is not a good one, as Column Mapping support is much easier to implement than reading DVs. Nevertheless, the idea is that by doing so we can avoid forcing clients to support all features up to a certain protocol version, and instead give them the freedom to support the features they are most interested in.
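
The difference between the two schemes can be sketched in a few lines of Python. This is only an illustration under assumptions: the version-to-feature mapping, the `COLUMN_MAPPING` name, and both function names are hypothetical; only `DELETION_VECTOR` and "reader protocol 3" come from the example above.

```python
from typing import Dict, Set

# Hypothetical mapping of reader protocol versions to the features they add.
FEATURES_BY_READER_VERSION: Dict[int, Set[str]] = {
    2: {"COLUMN_MAPPING"},
    3: {"DELETION_VECTOR"},
}


def can_read_by_version(supported: Set[str], table_reader_version: int) -> bool:
    """Version scheme: every feature up to the table's version is required."""
    required: Set[str] = set()
    for version, features in FEATURES_BY_READER_VERSION.items():
        if version <= table_reader_version:
            required |= features
    return required <= supported


def can_read_by_features(supported: Set[str], used_by_table: Set[str]) -> bool:
    """Feature scheme: only the features the table actually uses are required."""
    return used_by_table <= supported


# A reader that implements DVs but not Column Mapping.
dv_only_reader = {"DELETION_VECTOR"}
```

With these definitions, `dv_only_reader` is rejected for a table on reader protocol 3, but accepted for a table whose only listed feature is `DELETION_VECTOR`.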

What do you think of this idea?

larsk-db · 2022-09-15T11:32:43Z

> We currently have a floating idea to solve this issue
> [...]
> What do you think of this idea?

That sounds great.
By floating idea you mean, this is something we could actually add relatively soon? Like, would it make sense to delay this protocol change until we can implement it using the mechanism you described above?

xupefei · 2022-09-15T11:53:52Z

> > We currently have a floating idea to solve this issue
> > [...]
> > What do you think of this idea?
>
> That sounds great. By floating idea you mean, this is something we could actually add relatively soon? Like, would it make sense to delay this protocol change until we can implement it using the mechanism you described above?

Yes! We're currently drafting a doc to describe it in detail. The idea is to make it play nice with existing protocol behaviors, thus fewer bugs and faster implementation.

When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀

larsk-db · 2022-09-15T12:01:55Z

> When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀

Over the next month or so would be good.
I'm fine with waiting for the feature thingy and punting on adding version info to this PR, if it's alright with @tdas and @scottsand-db?

PROTOCOL.md

scottsand-db · 2022-09-15T23:00:41Z

PROTOCOL.md

+Field Name | Data Type | Description
+-|-|-
+storageType | String | A single character to indicate how to access the DV. (See below.)
+pathOrInlineDv | String | Three format options are currently proposed:<ul><li>If `storageType = 'u'` then  `<random prefix - optional><base85 encoded uuid>`: The deletion vector is stored in a file with a path relative to the data directory of this Delta table, and the  file name can be reconstructed from the UUID. See Derived Fields for how to reconstruct the file name. The random prefix is recovered as the extra characters before the (20 characters fixed length) uuid.</li><li>If `storageType = 'i'` then `<base85 encoded bytes>`: The deletion vector is stored inline in the log. The format used is the `RoaringBitmapArray` format also used when the DV is stored on disk and described in [Deletion Vector Format](#Deletion-Vector-Format).</li><li>If `storageType = 'p'` then `<absolute path>`: The DV is stored in a file with an absolute path given by this path, which has the same format as the `path` field in the `add`/`remove` actions.</li></ul>


Isn't the convention `[<random prefix>]<base85 encoded uuid>` to show that the random prefix is optional?

It could be. I didn't want anyone to think that the [...] are literals, so this seemed safer. (Of course, the same could be said for <...>...I had to pick some syntax :D)
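
To make the three `storageType` cases in the table above concrete, here is a hedged Python sketch of how a reader might split the `pathOrInlineDv` string. The function and class names are hypothetical; the sketch deliberately does not decode the base85 payload or reconstruct the DV file name, since those steps are defined elsewhere in the spec (Derived Fields and Deletion Vector Format).

```python
from dataclasses import dataclass
from typing import Optional

# The table states that the encoded UUID has a fixed length of 20 characters.
ENCODED_UUID_LENGTH = 20


@dataclass(frozen=True)
class DvLocation:
    """Hypothetical container for the parsed descriptor fields."""
    kind: str  # "relative-uuid", "inline", or "absolute"
    random_prefix: Optional[str] = None
    encoded_uuid: Optional[str] = None
    inline_payload: Optional[str] = None
    absolute_path: Optional[str] = None


def parse_dv_descriptor(storage_type: str, path_or_inline_dv: str) -> DvLocation:
    if storage_type == "u":
        if len(path_or_inline_dv) < ENCODED_UUID_LENGTH:
            raise ValueError("value is shorter than the encoded UUID")
        # Everything before the last 20 characters is the optional random prefix.
        split = len(path_or_inline_dv) - ENCODED_UUID_LENGTH
        return DvLocation(
            kind="relative-uuid",
            random_prefix=path_or_inline_dv[:split],
            encoded_uuid=path_or_inline_dv[split:],
        )
    if storage_type == "i":
        # Base85-encoded bytes of the DV itself, stored inline in the log.
        return DvLocation(kind="inline", inline_payload=path_or_inline_dv)
    if storage_type == "p":
        return DvLocation(kind="absolute", absolute_path=path_or_inline_dv)
    raise ValueError(f"unknown storageType: {storage_type!r}")
```

Because the prefix is recovered purely by length, the `<random prefix - optional>` and `[<random prefix>]` notations describe the same parse; an empty prefix simply yields an empty string here.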

larsk-db · 2022-10-10T12:26:54Z

I added a reader version now, so we have something there, but my intention is still to hold this until @xupefei introduces the feature thingy discussed above.

tdas · 2022-10-12T00:24:32Z

PROTOCOL.md

@@ -95,6 +95,7 @@ Here is an example of a Delta table with three entries in the commit log, stored
 /mytable/_delta_log/_last_checkpoint
 /mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet
 /mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet
+/mytable/deletion_vector-0c6cbaaf-5e04-4c9d-8959-1088814f58ef.bin


Are deletion vectors always in the root directory? They are not in some subdirectory like `_change_data`?

That is the proposal, yes. They are essentially required to read the data files, so IMO it makes sense to store them alongside the data files. The one difference is that they don't take the partition hierarchy into account, due to the "many DVs per file" thingy.

Add DV changes to the Delta Protocol

37186cc

tdas self-requested a review September 7, 2022 15:11

tdas reviewed Sep 12, 2022