Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Protocol Spec for Deletion Vectors #1372

Closed
wants to merge 4 commits into from

Conversation

larsk-db
Copy link
Contributor

@larsk-db larsk-db commented Sep 7, 2022

Description

  • This PR makes the concrete changes proposed in [Feature Request] Deletion Vectors to speed up DML operations #1367 to the Delta protocol specification. For details of what this proposal entails, see that issues.
  • In addition, this PR makes some clarification changes to the wording in the spec in various places, many of which where necessary to correctly reflect concepts introduced by the proposal (e.g., logical files, exact column stat semantics).

How was this patch tested?

N/A (document-only).

Does this PR introduce any user-facing changes?

No.

@tdas tdas self-requested a review September 7, 2022 15:11
PROTOCOL.md Outdated
The path of a file acts as the primary key for the entry in the set of files.
When an `add` action is encountered for a path that is already present in the table, statistics and other information from the latest version should replace that from any previous version.
As such, additional statistics can be added for a path already present in the table by adding it again.
Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with a Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. Deletion Vectors are an optional feature. In tables that do not have this feature enabled, all Deletion Vectors are considered to be empty.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with a Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files. Deletion Vectors are an optional feature. In tables that do not have this feature enabled, all Deletion Vectors are considered to be empty.
Every _logical file_ of the table (referred to as just a _file_ going forward) is represented by a path to a data file, combined with an optional Deletion Vector (DV) that indicates which rows of the data file are no longer in the table. The path of the data file acts as the primary key for the entry in the set of files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe its also better to define as "physical data files" as a the combination of multiple types of files that represent a single "logical data file". Then henceforth we can use "physical data files" instead of "data file + DV" all over the place. This will make the doc easier to update later if we add more physical file type in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was to introduce that "empty DV" is a legal DV, so we don't constantly have to spell out "optional DV" in all the places, but just write DV and it's implied that it may be empty (or null).

As for "physical data files", I'm not sure what this is supposed to refer to. I don't think we should treat the DV a "physical data file". It's not even always a "physical file", since it can be inline. Data files are e.g. parquet files, or anything that actually carries table data, not just protocol-level annotations like the DV (file or inline) does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "empty DV" is a "legal DV" is hardly intuitive. For a casual reader (even for me) this was not obvious. So i strongly suggest writing this better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I went through everything again and tried to be very explicit about usage of "logical file" or a specific primary key tuple where appropriate/necessary.

PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
Copy link
Contributor

@tdas tdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

@larsk-db
Copy link
Contributor Author

Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. Like, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there's isn't something we could do here to reduce the protocol support burden.

@xupefei
Copy link
Contributor

xupefei commented Sep 14, 2022

Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. Like, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there's isn't something we could do here to reduce the protocol support burden.

We currently have a floating idea to solve this issue: instead of bundling multiple features (used and unused) into a single protocol version number, we can store only used features in the log, so readers can can read the table when all features that are actually used are supported.

A rough example:

  • Now: table is on reader protocol 3, which includes Column Mapping and DV
    • Reader must support both features to read the table
  • Future: table is using only one feature DELETION_VECTOR
    • Reader can read the table if it supports reading DV while do not support Column Mapping

IMO the example above is not a good one as Column Mapping support is much easier to implement than reading DVs. Nevertheless, the idea is that, by doing so we can avoid forcing clients to support all features up to a certain protocol version, and instead give them the freedom to choose supporting features they are interested in the most.

What do you think of this idea?

@scottsand-db scottsand-db self-requested a review September 14, 2022 18:09
@larsk-db
Copy link
Contributor Author

We currently have a floating idea to solve this issue
[...]
What do you think of this idea?

That sounds great.
By floating idea you mean, this is something we could actually add relatively soon? Like, would it make sense to delay this protocol change until we can implement it using the mechanism you described above?

@xupefei
Copy link
Contributor

xupefei commented Sep 15, 2022

We currently have a floating idea to solve this issue
[...]
What do you think of this idea?

That sounds great. By floating idea you mean, this is something we could actually add relatively soon? Like, would it make sense to delay this protocol change until we can implement it using the mechanism you described above?

Yes! We're currently drafting a doc to describe it in detail. The idea is to make it play nice with existing protocol behaviors, thus fewer bugs and faster implementation.

When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀

@larsk-db
Copy link
Contributor Author

When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀

Over the next month or so would be good.
I'm fine with waiting for the feature thingy and punting on adding version info to this PR, if it's alright with @tdas and @scottsand-db?

PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated
Field Name | Data Type | Description
-|-|-
storageType | String | A single character to indicate how to access the DV. (See below.)
pathOrInlineDv | String | Three format options are currently proposed:<ul><li>If `storageType = 'u'` then `<random prefix - optional><base85 encoded uuid>`: The deletion vector is stored in a file with a path relative to the data directory of this Delta table, and the file name can be reconstructed from the UUID. See Derived Fields for how to reconstruct the file name. The random prefix is recovered as the extra characters before the (20 characters fixed length) uuid.</li><li>If `storageType = 'i'` then `<base85 encoded bytes>`: The deletion vector is stored inline in the log. The format used is the `RoaringBitmapArray` format also used when the DV is stored on disk and described in [Deletion Vector Format](#Deletion-Vector-Format).</li><li>If `storageType = 'p'` then `<absolute path>`: The DV is stored in a file with an absolute path given by this path, which has the same format as the `path` field in the `add`/`remove` actions.</li></ul>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt the convention [<random prefix>]<base85 encoded uuid> to show that random prefix is optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be. I didn't want anyone to think that the [...] are literals, so this seemed safer. (Of course, the same could be said for <...>...I had to pick some syntax :D)

@larsk-db
Copy link
Contributor Author

I added a reader version now, so we have something there, but my intention is still to hold this until @xupefei introduces the feature thingy discussed above.

@@ -95,6 +95,7 @@ Here is an example of a Delta table with three entries in the commit log, stored
/mytable/_delta_log/_last_checkpoint
/mytable/_change_data/cdc-00000-924d9ac7-21a9-4121-b067-a0a6517aa8ed.c000.snappy.parquet
/mytable/part-00000-3935a07c-416b-4344-ad97-2a38342ee2fc.c000.snappy.parquet
/mytable/deletion_vector-0c6cbaaf-5e04-4c9d-8959-1088814f58ef.bin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are deletion vectors are always in the root directory? they are not in some subdirectory like _change_data ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the proposal, yes. They are essentially required to read the data files, so imo it makes sense to store them alongside the data files. Except not taking partition hierarchy into account due to the "many DVs per file"-thingy.

@scottsand-db scottsand-db requested a review from tdas October 13, 2022 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants