[Feature Request][Spark] Add support for Deletion Vectors to Merge #2426

andreaschat-db · 2024-01-03T10:07:17Z

Feature request

Which Delta project/connector is this regarding?

Overview

This FR is about providing deletion vector support in Merge. It is part of a wider effort to speed up DML operations with Deletion Vectors (DVs). It builds on top of previous work: #1591 and #1923.

Motivation

The current implementation of merge is based on the Copy-on-Write (CoW) approach, where touched files are rewritten entirely. This includes both the modified rows and the unmodified rows. Deletion vectors, on the other hand, allow a Merge-on-Read (MoR) approach where we "soft" delete the affected rows in the touched files and only rewrite the modified rows. The "soft" deleted rows are then filtered out on read. This can result in significant performance gains during writes at the cost of a small overhead on read, because in the most common case merge operations only touch a small portion of the data.
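To make the write-side difference concrete, here is a toy cost model in plain Python. It is only a sketch and not Delta Lake code: the file names, row counts, and function names are hypothetical, and it counts rows written while ignoring the (small) cost of writing the deletion vectors themselves.

```python
# Toy cost model (assumption: not Delta Lake code, all names are hypothetical).
# It counts how many data rows each strategy has to write for one merge.

def rows_written_cow(file_sizes, touched_rows_per_file):
    """Copy-on-Write: every file with at least one touched row is rewritten in full."""
    return sum(
        file_sizes[name]
        for name, touched in touched_rows_per_file.items()
        if touched > 0
    )


def rows_written_mor(touched_rows_per_file):
    """Merge-on-Read: only the modified rows are rewritten; the old copies
    are soft-deleted through a deletion vector instead of a file rewrite."""
    return sum(touched_rows_per_file.values())


# Hypothetical table: three files of 1,000,000 rows each.
# The merge modifies 10 rows in each of two files and leaves the third untouched.
file_sizes = {"f1": 1_000_000, "f2": 1_000_000, "f3": 1_000_000}
touched = {"f1": 10, "f2": 10, "f3": 0}

cow = rows_written_cow(file_sizes, touched)
mor = rows_written_mor(touched)
print(cow, mor)  # prints "2000000 20"
```

In this made-up scenario CoW rewrites two full files while MoR writes 20 rows, which is the gap the proposal aims to exploit when merges touch only a small fraction of each file.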

Further details

The current implementation of merge consists of 2 jobs:

Job 1: Finds touched files by joining the source and target tables.
Job 2: Rewrites touched files.

The new implementation splits job 2 into two parts: one for writing the modified rows and one for writing the deletion vectors. Overall, merge with DVs consists of the following jobs:

Job 1: Finds touched files by joining the source and target tables.
Job 2.a: Writes modified and new rows.
Job 2.b: Writes deletion vectors for the modified rows.

From a performance point of view, the extra job adds some overhead but only operates on the touched files produced by job 1 and only shuffles the columns required by the predicates. Furthermore, jobs 2.a and 2.b perform stricter joins.
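The three jobs described above can be sketched over in-memory data. This is a simplified illustration in plain Python, not the Spark implementation: the file layout, the upsert-only merge semantics, and all function names are assumptions made for the example, and a deletion vector is modelled as a set of row positions per file.

```python
from collections import defaultdict

# Hypothetical target table: file name -> list of (key, value) rows.
target_files = {
    "part-0": [(1, "a"), (2, "b"), (3, "c")],
    "part-1": [(4, "d"), (5, "e")],
    "part-2": [(6, "f")],
}
# Hypothetical source of upserts: key -> new value.
source = {2: "B", 5: "E", 7: "G"}


def find_touched_files(files, src):
    """Job 1: join source and target on key; keep files with at least one match."""
    return {
        name for name, rows in files.items()
        if any(key in src for key, _ in rows)
    }


def write_modified_and_new_rows(files, touched, src):
    """Job 2.a: join the source against the touched files only, then emit
    the updated rows (matched keys) and the inserted rows (unmatched keys)."""
    matched = {key for name in touched for key, _ in files[name] if key in src}
    updated = [(key, src[key]) for key in sorted(matched)]
    inserted = [(key, val) for key, val in sorted(src.items()) if key not in matched]
    return updated + inserted


def write_deletion_vectors(files, touched, src):
    """Job 2.b: record the positions of the matched rows in each touched file."""
    dvs = defaultdict(set)
    for name in touched:
        for pos, (key, _) in enumerate(files[name]):
            if key in src:
                dvs[name].add(pos)
    return dict(dvs)


def read_table(files, dvs):
    """Read path: skip every row position that is marked in a deletion vector."""
    return sorted(
        row
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if pos not in dvs.get(name, set())
    )


touched = find_touched_files(target_files, source)
new_file = write_modified_and_new_rows(target_files, touched, source)
dvs = write_deletion_vectors(target_files, touched, source)
result = read_table({**target_files, "part-3": new_file}, dvs)
print(result)
```

Note that the existing files are never rewritten in this sketch: the merge only adds one new data file and one deletion vector per touched file, and the untouched file `part-2` is not read by jobs 2.a or 2.b at all.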

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
No. I cannot contribute this feature at this time.

felipepessoto · 2024-01-03T19:49:38Z

Is this duplicate of #2296?

andreaschat-db · 2024-01-04T09:10:15Z

> Is this duplicate of #2296?

Yes, it is. Thanks for pointing that out! I will fix it.

tdas · 2024-01-04T14:48:17Z

Since the previous one exists... @andreaschat-db could you please use that one (and close this one)?
In general, it is a good idea to check whether a similar issue exists or not before creating a new one.

andreaschat-db · 2024-01-04T15:29:10Z

> Since the previous one exists... @andreaschat-db could you please use that one (and close this one)? In general, it is a good idea to check whether a similar issue exists or not before creating a new one.

@tdas In that case, please update the description of the original ticket and I will close this one.

tdas · 2024-01-04T17:43:03Z

I updated the description in that one and commented that you are implementing this. You can close this now.

andreaschat-db · 2024-01-04T18:28:25Z

Closing as a duplicate of #2296.

andreaschat-db added the enhancement label Jan 3, 2024

andreaschat-db changed the title ~~[Feature Request][Spark] Deletion Vectors support in Merge~~ [Feature Request][Spark] Add support for Deletion Vectors to Merge Jan 3, 2024

andreaschat-db closed this as completed Jan 4, 2024
