[Feature Request][Spark] Support MERGE command with Deletion Vectors #2296

tdas · 2023-11-15T17:25:12Z

Feature request

Which Delta project/connector is this regarding?

Overview

This feature request proposes deletion vector support for the MERGE command. It is part of a wider effort to speed up DML operations with Deletion Vectors (DVs), and it builds on previous work: #1591 and #1923.

Motivation

The current implementation of merge is based on the Copy-on-Write (CoW) approach, where touched files are rewritten entirely. This includes both the modified rows and the unmodified rows. Deletion vectors, on the other hand, allow a Merge-on-Read (MoR) approach where we "soft" delete the affected rows in the touched files and only rewrite the modified rows. The "soft" deleted rows are then filtered out on read. This can result in significant performance gains during writes, at the cost of a small overhead on read, because in the most common case merge operations only touch a small portion of the data.
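To make the trade-off concrete, here is a toy model in plain Python. It is only a sketch and not how Delta Lake implements either path: a "table" is a dict of file names to row lists, a deletion vector is a set of row positions, and every name (`cow_update`, `mor_update`, `read`) is hypothetical.

```python
# Toy model: a "table" is a dict of file name -> list of (key, value) rows.
# All names here are illustrative assumptions, not Delta Lake APIs.

def cow_update(files, updates):
    """Copy-on-Write: rewrite every touched file in full."""
    new_files, rows_written = {}, 0
    for name, rows in files.items():
        if any(key in updates for key, _ in rows):
            rewritten = [(k, updates.get(k, v)) for k, v in rows]
            new_files[name + "-rewritten"] = rewritten
            rows_written += len(rewritten)
        else:
            new_files[name] = rows  # untouched file is kept as-is
    return new_files, rows_written

def mor_update(files, updates):
    """Merge-on-Read: soft-delete touched rows, write only the modified rows."""
    deletion_vectors = {}  # file name -> set of soft-deleted row positions
    modified_rows = []
    for name, rows in files.items():
        for pos, (k, _) in enumerate(rows):
            if k in updates:
                deletion_vectors.setdefault(name, set()).add(pos)
                modified_rows.append((k, updates[k]))
    new_files = dict(files)  # existing files are left in place
    if modified_rows:
        new_files["new-rows"] = modified_rows
    return new_files, deletion_vectors, len(modified_rows)

def read(files, deletion_vectors=None):
    """Read all rows, filtering out soft-deleted positions (the read overhead)."""
    dvs = deletion_vectors or {}
    return sorted(
        row
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if pos not in dvs.get(name, set())
    )

files = {
    "f1": [(i, "old") for i in range(0, 1000)],
    "f2": [(i, "old") for i in range(1000, 2000)],
}
updates = {5: "new", 1500: "new"}  # two rows changed, one in each file

cow_files, cow_written = cow_update(files, updates)
mor_files, dvs, mor_written = mor_update(files, updates)
```

In this example both files are touched, so the CoW path writes all 2000 rows while the MoR path writes 2 rows plus two small deletion vectors. Both produce the same result on read; the MoR reader pays for the extra position filter.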

Further details

The current implementation of merge consists of two jobs:

Job 1: Finds touched files by joining the source and target tables.
Job 2: Rewrites touched files.

The new implementation splits job 2 into two parts: one for writing the modified rows and one for writing the deletion vectors. Overall, merge with DVs consists of the following jobs:

Job 1: Finds touched files by joining the source and target tables.
Job 2.a: Writes modified and new rows.
Job 2.b: Writes deletion vectors for the modified rows.

From a performance point of view, the extra job adds some overhead but only operates on the touched files produced by job 1 and only shuffles the columns required by the predicates. Furthermore, jobs 2.a and 2.b perform stricter joins.
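The three jobs can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual Spark implementation: it assumes upsert semantics (update on match, insert otherwise), unique keys, and in-memory lists in place of distributed joins. The function names are hypothetical.

```python
# Toy sketch of the three-job flow. Files are lists of (key, value) rows and
# a deletion vector is a set of row positions. Names are illustrative only.

def job1_find_touched_files(target_files, source):
    """Job 1: join source and target on the key; keep files with a match."""
    source_keys = {key for key, _ in source}
    return {
        name
        for name, rows in target_files.items()
        if any(key in source_keys for key, _ in rows)
    }

def job2a_write_modified_and_new_rows(target_files, touched, source):
    """Job 2.a: write updated rows (matches) and inserted rows (non-matches).

    Only touched files are consulted: a source key absent from them cannot
    match anywhere else, because job 1 already found every file with a match.
    """
    touched_keys = {key for name in touched for key, _ in target_files[name]}
    updated = [(k, v) for k, v in source if k in touched_keys]
    inserted = [(k, v) for k, v in source if k not in touched_keys]
    return updated + inserted

def job2b_write_deletion_vectors(target_files, touched, source):
    """Job 2.b: mark the old versions of modified rows as soft-deleted."""
    source_keys = {key for key, _ in source}
    return {
        name: {
            pos
            for pos, (key, _) in enumerate(target_files[name])
            if key in source_keys
        }
        for name in touched
    }

target_files = {
    "f1": [(1, "a"), (2, "b")],
    "f2": [(3, "c"), (4, "d")],
}
source = [(2, "B"), (5, "E")]  # key 2 matches, key 5 is new

touched = job1_find_touched_files(target_files, source)
new_rows = job2a_write_modified_and_new_rows(target_files, touched, source)
dvs = job2b_write_deletion_vectors(target_files, touched, source)
```

Here only `f1` is touched, so jobs 2.a and 2.b never look at `f2`. Job 2.b needs only the join key and row position, which loosely mirrors the point about shuffling only the columns required by the predicates.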

tdas · 2024-01-04T17:44:30Z

This is being implemented by @andreaschat-db in #2428

tdas added the enhancement New feature or request label Nov 15, 2023

tdas added this to the 3.1.0 milestone Nov 15, 2023

tdas added this to Linux Foundation Delta Lake Roadmap Nov 15, 2023

tdas moved this to Todo in Linux Foundation Delta Lake Roadmap Nov 15, 2023

felipepessoto mentioned this issue Nov 17, 2023

[Feature Request] Support UPDATE command with Deletion Vectors #1923

Closed

8 tasks

felipepessoto mentioned this issue Nov 27, 2023

[Feature Request] Enable No or Low Shuffle MERGE in OSS Delta #2138

Open

8 tasks

felipepessoto mentioned this issue Jan 3, 2024

[Feature Request][Spark] Add support for Deletion Vectors to Merge #2426

Closed

8 tasks

tdas mentioned this issue Jan 4, 2024

[#2296][Spark] Add support for Deletion Vectors to Merge #2428

Closed

5 tasks

tdas moved this from Todo to In Progress in Linux Foundation Delta Lake Roadmap Jan 5, 2024

This was referenced Jan 8, 2024

[Feature][Spark] Add deletion vector metrics in Merge #2441

Open

[#2441][Spark] Add Deletion Vectors Metrics in Merge #2453

Closed

vkorukanti closed this as completed Jan 30, 2024

github-project-automation bot moved this from In Progress to Done in Linux Foundation Delta Lake Roadmap Jan 30, 2024
