[Spark] Support predicate pushdown in scans with DVs #2982

andreaschat-db
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Currently, when Deletion Vectors are enabled, we disable predicate pushdown and file splitting in scans. This is because we rely on a custom row index column that is constructed in the executors and cannot handle splits and predicates. These restrictions can now be lifted by relying instead on _metadata.row_index, which was exposed recently after the relevant work was concluded.

Overall, this PR adds predicate pushdown and splits support as follows:

  1. Replaces __delta_internal_is_row_deleted with _metadata.row_index.
  2. Adds a new implementation of __delta_internal_is_row_deleted that is based on _metadata.row_index.
  3. The IsRowDeleted filter is now non-deterministic to allow predicate pushdown.

Furthermore, it includes previous related work to remove the UDF from the IsRowDeleted filter.
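
To illustrate why a row index supplied by the file reader tolerates splits and pushed-down predicates while a counter built in the executor does not, here is a minimal pure-Python sketch. It is not the Delta or Spark implementation: `read_split`, `drop_deleted_by_counter`, and `drop_deleted_by_row_index` are hypothetical names, and a plain Python set stands in for the deletion vector.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# (physical row index within the file, value); the index plays the role
# of _metadata.row_index in this toy model.
Row = Tuple[int, str]


def read_split(file_rows: List[str], start: int, end: int) -> Iterator[Row]:
    """Hypothetical reader: yield the rows of one split with their physical index."""
    for idx in range(start, end):
        yield idx, file_rows[idx]


def drop_deleted_by_counter(values: Iterable[str], deleted: Set[int]) -> List[str]:
    """Counter-based approach: number rows as they arrive in the executor.

    Only correct when the whole file is read in one piece and nothing
    was filtered out before the counter runs.
    """
    return [v for pos, v in enumerate(values) if pos not in deleted]


def drop_deleted_by_row_index(rows: Iterable[Row], deleted: Set[int]) -> List[str]:
    """Reader-index approach: use the physical index that travels with each row."""
    return [v for idx, v in rows if idx not in deleted]


file_rows = ["a", "b", "c", "d", "e", "f"]
deleted = {1, 3}  # stand-in for a deletion vector marking "b" and "d"

# Case 1: the file is split and this task reads only rows 3..5.
split = list(read_split(file_rows, 3, 6))
split_by_counter = drop_deleted_by_counter([v for _, v in split], deleted)
split_by_index = drop_deleted_by_row_index(split, deleted)

# Case 2: a predicate (value != "a") is pushed into the reader, so the
# first row never reaches the deletion filter.
pushed = [(i, v) for i, v in read_split(file_rows, 0, 6) if v != "a"]
pushed_by_counter = drop_deleted_by_counter([v for _, v in pushed], deleted)
pushed_by_index = drop_deleted_by_row_index(pushed, deleted)

print(split_by_counter, split_by_index)    # ['d', 'f'] ['e', 'f']
print(pushed_by_counter, pushed_by_index)  # ['b', 'd', 'f'] ['c', 'e', 'f']
```

In both cases the counter drifts away from the physical row positions and removes the wrong rows, whereas the index carried with each row keeps matching the deletion vector.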

How was this patch tested?

Added new suites.

Does this PR introduce any user-facing changes?

No.

flush

First sane version without isRowDeleted

Hack RowIndexMarkingFilters

Add support for non-vectorized readers

Metadata column fix

Avoid non-deterministic UDF to filter deleted rows

metadata with Expression ID

Fix complex views issue

Tests

cleaning

More tests and fixes

Partial cleaning

cleaning and improvements

cleaning and improvements

Clean RowIndexFilter

Clean DeltaParquetFileFormat

Improve DeletionVectorsSuite

Disable DeltaParquetFileFormatSuite for predicate pushdown.
scottsand-db merged commit 9052462 into delta-io:branch-3.2 on Apr 26, 2024
7 of 8 checks passed