Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scan Delete Support Part 3: ArrowReader::build_deletes_row_selection implementation #951

Merged

Conversation

sdd
Copy link
Contributor

@sdd sdd commented Feb 8, 2025

Third part of delete file read support. See #630

**Builds on top of #950

build_deletes_row_selection computes a RowSelection from a RoaringTreemap representing the indexes of rows in a data file that have been marked as deleted by positional delete files that apply to the data file being read (and, in the future, delete vectors).

The resulting RowSelection will be merged with a RowSelection resulting from the scan's filter predicate (if present) and supplied to the ParquetRecordBatchStreamBuilder so that deleted rows are omitted from the RecordBatchStream returned by the reader.

NB: I encountered quite a few edge cases in this method and the logic is quite complex. There is a good chance that a keen-eyed reviewer would be able to conceive of an edge-case that I haven't covered.

@sdd sdd changed the title Feat/build deletes row selection implementation ArrowReader::build_deletes_row_selection implementation Feb 8, 2025
@sdd sdd force-pushed the feat/build-deletes-row-selection-implementation branch 3 times, most recently from f4b6d94 to a52fe50 Compare February 9, 2025 12:17
@sdd sdd force-pushed the feat/build-deletes-row-selection-implementation branch 2 times, most recently from 2fc5f70 to 26dc78f Compare February 10, 2025 09:00
@sdd sdd marked this pull request as ready for review February 19, 2025 06:28
@sdd sdd changed the title ArrowReader::build_deletes_row_selection implementation Scan Delete Support Part 3: ArrowReader::build_deletes_row_selection implementation Feb 21, 2025
Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, need more eyes, might've missed something

@sdd sdd force-pushed the feat/build-deletes-row-selection-implementation branch 4 times, most recently from 2029d21 to 79b2162 Compare March 5, 2025 19:42
liurenjie1024 pushed a commit that referenced this pull request Mar 20, 2025
…se in `ArrowReader` (#950)

Second part of delete file read support. See
#630.

This PR provides the basis for delete file support within `ArrowReader`.

`DeleteFileManager` is introduced, in skeleton form. Full implementation
of its behaviour will be submitted in follow-up PRs.

`DeleteFileManager` is responsible for loading and parsing positional
and equality delete files from `FileIO`. Once delete files for a task
have been loaded and parsed, `ArrowReader::process_file_scan_task` uses
the resulting `DeleteFileManager` in two places:

* `DeleteFileManager::get_delete_vector_for_task` is passed a data file
path and will return an ~`Option<Vec<usize>>`~ `Option<RoaringTreeMap>`
containing the indices of all rows that are positionally deleted in that
data file (or `None` if there are none)
* `DeleteFileManager::build_delete_predicate` is invoked with the schema
from the file scan task. It will return an `Option<BoundPredicate>`
representing the filter predicate derived from all of the applicable
equality deletes being transformed into predicates, logically joined
into a single predicate and then bound to the schema (or `None` if there
are no applicable equality deletes)


This PR integrates the skeleton of the `DeleteFileManager` into
`ArrowReader::process_file_scan_task`, extending the `RowFilter` and
`RowSelection` logic to take into account any `RowFilter` that results
from equality deletes and any `RowSelection` that results from
positional deletes.

## Updates:
* refactored `DeleteFileManager` so that
`get_positional_delete_indexes_for_data_file` returns a `RoaringTreemap`
rather than a `Vec<usize>`. This was based on @liurenjie1024's
recommendation in a comment on the v1 PR, and makes a lot of sense from
a performance perspective and made it easier to implement
`ArrowReader::build_deletes_row_selection` in the follow-up PR to this
one, #951
* `DeleteFileManager` is instantiated in the `ArrowReader` constructor
rather than per-scan-task, so that delete files that apply to more than
one task don't end up getting loaded and parsed twice

## Potential further enhancements:
* Go one step further and move loading of delete files, and parsing of
positional delete files, into `ObjectCache` to ensure that loading and
parsing of the same files persists across scans
@liurenjie1024
Copy link
Contributor

cc @sdd Would you help to resolve conflicts first?

@sdd sdd force-pushed the feat/build-deletes-row-selection-implementation branch 2 times, most recently from 51ce760 to 309a6ea Compare March 27, 2025 22:55
@@ -83,7 +83,7 @@ port_scanner = "0.1.5"
rand = "0.8.5"
regex = "1.10.5"
reqwest = { version = "0.12.2", default-features = false, features = ["json"] }
roaring = "0.10"
roaring = { version = "0.10", git = "https://github.com/RoaringBitmap/roaring-rs.git" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting on this to be released by roaring-rs so that we don't need the git ref here

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd for this pr, just one minor suggestion to improve tests, others looks great!

@@ -1593,4 +1693,148 @@ message schema {

(file_io, schema, table_location, tmp_dir)
}

#[test]
fn test_build_deletes_row_selection() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not enough, should we consider using property test to generate random data to verify it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Will do.

@sdd sdd force-pushed the feat/build-deletes-row-selection-implementation branch from 309a6ea to caeb720 Compare April 3, 2025 07:28
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd , LGTM!

@liurenjie1024 liurenjie1024 merged commit 04a0d07 into apache:main Apr 7, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants