-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scan Delete Support Part 3: ArrowReader::build_deletes_row_selection
implementation
#951
Scan Delete Support Part 3: ArrowReader::build_deletes_row_selection
implementation
#951
Conversation
ArrowReader::build_deletes_row_selection
implementation
f4b6d94
to
a52fe50
Compare
2fc5f70
to
26dc78f
Compare
ArrowReader::build_deletes_row_selection
implementationArrowReader::build_deletes_row_selection
implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, need more eyes, might've missed something
2029d21
to
79b2162
Compare
…se in `ArrowReader` (#950) Second part of delete file read support. See #630. This PR provides the basis for delete file support within `ArrowReader`. `DeleteFileManager` is introduced, in skeleton form. Full implementation of its behaviour will be submitted in follow-up PRs. `DeleteFileManager` is responsible for loading and parsing positional and equality delete files from `FileIO`. Once delete files for a task have been loaded and parsed, `ArrowReader::process_file_scan_task` uses the resulting `DeleteFileManager` in two places: * `DeleteFileManager::get_delete_vector_for_task` is passed a data file path and will return an ~`Option<Vec<usize>>`~ `Option<RoaringTreeMap>` containing the indices of all rows that are positionally deleted in that data file (or `None` if there are none) * `DeleteFileManager::build_delete_predicate` is invoked with the schema from the file scan task. It will return an `Option<BoundPredicate>` representing the filter predicate derived from all of the applicable equality deletes being transformed into predicates, logically joined into a single predicate and then bound to the schema (or `None` if there are no applicable equality deletes) This PR integrates the skeleton of the `DeleteFileManager` into `ArrowReader::process_file_scan_task`, extending the `RowFilter` and `RowSelection` logic to take into account any `RowFilter` that results from equality deletes and any `RowSelection` that results from positional deletes. ## Updates: * refactored `DeleteFileManager` so that `get_positional_delete_indexes_for_data_file` returns a `RoaringTreemap` rather than a `Vec<usize>`. This was based on @liurenjie1024's recommendation in a comment on the v1 PR, and makes a lot of sense from a performance perspective and made it easier to implement `ArrowReader::build_deletes_row_selection` in the follow-up PR to this one, #951 * `DeleteFileManager` is instantiated in the `ArrowReader` constructor rather than per-scan-task, so that delete files that apply to more than one task don't end up getting loaded and parsed twice ## Potential further enhancements: * Go one step further and move loading of delete files, and parsing of positional delete files, into `ObjectCache` to ensure that loading and parsing of the same files persists across scans
cc @sdd Would you help to resolve conflicts first? |
51ce760
to
309a6ea
Compare
@@ -83,7 +83,7 @@ port_scanner = "0.1.5" | |||
rand = "0.8.5" | |||
regex = "1.10.5" | |||
reqwest = { version = "0.12.2", default-features = false, features = ["json"] } | |||
roaring = "0.10" | |||
roaring = { version = "0.10", git = "https://github.com/RoaringBitmap/roaring-rs.git" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting on this to be released by roaring-rs so that we don't need the git ref here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd for this pr, just one minor suggestion to improve tests, others looks great!
@@ -1593,4 +1693,148 @@ message schema { | |||
|
|||
(file_io, schema, table_location, tmp_dir) | |||
} | |||
|
|||
#[test] | |||
fn test_build_deletes_row_selection() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not enough, should we consider using property test to generate random data to verify it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Will do.
309a6ea
to
caeb720
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd , LGTM!
Third part of delete file read support. See #630
**Builds on top of #950
build_deletes_row_selection
computes aRowSelection
from aRoaringTreemap
representing the indexes of rows in a data file that have been marked as deleted by positional delete files that apply to the data file being read (and, in the future, delete vectors).The resulting
RowSelection
will be merged with aRowSelection
resulting from the scan's filter predicate (if present) and supplied to theParquetRecordBatchStreamBuilder
so that deleted rows are omitted from theRecordBatchStream
returned by the reader.NB: I encountered quite a few edge cases in this method and the logic is quite complex. There is a good chance that a keen-eyed reviewer would be able to conceive of an edge-case that I haven't covered.