Delete Files in Table Scans #630
Hi, I've recently implemented merge-on-read in my library using iceberg-rust, and submitted a working simplified version of the code in #625, which looks somewhat similar to this proposal. About this issue, I have some doubts.
I'm happy to add the partitioning result to the task. This is useful to the executor node when deciding how to distribute tasks, as it enables the use of a few different strategies, the choice of which can be left to the implementer. It is not necessarily the case that the delete files are read repeatedly if the delete file list is added to the file scan task: we can store the parsed delete files inside the object cache, preventing them from being read repeatedly on the same node, as they'd already be in memory. If the executor ensures that all tasks with the same partition get sent to the same executor, then the files would only be read once.
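The caching idea described here - parsed delete files kept in memory and keyed by path, so that tasks landing on the same node never re-read the same file - reduces to a few lines. This is a hypothetical sketch, not iceberg-rust's actual `ObjectCache` API; `ParsedDeleteFile`, `DeleteFileCache`, and `get_or_parse` are illustrative names:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for a parsed positional/equality delete file.
#[derive(Clone)]
pub struct ParsedDeleteFile {
    pub path: String,
    pub deleted_rows: Vec<u64>,
}

// Minimal per-node cache: each delete file is parsed at most once,
// even if many file scan tasks on this node reference it.
#[derive(Default)]
pub struct DeleteFileCache {
    entries: Mutex<HashMap<String, Arc<ParsedDeleteFile>>>,
}

impl DeleteFileCache {
    /// Return the cached parse for `path`, invoking `parse` only on a miss.
    pub fn get_or_parse(
        &self,
        path: &str,
        parse: impl FnOnce() -> ParsedDeleteFile,
    ) -> Arc<ParsedDeleteFile> {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(path.to_string())
            .or_insert_with(|| Arc::new(parse()))
            .clone()
    }
}

pub fn demo() -> usize {
    let cache = DeleteFileCache::default();
    let mut parses = 0;
    for _task in 0..3 {
        // Three tasks reference the same delete file; it is parsed once.
        cache.get_or_parse("s3://bucket/deletes/d1.parquet", || {
            parses += 1;
            ParsedDeleteFile {
                path: "s3://bucket/deletes/d1.parquet".into(),
                deleted_rows: vec![0, 5],
            }
        });
    }
    parses
}

fn main() {
    assert_eq!(demo(), 1);
}
```

In a distributed setting this only helps once the scheduler co-locates tasks sharing delete files (e.g. by partition), which is exactly the strategy choice the comment leaves to the implementer.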
Thanks @sdd for raising this. The general approach looks good to me. The challenging part of delete file processing is filtering out unnecessary delete files in each task, which we can introduce as an optimization later.
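For the filtering optimization mentioned here, the Iceberg spec's applicability rules give the shape of the check: a delete file can only affect data files in the same partition, position deletes apply at an equal or greater data sequence number, and equality deletes only at a strictly greater one. A minimal sketch with simplified stand-in types (the real scan works with `ManifestEntry` metadata, not these structs):

```rust
// Simplified stand-ins for manifest-entry metadata.
#[derive(Clone, Copy, PartialEq)]
enum DeleteKind {
    Position,
    Equality,
}

struct DeleteFile {
    kind: DeleteKind,
    partition: u32, // simplified partition value
    sequence_number: u64,
}

struct DataFile {
    partition: u32,
    sequence_number: u64,
}

/// Keep only the delete files that can apply to `data`, per the
/// Iceberg spec's partition and sequence-number rules.
fn applicable<'a>(data: &DataFile, deletes: &'a [DeleteFile]) -> Vec<&'a DeleteFile> {
    deletes
        .iter()
        .filter(|d| d.partition == data.partition)
        .filter(|d| match d.kind {
            // Position deletes apply to data files written at or before them.
            DeleteKind::Position => d.sequence_number >= data.sequence_number,
            // Equality deletes apply only to strictly older data files.
            DeleteKind::Equality => d.sequence_number > data.sequence_number,
        })
        .collect()
}

fn main() {
    let data = DataFile { partition: 1, sequence_number: 5 };
    let deletes = vec![
        DeleteFile { kind: DeleteKind::Position, partition: 1, sequence_number: 5 }, // applies
        DeleteFile { kind: DeleteKind::Equality, partition: 1, sequence_number: 5 }, // too old
        DeleteFile { kind: DeleteKind::Position, partition: 2, sequence_number: 9 }, // wrong partition
    ];
    assert_eq!(applicable(&data, &deletes).len(), 1);
}
```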
Thanks - I have some skeleton code for the required changes to reader.rs that I'm going to share over the next few days as well.
Thanks for taking a look at the above, @liurenjie1024. I've just submitted a draft PR, #652, which outlines the second part of the approach - how we extend the filtering in the arrow reader to handle delete files. @liurenjie1024, @Xuanwo, @ZENOTME, @xxhZs: if you could take a look at that PR as well when you get a chance and let me know if you think that the approach seems sensible, that would be great!
Hi all. I'm resurrecting this issue now that @Fokko has kindly helped get the first part of this over the line by reviewing and merging #652. I have a branch with an earlier iteration of delete file read support that I'm intending to break up into pieces and submit as separate PRs. There are parts of it that I'm happy with and other parts that I'm less happy with; plus, now would be a good opportunity to discuss the higher-level structure of the approach again, now that I've got a better idea of the different parts of work involved.

Outline
OK, I have an improved design for loading of delete files in the read phase that I'll share shortly. We introduce a `DeleteFileManager`, constructed when `ArrowReader` gets built and provided with a `FileIO`. The reader keeps an `Arc` of this that it clones and passes to `process_file_scan_task`. `process_file_scan_task` calls an async method of `DeleteFileManager`, passing in the delete file list for its file scan task. `DeleteFileManager` loads and processes the delete files, deduplicating between multiple file scan tasks that reference the same delete files. `DeleteFileManager` exposes two methods that `process_file_scan_task` calls later on - one to retrieve the list of positional delete row indices that apply to a specified data file, and another to get a filter predicate derived from the applicable delete files.
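A rough sketch of that interface, with illustrative names and simplified types - the real `DeleteFileManager` is async, holds a `FileIO`, and hands back a `RoaringTreemap` and a `BoundPredicate`; here a `BTreeSet` and a string predicate stand in:

```rust
use std::collections::{BTreeSet, HashMap};

// Stand-in for iceberg-rust's BoundPredicate.
pub struct EqualityDeletePredicate(pub String);

#[derive(Default)]
pub struct DeleteFileManager {
    // data file path -> rows positionally deleted in that file
    // (a RoaringTreemap in the real implementation)
    positional: HashMap<String, BTreeSet<u64>>,
    // predicates derived from equality delete files
    equality: Vec<String>,
}

impl DeleteFileManager {
    /// Stand-in for the async "load and parse delete files" step.
    pub fn load_positional(&mut self, data_file: &str, rows: &[u64]) {
        self.positional.entry(data_file.into()).or_default().extend(rows);
    }

    pub fn load_equality(&mut self, predicate: &str) {
        self.equality.push(predicate.to_string());
    }

    /// Rows positionally deleted in `data_file`, or None if there are none.
    pub fn get_delete_vector(&self, data_file: &str) -> Option<&BTreeSet<u64>> {
        self.positional.get(data_file)
    }

    /// All equality deletes joined into one filter, or None if there are none.
    pub fn build_delete_predicate(&self) -> Option<EqualityDeletePredicate> {
        if self.equality.is_empty() {
            None
        } else {
            // Rows matching an equality delete must be filtered OUT,
            // so negate each match and AND the results together.
            let joined = self
                .equality
                .iter()
                .map(|p| format!("NOT ({p})"))
                .collect::<Vec<_>>()
                .join(" AND ");
            Some(EqualityDeletePredicate(joined))
        }
    }
}

fn main() {
    let mut mgr = DeleteFileManager::default();
    mgr.load_positional("data/a.parquet", &[2, 2, 9]);
    mgr.load_equality("id = 5");
    // duplicate index 2 is deduplicated
    assert_eq!(mgr.get_delete_vector("data/a.parquet").map(|s| s.len()), Some(2));
    assert!(mgr.get_delete_vector("data/b.parquet").is_none());
    assert_eq!(mgr.build_delete_predicate().unwrap().0, "NOT (id = 5)");
}
```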
Thanks for this great job! @sdd Should we also consider the case that |
@ZENOTME I think that we'll want to do that at some point, but it feels like more of a day-2 task. We're not touching the disk anywhere in the library so far, as far as I know, and so it would need some careful consideration.
I've worked on an improved design for loading and parsing of delete files by the
…se in `ArrowReader` (#950)

Second part of delete file read support. See #630.

This PR provides the basis for delete file support within `ArrowReader`. `DeleteFileManager` is introduced, in skeleton form. Full implementation of its behaviour will be submitted in follow-up PRs.

`DeleteFileManager` is responsible for loading and parsing positional and equality delete files from `FileIO`. Once delete files for a task have been loaded and parsed, `ArrowReader::process_file_scan_task` uses the resulting `DeleteFileManager` in two places:

* `DeleteFileManager::get_delete_vector_for_task` is passed a data file path and returns an ~~`Option<Vec<usize>>`~~ `Option<RoaringTreemap>` containing the indices of all rows that are positionally deleted in that data file (or `None` if there are none)
* `DeleteFileManager::build_delete_predicate` is invoked with the schema from the file scan task. It returns an `Option<BoundPredicate>` representing the filter predicate derived from all of the applicable equality deletes being transformed into predicates, logically joined into a single predicate, and then bound to the schema (or `None` if there are no applicable equality deletes)

This PR integrates the skeleton of the `DeleteFileManager` into `ArrowReader::process_file_scan_task`, extending the `RowFilter` and `RowSelection` logic to take into account any `RowFilter` that results from equality deletes and any `RowSelection` that results from positional deletes.

## Updates

* Refactored `DeleteFileManager` so that `get_positional_delete_indexes_for_data_file` returns a `RoaringTreemap` rather than a `Vec<usize>`. This was based on @liurenjie1024's recommendation in a comment on the v1 PR; it makes a lot of sense from a performance perspective and made it easier to implement `ArrowReader::build_deletes_row_selection` in the follow-up PR to this one, #951
* `DeleteFileManager` is instantiated in the `ArrowReader` constructor rather than per scan task, so that delete files that apply to more than one task don't end up getting loaded and parsed twice

## Potential further enhancements

* Go one step further and move loading of delete files, and parsing of positional delete files, into `ObjectCache` to ensure that loading and parsing of the same files persists across scans
…` implementation (#951)

Third part of delete file read support. See #630. Builds on top of #950.

`build_deletes_row_selection` computes a `RowSelection` from a `RoaringTreemap` representing the indexes of rows in a data file that have been marked as deleted by positional delete files that apply to the data file being read (and, in the future, delete vectors).

The resulting `RowSelection` will be merged with a `RowSelection` resulting from the scan's filter predicate (if present) and supplied to the `ParquetRecordBatchStreamBuilder` so that deleted rows are omitted from the `RecordBatchStream` returned by the reader.

NB: I encountered quite a few edge cases in this method and the logic is quite complex. There is a good chance that a keen-eyed reviewer would be able to conceive of an edge case that I haven't covered.

---------

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
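The core idea behind `build_deletes_row_selection` can be illustrated independently of arrow-rs: walk the sorted deleted indices once, emitting alternating select/skip run lengths. `RowSelector` below is a simplified stand-in for the arrow-rs type, and the edge-case handling in the real method (row-group boundaries, merging with the predicate's selection) is omitted:

```rust
/// Simplified stand-in for arrow-rs's RowSelector.
#[derive(Debug, PartialEq)]
enum RowSelector {
    Select(u64), // emit this many rows
    Skip(u64),   // omit this many rows
}

/// Turn a sorted list of deleted row indices into alternating
/// select/skip runs covering all `num_rows` rows of the data file.
fn deletes_to_selection(deleted: &[u64], num_rows: u64) -> Vec<RowSelector> {
    let mut out = Vec::new();
    let mut cursor = 0;
    for &d in deleted {
        if d >= num_rows {
            break; // ignore deletes past the end of the file
        }
        if d > cursor {
            // undeleted rows between the cursor and this delete
            out.push(RowSelector::Select(d - cursor));
        }
        // extend the previous skip run if the deletes are adjacent
        if let Some(RowSelector::Skip(n)) = out.last_mut() {
            *n += 1;
        } else {
            out.push(RowSelector::Skip(1));
        }
        cursor = d + 1;
    }
    if cursor < num_rows {
        out.push(RowSelector::Select(num_rows - cursor));
    }
    out
}

fn main() {
    // rows 0..10 with rows 2, 3 and 7 deleted
    let sel = deletes_to_selection(&[2, 3, 7], 10);
    assert_eq!(
        sel,
        vec![
            RowSelector::Select(2),
            RowSelector::Skip(2),
            RowSelector::Select(3),
            RowSelector::Skip(1),
            RowSelector::Select(2),
        ]
    );
}
```

The run lengths always sum to `num_rows`, which is what lets the reader hand the selection straight to the Parquet decoder instead of materializing deleted rows and filtering afterwards.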
I'm looking to start work on proper handling of delete files in table scans and so I'd like to open an issue to discuss some of the design decisions.
A core tenet of our approach so far has been to ensure that the tasks produced by the file plan are small, independent and self-contained, so that they can be easily distributed in architectures where the service that generates the file plan could be on a different machine to the service(s) that perform the file reads.
The `FileScanTask` struct represents these individual units of work at present. Currently, though, its shape is focussed on data files and it does not cater for including information on delete files produced by the scan. Here's how it looks now, for reference: `iceberg-rust/crates/iceberg/src/scan.rs`, lines 859 to 886 in cde35ab.
In order to properly process delete files as part of executing a scan task, executors will now need to load in any applicable delete files along with the data file that they are processing. I'll outline what happens now, and follow that with my proposed approach.
Current TableScan Synopsis

The current structure pushes all manifest file entries from the manifest list into a stream, which we then process concurrently in order to retrieve their associated manifests. Once retrieved, each manifest has each of its manifest entries extracted and pushed onto a channel so that they can be processed in parallel. Each is embedded inside a context object that contains the relevant information needed for processing of the manifest entry. Tokio tasks listening to the channel then execute `TableScan::process_manifest_entry` on these objects, where we filter out any entries that do not match the scan filter predicate.

At this point, a `FileScanTask` is created for each of those entries that match the scan predicate. The `FileScanTask`s are then pushed into a channel that produces the stream of `FileScanTask`s returned to the original caller of `plan_files`.

Changes to TableScan

`FileScanTask`

Each `FileScanTask` represents a scan to be performed on a single data file. However, multiple delete files may need to be applied to any one data file. Additionally, the scope of applicability of a delete file is any data file within the same partition as the delete file - i.e. the same delete file may need to be applied to multiple data files. Thus an executor needs to know not just the data file that it is processing, but all of the delete files that are applicable to that data file.

The first part of the set of changes that I'm proposing is to refactor `FileScanTask` so that it represents a single data file and zero or more delete files:

* The `data_file_content` property would be removed - each task is implicitly about a file of type `Data`.
* A new struct, `DeleteFileEntry`, would be added. It would look something like this:
* A `delete_files` property of type `Vec<DeleteFileEntry>` would be added to `FileScanTask` to represent the delete files that are applicable to its data file.

`TableScan::plan_files` and associated methods

We need to update this logic in order to ensure that we can properly populate the new `delete_files` property. Each `ManifestEntryContext` will need the list of delete files so that, if the manifest entry that it encapsulates passes the filtering steps, it can populate the new `delete_files` property when it constructs the `FileScanTask`.

A naive approach may be to simply build a list of all of the delete files referred to by the top-level manifest list and give references to this list to all `ManifestEntryContext`s so that, if any delete files are present, all of them are included in every `FileScanTask`. This would be a good first step - code that works inefficiently is better than code that does not work at all! It would also permit work to proceed on the execution side. Improvements could then be made to refine this approach, filtering out the delete files that are inapplicable to each `FileScanTask`'s data file before populating its `delete_files` property.

How does this sound so far, @liurenjie1024, @Xuanwo, @ZENOTME, @Fokko?
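For the snippet elided above ("It would look something like this:"), here is a hedged guess at the shape being proposed - the field names and the enum are illustrative only, not the API that was eventually merged:

```rust
// Illustrative sketch of the proposed refactor; names are guesses.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeleteFileType {
    PositionDeletes,
    EqualityDeletes,
}

#[derive(Debug, Clone)]
pub struct DeleteFileEntry {
    /// Path of the positional or equality delete file.
    pub file_path: String,
    /// Positional vs equality delete.
    pub file_type: DeleteFileType,
}

#[derive(Debug, Clone)]
pub struct FileScanTask {
    /// The single data file this task scans. `data_file_content` is gone:
    /// a task is implicitly about a file of type Data.
    pub data_file_path: String,
    /// Zero or more delete files applicable to that data file.
    pub delete_files: Vec<DeleteFileEntry>,
}

fn main() {
    let task = FileScanTask {
        data_file_path: "data/part-0.parquet".into(),
        delete_files: vec![DeleteFileEntry {
            file_path: "deletes/pd-1.parquet".into(),
            file_type: DeleteFileType::PositionDeletes,
        }],
    };
    assert_eq!(task.delete_files.len(), 1);
}
```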