This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support for parquet sidecar to FileReader #1215

Merged 4 commits on Aug 9, 2022

@jorgecarleitao (Owner) commented Aug 8, 2022

Initializing a FileReader currently:

  • is IO-bound, because it reads the metadata
  • infers an Arrow schema (and fails if it can't infer one)
  • clones the FileMetadata to have it available for users
  • holds an Arc<Fn> for filtering row groups, which is not useful in most use cases I have seen so far

This PR refactors FileReader::try_new into FileReader::new, giving its users more flexibility in how the file is read.

The main idea is that FileReader is now initialized as follows:

    // we can read its metadata:
    let metadata = read::read_metadata(&mut reader)?;

    // and infer a [`Schema`] from the `metadata`.
    let schema = read::infer_schema(&metadata)?;

    // we can filter the columns we need (here we select all)
    // (projection pushdown)
    let schema = schema.filter(|_index, _field| true);

    // we can read the statistics of all parquet's row groups (here for the first field)
    let statistics = read::statistics::deserialize(&schema.fields[0], &metadata.row_groups)?;

    // say we found that we only need to read the first two row groups, "0" and "1"
    // (row-group filter pushdown)
    let row_groups = metadata
        .row_groups
        .into_iter()
        .enumerate()
        .filter(|(index, _)| *index == 0 || *index == 1)
        .map(|(_, row_group)| row_group)
        .collect();

    // we can then read the row groups into chunks
    let chunks = read::FileReader::new(reader, row_groups, schema, Some(1024 * 8 * 8), None);

This gives the user the flexibility to prepare the metadata (Schema and row_groups) before reading the file. When no predicate pushdown is used, all row groups are passed as they are: let row_groups = metadata.row_groups.
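As a rough illustration of the row-group filter step, the sketch below selects row groups by their min/max statistics before anything is read. `RowGroup` and `select_row_groups` are hypothetical stand-ins written for this example, not arrow2's `RowGroupMetaData` or any function from this PR; the only assumption is that each row group carries a minimum and a maximum for the column of interest.

```rust
/// Hypothetical stand-in for a row group's metadata (NOT arrow2's `RowGroupMetaData`).
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroup {
    pub num_rows: usize,
    /// Minimum of the column of interest, as recorded in the statistics.
    pub min: i64,
    /// Maximum of the column of interest, as recorded in the statistics.
    pub max: i64,
}

/// Keeps only the row groups whose `[min, max]` range can contain `value`.
/// Row groups that cannot contain it are dropped without being read.
pub fn select_row_groups(row_groups: Vec<RowGroup>, value: i64) -> Vec<RowGroup> {
    row_groups
        .into_iter()
        .filter(|row_group| row_group.min <= value && value <= row_group.max)
        .collect()
}
```

The resulting vector plays the role of the row_groups argument above: the reader only ever sees the row groups that survived the filter.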

This also allows row groups defined somewhere else (e.g. a parquet sidecar) to be used.
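To make the sidecar idea concrete, here is a minimal, simplified sketch of a reader that takes its row-group descriptions as an argument, so it does not matter whether they came from the data file's own footer or from a separate file. `RowGroupLocation` and `read_row_groups` are hypothetical names invented for this example, and describing a row group as a single byte range is a simplification of what parquet metadata actually stores.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in describing where a row group lives in a data file.
/// Real parquet metadata is richer; this is a simplification for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowGroupLocation {
    pub offset: u64,
    pub length: usize,
}

/// Reads the raw bytes of each listed row group from `reader`.
/// The caller decides where `row_groups` came from (footer, sidecar, ...).
pub fn read_row_groups<R: Read + Seek>(
    reader: &mut R,
    row_groups: &[RowGroupLocation],
) -> io::Result<Vec<Vec<u8>>> {
    row_groups
        .iter()
        .map(|row_group| -> io::Result<Vec<u8>> {
            // Jump to the start of the row group and read exactly its bytes.
            reader.seek(SeekFrom::Start(row_group.offset))?;
            let mut buffer = vec![0u8; row_group.length];
            reader.read_exact(&mut buffer)?;
            Ok(buffer)
        })
        .collect()
}
```

Because the locations are plain data passed in by the caller, they can be deserialized from anywhere before the data file is opened.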

codecov bot commented Aug 8, 2022

Codecov Report

Merging #1215 (34c3539) into main (56189bd) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1215      +/-   ##
==========================================
+ Coverage   83.14%   83.18%   +0.04%     
==========================================
  Files         358      358              
  Lines       37297    37255      -42     
==========================================
- Hits        31010    30992      -18     
+ Misses       6287     6263      -24     
Impacted Files                      Coverage Δ
src/datatypes/schema.rs             80.00% <100.00%> (+26.15%) ⬆️
src/io/parquet/read/file.rs         90.62% <100.00%> (+3.04%) ⬆️
src/io/parquet/read/row_group.rs    100.00% <100.00%> (ø)
src/io/ipc/read/file.rs             97.32% <0.00%> (+0.44%) ⬆️
src/io/ipc/read/stream_async.rs     76.71% <0.00%> (+0.68%) ⬆️
src/io/ipc/read/file_async.rs       61.19% <0.00%> (+0.74%) ⬆️
src/array/utf8/mod.rs               83.64% <0.00%> (+0.92%) ⬆️
src/array/binary/mod.rs             90.12% <0.00%> (+1.23%) ⬆️
src/io/ipc/read/array/boolean.rs    98.14% <0.00%> (+7.40%) ⬆️

@jorgecarleitao jorgecarleitao marked this pull request as ready for review August 8, 2022 21:49
@jorgecarleitao jorgecarleitao added the "feature" (A new feature) label and removed the "backwards-incompatible" label Aug 9, 2022
@jorgecarleitao jorgecarleitao changed the title from "Decoupled parquet's FileReader constructor from Metadata, GroupFilter and Schema" to "Add support for parquet sidecar for FileReader" Aug 9, 2022
@jorgecarleitao jorgecarleitao changed the title from "Add support for parquet sidecar for FileReader" to "Add support for parquet sidecar to FileReader" Aug 9, 2022
@jorgecarleitao jorgecarleitao changed the title from "Add support for parquet sidecar to FileReader" to "Added support for parquet sidecar to FileReader" Aug 9, 2022
@jorgecarleitao jorgecarleitao merged commit 838deca into main Aug 9, 2022
@jorgecarleitao jorgecarleitao deleted the differet_arrow branch August 9, 2022 04:29