This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Refactored parquet statistics deserialization #962

Merged
merged 3 commits into from
Apr 25, 2022

jorgecarleitao (Owner) commented Apr 25, 2022

This PR modifies the APIs around parquet statistics to make it easier for consumers to use them.

Background

Parquet stores statistics per row group: two values (min, max) for each column chunk of each row group.

We currently deserialize them to arrow so that arrow-dedicated operators can be used on them based on their logical type, e.g. for filter pushdown. However, we do it one row group at a time (i.e. one value per row group).

This has 3 problems:

  1. Most consumers of arrow2 have operators based on arrow arrays, not individual values. This usually means that they need to write dedicated operators for handling (scalar) statistics.
  2. We need to maintain physical representations specific to parquet statistics.
  3. We need to maintain parquet statistics-specific logic to handle nested types (we currently do not support this).

This PR

This PR refactors the whole parquet statistics API to leverage arrow arrays. The core ideas are:

  • because a parquet file is composed of multiple row groups, we create 2 arrow arrays (min and max) per column, containing the statistics of all the row groups in the file. The length of each array equals the number of row groups in the file.
  • nested types are deserialized to the corresponding arrow-equivalent nested types.
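
To illustrate the first idea, here is a minimal sketch using only the standard library. It is not the arrow2 implementation: `ChunkStats`, `ColumnStats` and `collect_column_stats` are hypothetical names, and plain `Vec<Option<_>>` stands in for arrow arrays. It shows how per-row-group values for one column could be gathered into arrays with one slot per row group.

```rust
/// Hypothetical stand-in for the statistics of one column chunk in one row group.
/// `None` models a statistic that the writer did not record.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ChunkStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: Option<u64>,
}

/// Hypothetical stand-in for the per-column result: one slot per row group.
#[derive(Debug, Default, PartialEq)]
pub struct ColumnStats {
    pub min: Vec<Option<i64>>,
    pub max: Vec<Option<i64>>,
    pub null_count: Vec<Option<u64>>,
}

/// Gathers the statistics of a single column across all row groups of a file.
/// Every output vector ends up with `row_groups.len()` entries.
pub fn collect_column_stats(row_groups: &[ChunkStats]) -> ColumnStats {
    let mut out = ColumnStats::default();
    for chunk in row_groups {
        out.min.push(chunk.min);
        out.max.push(chunk.max);
        out.null_count.push(chunk.null_count);
    }
    out
}
```

A row group without recorded statistics simply contributes a `None` slot, so the array length still matches the number of row groups.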

For example, a file with a single row group and one int64 column yields:

use arrow2::array::{Int64Array, UInt64Array};
use arrow2::io::parquet::read::statistics::{Count, Statistics};

Statistics {
    distinct_count: Count::Single(UInt64Array::from([None])),
    null_count: Count::Single(UInt64Array::from([Some(3)])),
    min_value: Box::new(Int64Array::from_slice([-256])),
    max_value: Box::new(Int64Array::from_slice([9])),
}

Statistics has the following invariants:

  • null_count.len() == distinct_count.len() == min_value.len() == max_value.len()
  • min_value.data_type() == max_value.data_type()
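
The two invariants can be written as a small check. This is only a sketch: `SketchArray`, `SketchType`, `SketchStatistics` and `check_invariants` are hypothetical names that model an array as a type tag plus a length, and they are not part of arrow2.

```rust
/// Hypothetical logical type tag, standing in for an arrow data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchType {
    Int64,
    UInt64,
    Utf8,
}

/// Hypothetical, simplified stand-in for an arrow array: a type and a length.
#[derive(Debug, Clone, Copy)]
pub struct SketchArray {
    pub data_type: SketchType,
    pub len: usize,
}

/// Hypothetical model of the four statistics arrays of one column.
pub struct SketchStatistics {
    pub null_count: SketchArray,
    pub distinct_count: SketchArray,
    pub min_value: SketchArray,
    pub max_value: SketchArray,
}

/// Checks that all four arrays have the same length and that
/// `min_value` and `max_value` share a data type.
pub fn check_invariants(s: &SketchStatistics) -> Result<(), String> {
    let n = s.min_value.len;
    let lens = [s.null_count.len, s.distinct_count.len, s.max_value.len];
    if lens.iter().any(|&l| l != n) {
        return Err(format!("length mismatch: expected {} entries in every array", n));
    }
    if s.min_value.data_type != s.max_value.data_type {
        return Err("min_value and max_value have different data types".to_string());
    }
    Ok(())
}
```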

The idea is that consumers can easily compute row group pruning via the min_value and max_value.
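
As a rough sketch of what such pruning could look like for an equality predicate, the function below works on plain slices rather than arrow arrays. `may_contain` is a hypothetical helper, not an arrow2 function, and a consumer would normally express the same comparison with its own array-based operators.

```rust
/// For each row group, returns whether it may contain rows where the column
/// equals `value`, based only on the per-row-group min and max statistics.
/// A row group with a missing statistic is conservatively kept (`true`).
pub fn may_contain(min: &[Option<i64>], max: &[Option<i64>], value: i64) -> Vec<bool> {
    assert_eq!(min.len(), max.len(), "min and max must have one entry per row group");
    min.iter()
        .zip(max)
        .map(|(lo, hi)| match (lo, hi) {
            // Both bounds known: keep the group only if `value` lies in [lo, hi].
            (Some(lo), Some(hi)) => *lo <= value && value <= *hi,
            // Unknown bounds: the group cannot be ruled out.
            _ => true,
        })
        .collect()
}
```

Row groups mapped to `false` can be skipped entirely; the remaining ones still need to be read and filtered, since min and max only bound the values.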

Note that Count can be either a UInt64Array or a StructArray, since struct arrays have null and distinct counts for each of their fields.
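
The shape of this two-variant idea can be sketched with a recursive enum. `CountSketch` below is a hypothetical simplification built on `Vec`, not the arrow2 `Count` type: a flat column carries one optional count per row group, while a struct column carries one nested count per field.

```rust
/// Hypothetical simplification of a count statistic.
#[derive(Debug, Clone, PartialEq)]
pub enum CountSketch {
    /// Flat column: one optional count per row group.
    Single(Vec<Option<u64>>),
    /// Struct column: one (field name, count) pair per field; fields may nest.
    Struct(Vec<(String, CountSketch)>),
}

impl CountSketch {
    /// Flattens the counts into (dotted field path, per-row-group counts) pairs,
    /// starting from the column name `path`.
    pub fn leaves(&self, path: &str) -> Vec<(String, Vec<Option<u64>>)> {
        match self {
            CountSketch::Single(values) => vec![(path.to_string(), values.clone())],
            CountSketch::Struct(fields) => fields
                .iter()
                .flat_map(|(name, child)| child.leaves(&format!("{}.{}", path, name)))
                .collect(),
        }
    }
}
```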

codecov bot commented Apr 25, 2022

Codecov Report

Merging #962 (c9fe387) into main (4860ee5) will increase coverage by 0.49%.
The diff coverage is 82.55%.

@@            Coverage Diff             @@
##             main     #962      +/-   ##
==========================================
+ Coverage   71.04%   71.53%   +0.49%     
==========================================
  Files         351      355       +4     
  Lines       19446    19680     +234     
==========================================
+ Hits        13815    14078     +263     
+ Misses       5631     5602      -29     
| Impacted Files | Coverage Δ |
|---|---|
| src/io/parquet/write/mod.rs | 58.51% <45.45%> (-1.96%) ⬇️ |
| src/io/parquet/read/statistics/dictionary.rs | 46.15% <46.15%> (ø) |
| src/io/parquet/read/statistics/list.rs | 55.17% <55.17%> (ø) |
| src/io/parquet/read/statistics/struct_.rs | 61.11% <61.11%> (ø) |
| src/io/parquet/write/primitive/basic.rs | 84.21% <66.66%> (+6.85%) ⬆️ |
| src/io/parquet/write/primitive/nested.rs | 83.33% <66.66%> (-3.04%) ⬇️ |
| src/io/parquet/read/statistics/utf8.rs | 81.81% <81.81%> (ø) |
| src/io/parquet/read/statistics/mod.rs | 94.07% <94.07%> (+15.13%) ⬆️ |
| src/io/parquet/read/statistics/binary.rs | 100.00% <100.00%> (+34.48%) ⬆️ |
| src/io/parquet/read/statistics/boolean.rs | 100.00% <100.00%> (+33.33%) ⬆️ |

... and 16 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao added the "feature" label Apr 25, 2022
jorgecarleitao merged commit bb4f7d8 into main Apr 25, 2022
jorgecarleitao deleted the stats2 branch April 25, 2022 19:33
jorgecarleitao changed the title from "Improved parquet stats deserialization" to "Refactored parquet statistics deserialization" Apr 27, 2022
jorgecarleitao added the "backwards-incompatible" label and removed the "feature" label Apr 27, 2022
ygf11 pushed a commit to ygf11/arrow2 that referenced this pull request Apr 28, 2022
* Fixed stats

* Fixed error in decimal stats

* Fixed struct stats