Refactored parquet statistics deserialization #962

jorgecarleitao · 2022-04-25T07:04:38Z

This PR modifies the APIs around parquet statistics to make it easier for consumers to use them.

Background

Parquet statistics are stored on a per-row group basis, as two values (min, max) per (row group, column chunk).

We currently deserialize them to arrow so that arrow-dedicated operators can be used on them based on their logical type, e.g. for filter pushdown. However, we do it one row group at a time (i.e. one scalar value per row group).

This has 3 problems:

* Most consumers of arrow2 have operators based on arrow arrays, not individual values. This usually means that they need to write dedicated operators to handle (scalar) statistics.
* We need to maintain physical representations specific to parquet statistics.
* We need to maintain parquet statistics-specific logic to handle nested types (we currently do not support this).

This PR

This PR refactors the whole parquet statistics API to leverage arrow arrays. The core ideas are:

* Because a parquet file is composed of multiple row groups, we create 2 arrow arrays (min and max) per column containing the statistics of all the row groups in the file. The length of each array equals the number of row groups in the file.
* Nested types are deserialized to the corresponding arrow-equivalent nested types.

For example, a single row group with a column of int64 has the following statistics:

use arrow2::parquet::read::statistics::{Statistics, Count};

Statistics {
    distinct_count: Count::Single(UInt64Array::from([None])),
    null_count: Count::Single(UInt64Array::from([Some(3)])),
    min_value: Box::new(Int64Array::from_slice([-256])),
    max_value: Box::new(Int64Array::from_slice([9])),
}

Statistics has the following invariants:

null_count.len() == distinct_count.len() == min_value.len() == max_value.len()
min_value.data_type() == max_value.data_type()
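
To make the length invariant concrete, here is a minimal sketch that uses plain vectors instead of arrow arrays. `ColumnStats`, `RowGroupStats`, `from_row_groups` and `is_consistent` are hypothetical names chosen for illustration; they are not part of arrow2. Because `min_value` and `max_value` share one Rust type here, the data type invariant holds trivially in this sketch.

```rust
/// Hypothetical stand-in for the per-column statistics described above.
/// Plain vectors replace arrow arrays; slot `i` belongs to row group `i`.
#[derive(Debug, PartialEq)]
pub struct ColumnStats {
    pub null_count: Vec<Option<u64>>,
    pub distinct_count: Vec<Option<u64>>,
    pub min_value: Vec<Option<i64>>,
    pub max_value: Vec<Option<i64>>,
}

/// Hypothetical per-row-group statistics of a single int64 column.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub null_count: Option<u64>,
    pub distinct_count: Option<u64>,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

impl ColumnStats {
    /// Collects one entry per row group into each of the four vectors,
    /// so all four have the same length by construction.
    pub fn from_row_groups(groups: &[RowGroupStats]) -> Self {
        Self {
            null_count: groups.iter().map(|g| g.null_count).collect(),
            distinct_count: groups.iter().map(|g| g.distinct_count).collect(),
            min_value: groups.iter().map(|g| g.min).collect(),
            max_value: groups.iter().map(|g| g.max).collect(),
        }
    }

    /// Returns true when all four vectors have the same length.
    pub fn is_consistent(&self) -> bool {
        let n = self.null_count.len();
        self.distinct_count.len() == n
            && self.min_value.len() == n
            && self.max_value.len() == n
    }
}
```

Building the four vectors from the same slice of row groups is one way to guarantee the invariant up front rather than checking it afterwards.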

The idea is that consumers can easily compute row group pruning via the min_value and max_value.
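
As an illustration of that pruning, the sketch below decides which row groups could contain a given int64 value. It assumes the min and max statistics have already been copied into plain slices with one optional entry per row group; `may_contain` is a hypothetical helper, not an arrow2 function, and a real consumer would more likely run its own array comparison kernels on `min_value` and `max_value`.

```rust
/// Returns, for each row group, whether it may contain a value equal to `target`.
///
/// `min` and `max` hold one optional entry per row group (a stand-in for the
/// `min_value` and `max_value` arrays). A missing bound is treated
/// conservatively: it never causes a row group to be pruned.
pub fn may_contain(min: &[Option<i64>], max: &[Option<i64>], target: i64) -> Vec<bool> {
    assert_eq!(min.len(), max.len(), "one min and one max per row group");
    min.iter()
        .zip(max.iter())
        .map(|(&lo, &hi)| {
            // An unknown bound cannot rule the row group out.
            let above_min = lo.map_or(true, |lo| lo <= target);
            let below_max = hi.map_or(true, |hi| target <= hi);
            above_min && below_max
        })
        .collect()
}
```

Row groups whose entry is `false` can be skipped entirely; the rest still need to be read and filtered, since statistics only bound the values.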

Note that Count can be either a UInt64Array or a StructArray, since struct arrays have null and distinct counts for each of their fields.
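
The shape of such a two-variant count can be sketched with plain vectors. The enum below is a hypothetical stand-in that only mirrors the idea; the real `Count` wraps arrow arrays, and the `row_groups` helper is an assumption made for this example.

```rust
/// Hypothetical stand-in for the `Count` type mentioned above, with plain
/// vectors in place of `UInt64Array` and `StructArray`.
#[derive(Debug, Clone, PartialEq)]
pub enum Count {
    /// One optional count per row group, for non-struct columns.
    Single(Vec<Option<u64>>),
    /// One named `Count` per struct field; each covers every row group.
    Struct(Vec<(String, Count)>),
}

impl Count {
    /// Number of row groups covered. Returns `None` for a struct with no
    /// fields or whose fields disagree on the number of row groups.
    pub fn row_groups(&self) -> Option<usize> {
        match self {
            Count::Single(values) => Some(values.len()),
            Count::Struct(fields) => {
                let mut lens = fields.iter().map(|(_, count)| count.row_groups());
                // First `?` handles "no fields", second an inconsistent child.
                let first = lens.next()??;
                if lens.all(|len| len == Some(first)) {
                    Some(first)
                } else {
                    None
                }
            }
        }
    }
}
```

Keeping the struct variant recursive lets nested structs carry per-field counts at every level while still exposing a single row group count to the caller.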

codecov · 2022-04-25T07:14:05Z

Codecov Report

Merging #962 (c9fe387) into main (4860ee5) will increase coverage by 0.49%.
The diff coverage is 82.55%.

@@            Coverage Diff             @@
##             main     #962      +/-   ##
==========================================
+ Coverage   71.04%   71.53%   +0.49%     
==========================================
  Files         351      355       +4     
  Lines       19446    19680     +234     
==========================================
+ Hits        13815    14078     +263     
+ Misses       5631     5602      -29

Impacted Files	Coverage Δ
src/io/parquet/write/mod.rs	`58.51% <45.45%> (-1.96%)`	⬇️
src/io/parquet/read/statistics/dictionary.rs	`46.15% <46.15%> (ø)`
src/io/parquet/read/statistics/list.rs	`55.17% <55.17%> (ø)`
src/io/parquet/read/statistics/struct_.rs	`61.11% <61.11%> (ø)`
src/io/parquet/write/primitive/basic.rs	`84.21% <66.66%> (+6.85%)`	⬆️
src/io/parquet/write/primitive/nested.rs	`83.33% <66.66%> (-3.04%)`	⬇️
src/io/parquet/read/statistics/utf8.rs	`81.81% <81.81%> (ø)`
src/io/parquet/read/statistics/mod.rs	`94.07% <94.07%> (+15.13%)`	⬆️
src/io/parquet/read/statistics/binary.rs	`100.00% <100.00%> (+34.48%)`	⬆️
src/io/parquet/read/statistics/boolean.rs	`100.00% <100.00%> (+33.33%)`	⬆️
... and 16 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4860ee5...c9fe387. Read the comment docs.

* Fixed stats * Fixed error in decimal stats * Fixed struct stats

jorgecarleitao added 2 commits April 24, 2022 22:00

Fixed stats

3df39d7

Fixed error in decimal stats

218a031

jorgecarleitao added the feature A new feature label Apr 25, 2022

Fixed struct stats

c9fe387

jorgecarleitao force-pushed the stats2 branch from c06a80e to c9fe387 Compare April 25, 2022 15:45

jorgecarleitao merged commit bb4f7d8 into main Apr 25, 2022

jorgecarleitao deleted the stats2 branch April 25, 2022 19:33

jorgecarleitao changed the title ~~Improved parquet stats deserialization~~ Refactored parquet statistics deserialization Apr 27, 2022

jorgecarleitao added backwards-incompatible and removed feature A new feature labels Apr 27, 2022

ygf11 pushed a commit to ygf11/arrow2 that referenced this pull request Apr 28, 2022

Improved parquet stats deserialization (jorgecarleitao#962)

f0b45eb

* Fixed stats * Fixed error in decimal stats * Fixed struct stats
