This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support to round-trip dictionary arrays on parquet #232

Merged
jorgecarleitao merged 2 commits into main from parquet_dict
Aug 4, 2021

jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Jul 29, 2021

Closes #211. Dictionary-encoding is a popular and important technique to reduce memory usage.

Parquet has encodings specifically for this (PlainDictionary and RleDictionary), and Arrow has DictionaryArray for the same purpose. This PR bridges the two by allowing arrow2 to read and write dictionary arrays to and from parquet.
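
For context, here is a minimal sketch of the technique itself in plain Rust. It does not use arrow2 or parquet2, and `dictionary_encode` and `dictionary_decode` are hypothetical names chosen for illustration. It only shows the idea both formats share: store each distinct value once, plus one index per row.

```rust
use std::collections::HashMap;

/// Dictionary-encodes `input` into (distinct values, one index per row).
/// Hypothetical helper for illustration; not part of arrow2 or parquet2.
fn dictionary_encode(input: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut values: Vec<String> = Vec::new();
    // Maps each value already seen to its position in `values`.
    let mut positions: HashMap<&str, u32> = HashMap::new();
    let mut indices = Vec::with_capacity(input.len());
    for &item in input {
        let index = match positions.get(item) {
            Some(&index) => index,
            None => {
                // First occurrence: append to the dictionary.
                let index = values.len() as u32;
                values.push(item.to_string());
                positions.insert(item, index);
                index
            }
        };
        indices.push(index);
    }
    (values, indices)
}

/// Reverses `dictionary_encode` by looking up every index in `values`.
fn dictionary_decode(values: &[String], indices: &[u32]) -> Vec<String> {
    indices.iter().map(|&i| values[i as usize].clone()).collect()
}
```

Encoding `["a", "b", "a", "c", "b"]` this way yields the values `["a", "b", "c"]` and the indices `[0, 1, 0, 2, 1]`; decoding recovers the original rows. The memory saving grows with the number of repeated values.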

The semantics are as follows:

When writing to parquet, it is now possible to use Encoding::PlainDictionary and Encoding::RleDictionary for DataType::Dictionary fields. This causes the array to be dictionary-encoded into parquet (two pages: one with the values, one with the indices).

When reading from parquet, if the arrow schema in the metadata contains a field with DataType::Dictionary and the pages are dictionary-encoded, we read them into a DictionaryArray (all other cases return a "not yet implemented" error).

As before, dictionary-encoded pages without an explicit DataType::Dictionary in the schema are read to their "natural", non-dictionary, DataType.
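
The read-side rules above can be summarized as a small decision function. This is a sketch of the semantics as described in this comment, not the code in this PR: `ReadOutcome` and `read_outcome` are hypothetical names. It also assumes that "all other cases" means a DataType::Dictionary field whose pages are not dictionary-encoded, and that plain pages without DataType::Dictionary keep their existing behavior.

```rust
/// Possible results of reading a column (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum ReadOutcome {
    /// The column is read into a DictionaryArray.
    DictionaryArray,
    /// The column is read into its "natural", non-dictionary DataType.
    NaturalArray,
    /// The combination is rejected with a "not yet implemented" error.
    NotYetImplemented,
}

/// Maps (arrow schema says DataType::Dictionary, pages are dictionary-encoded)
/// to the outcome described in the PR text.
fn read_outcome(schema_is_dictionary: bool, pages_are_dictionary_encoded: bool) -> ReadOutcome {
    match (schema_is_dictionary, pages_are_dictionary_encoded) {
        (true, true) => ReadOutcome::DictionaryArray,
        // Assumption: this is the "all other cases" branch.
        (true, false) => ReadOutcome::NotYetImplemented,
        // Without DataType::Dictionary in the schema, behavior is unchanged.
        (false, _) => ReadOutcome::NaturalArray,
    }
}
```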

@codecov

codecov bot commented Jul 29, 2021

Codecov Report

Merging #232 (063ede8) into main (fa3c2ce) will decrease coverage by 0.05%.
The diff coverage is 66.40%.

@@            Coverage Diff             @@
##             main     #232      +/-   ##
==========================================
- Coverage   76.81%   76.76%   -0.06%     
==========================================
  Files         229      231       +2     
  Lines       19617    19768     +151     
==========================================
+ Hits        15068    15174     +106     
- Misses       4549     4594      +45     
Impacted Files                                 Coverage Δ
src/io/parquet/write/binary/nested.rs          95.65% <ø> (ø)
src/io/parquet/write/boolean/basic.rs          97.56% <ø> (ø)
src/io/parquet/write/boolean/nested.rs         95.65% <ø> (ø)
src/io/parquet/write/fixed_len_bytes.rs        88.46% <ø> (ø)
src/io/parquet/write/primitive/nested.rs       95.65% <ø> (ø)
src/io/parquet/write/utf8/nested.rs            95.65% <ø> (ø)
src/io/parquet/write/dictionary.rs             43.24% <43.24%> (ø)
src/io/parquet/read/fixed_size_binary.rs       43.00% <50.00%> (ø)
src/io/parquet/write/mod.rs                    68.56% <58.62%> (-1.35%) ⬇️
src/io/parquet/read/primitive/dictionary.rs    70.14% <70.14%> (ø)
... and 31 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao jorgecarleitao added the backwards-incompatible and enhancement labels Jul 29, 2021
@jorgecarleitao jorgecarleitao force-pushed the parquet_dict branch 3 times, most recently from 29554cc to 00600e0 Compare August 3, 2021 04:51
@jorgecarleitao jorgecarleitao merged commit 92e2277 into main Aug 4, 2021
@jorgecarleitao jorgecarleitao deleted the parquet_dict branch August 4, 2021 16:47
@jorgecarleitao jorgecarleitao added the feature label and removed the enhancement label Aug 11, 2021
@jorgecarleitao jorgecarleitao changed the title Added experimental support to round-trip dictionary arrays on parquet Added support to round-trip dictionary arrays on parquet Aug 11, 2021
Labels
feature
Development

Successfully merging this pull request may close these issues.

Add support to write dictionary-encoded pages
1 participant