Expose Dictionary to reader #1270

zeevm · 2022-02-04T13:28:45Z

Many Parquet query engines have optimizations that rely on dictionary-encoded columns, e.g. for selections with a filter.
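To illustrate the kind of optimization meant here, a minimal sketch in plain Rust: the filter predicate is evaluated once per distinct dictionary value, and rows are then selected by index lookup. The function name `filter_rows` and the string-typed dictionary are assumptions for illustration, not part of any Parquet library.

```rust
/// Return the row numbers whose dictionary value satisfies `pred`.
/// `dictionary` holds the distinct values, `indices` one dictionary index per row.
fn filter_rows(dictionary: &[&str], indices: &[u32], pred: impl Fn(&str) -> bool) -> Vec<usize> {
    // Evaluate the predicate once per distinct value, not once per row.
    let matches: Vec<bool> = dictionary.iter().map(|&v| pred(v)).collect();
    indices
        .iter()
        .enumerate()
        // An out-of-range index is treated as "no match" in this sketch.
        .filter_map(|(row, &i)| {
            let hit = matches.get(i as usize).copied().unwrap_or(false);
            hit.then_some(row)
        })
        .collect()
}
```

With a small dictionary and many rows, the predicate cost is paid per distinct value while the per-row work is a single table lookup.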

The Rust implementation of the Parquet reader makes it difficult to read dictionary-encoded values because it doesn't expose the RLE decoder to reader code. A reader that wants to work with dictionary values therefore has to re-implement an RLE decoder to read values from dictionary-encoded data pages.

This can be easily addressed by making the RLE code public outside the crate.
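For a sense of what such a re-implementation involves, here is a rough sketch of a decoder for the RLE/bit-packed hybrid encoding as described in the Parquet format specification. It is not the crate's decoder: `decode_hybrid` is a hypothetical name, and the sketch assumes the caller has already read the one-byte bit width that precedes the runs in a dictionary-encoded data page.

```rust
/// Decode up to `num_values` indices from Parquet's RLE/bit-packed hybrid
/// encoding. `data` starts at the first run header (after the bit-width byte).
/// Returns `None` if the input is truncated or malformed.
fn decode_hybrid(data: &[u8], bit_width: usize, num_values: usize) -> Option<Vec<u32>> {
    if bit_width > 32 {
        return None;
    }
    let mut out = Vec::with_capacity(num_values);
    let mut pos = 0usize;
    while out.len() < num_values {
        // Each run starts with a ULEB128 varint header.
        let mut header = 0u64;
        let mut shift = 0u32;
        loop {
            let byte = *data.get(pos)?;
            pos += 1;
            header |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                break;
            }
            shift += 7;
            if shift > 63 {
                return None;
            }
        }
        let len = (header >> 1) as usize;
        let remaining = num_values - out.len();
        if header & 1 == 0 {
            // RLE run: one value stored in ceil(bit_width / 8) little-endian bytes.
            let width = (bit_width + 7) / 8;
            let bytes = data.get(pos..pos.checked_add(width)?)?;
            pos += width;
            let value = bytes.iter().rev().fold(0u32, |acc, &b| (acc << 8) | u32::from(b));
            out.extend(std::iter::repeat(value).take(len.min(remaining)));
        } else {
            // Bit-packed run: `len` groups of 8 values, packed LSB first.
            let byte_len = len.checked_mul(bit_width)?;
            let bytes = data.get(pos..pos.checked_add(byte_len)?)?;
            pos += byte_len;
            let count = len.checked_mul(8)?.min(remaining);
            for i in 0..count {
                let mut value = 0u32;
                for bit in 0..bit_width {
                    let idx = i * bit_width + bit;
                    if (bytes[idx / 8] >> (idx % 8)) & 1 == 1 {
                        value |= 1 << bit;
                    }
                }
                out.push(value);
            }
        }
    }
    Some(out)
}
```

The bit-by-bit unpacking is written for readability rather than speed; a production decoder would unpack whole groups at once.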

alamb · 2022-02-04T19:47:08Z

FYI I think @tustvold has some plans to contribute functionality that may be similar to the parquet crate directly in #1191

tustvold · 2022-02-04T20:24:45Z

I'd be very interested in any details you can share about your use case, in particular whether there is any way we might be able to combine efforts in this space. The proposal in #1191 is just that, and any input you'd be willing to provide would be most appreciated 👍

If you're using arrow, I'd also potentially draw your attention to #1180 which will preserve the dictionary encoding present in the parquet file for dictionary arrays, and is slated for inclusion as the default behaviour in arrow 9.

zeevm · 2022-02-05T08:18:31Z

@alamb @tustvold My use case is a proprietary analytical DB engine. It has its own proprietary storage format but also allows running queries against external formats like Parquet.

As it already has a highly optimized scan capability of dictionary encoded data, all I want is for it to have access to the raw Parquet dictionary.

I don't want to take a dependency on Arrow array for that, as I'm not using Arrow at all. I don't deserialize Parquet into Arrow, since the engine I'm working on has its own in-memory representation (I don't even build Arrow with Parquet).

tustvold · 2022-02-05T08:41:45Z

Thank you for taking the time to respond, I figured that might be the case, but thought it couldn't hurt to check.

I'm sure you're aware, but just as a heads up if you're reading the data directly, the RLE encoding is not length preserving #1111 (comment), and a column chunk may not be consistently dictionary encoded (e.g. if the dictionary gets too large).
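One place the length caveat can bite, assuming the remark refers to the bit-packed runs of the RLE/bit-packed hybrid encoding: those runs are stored in groups of eight values, so a run written for five logical values still declares eight slots, and the reader has to trim the result using the value count from the page header. A small sketch, with hypothetical helper names:

```rust
/// Run header (before varint encoding) for a bit-packed run that has to hold
/// `logical_values` values: the group count shifted left, with the low bit set.
fn bit_packed_header(logical_values: usize) -> u64 {
    let groups = (logical_values + 7) / 8; // round up to whole groups of 8
    ((groups as u64) << 1) | 1
}

/// Number of value slots a decoder sees for that header.
fn declared_slots(header: u64) -> usize {
    ((header >> 1) as usize) * 8
}

/// Drop the padding slots, given the value count from the page header.
fn trim_padding(mut decoded: Vec<u32>, num_values: usize) -> Vec<u32> {
    decoded.truncate(num_values);
    decoded
}
```

The encoded bytes alone do not say how many of the declared slots are real values, which is why the count has to come from elsewhere.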

FWIW there were some generics added in #1041 and evolved since to aid decoding columns to custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀

zeevm · 2022-02-05T10:55:59Z

@tustvold My engine assumes a column is either fully dictionary encoded or not. So for my use case I first scan the headers of all pages in a column chunk to assert they're all dictionary encoded. If any of them is not (other than the dictionary page itself, of course), I treat the column as not dictionary encoded, meaning I read it with a ColumnReader instead of a PageReader and let the library handle the variously encoded pages.
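The check described above could be sketched roughly as follows. The types here are simplified, hypothetical stand-ins for parsed page headers, not the parquet crate's own types; only the encoding names follow the Parquet format.

```rust
/// Page encodings relevant to this sketch (names follow the Parquet format).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
}

/// Hypothetical, simplified stand-in for a parsed page header.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PageHeader {
    Dictionary,
    Data(Encoding),
}

/// Which read strategy to use for a column chunk.
#[derive(Debug, PartialEq)]
enum ReadPath {
    /// Every data page is dictionary encoded: read raw pages and indices.
    RawDictionary,
    /// Mixed or non-dictionary encodings: let the library decode the values.
    ColumnReader,
}

fn choose_read_path(headers: &[PageHeader]) -> ReadPath {
    let has_dictionary = headers.iter().any(|h| matches!(h, PageHeader::Dictionary));
    // The dictionary page itself is exempt from the encoding check.
    let all_data_pages_dictionary = headers.iter().all(|h| match h {
        PageHeader::Dictionary => true,
        PageHeader::Data(enc) => {
            matches!(enc, Encoding::PlainDictionary | Encoding::RleDictionary)
        }
    });
    if has_dictionary && all_data_pages_dictionary {
        ReadPath::RawDictionary
    } else {
        ReadPath::ColumnReader
    }
}
```

A chunk that fell back to plain encoding partway through (for example because the dictionary grew too large) takes the `ColumnReader` path in this sketch.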

zeevm added the enhancement label Feb 4, 2022

zeevm mentioned this issue Feb 4, 2022

Make rle decoder public under experimental feature #1271

Merged

sunchao closed this as completed in #1271 Feb 6, 2022
