Reading dictionary-encoded string columns with null values from multi-page parquet files yields misaligned data #57

corrego · 2020-05-08T21:00:02Z

Version: Parquet.Net, from at least v3.9.9

Runtime Version: .NET Framework v4.7.2

OS: Windows

Expected behavior

Data should have the correct values across pages

Actual behavior

When reading a dictionary-encoded column with null values from a multi-page file, extra data may be read while decoding the dictionary indexes. The decoding function reads up to Num_values items for each page. When the page contains nulls, however, the number of valid (non-null) elements is smaller than Num_values. The decoding function does not know this, so it keeps generating elements until it runs out of data. These extra elements end up in the lookup table and misalign the data of every page that follows.
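To make the failure mode concrete, here is a toy model in Python. It is not Parquet.Net code: the page layout, the `defined` flags standing in for definition levels, and the trailing padding entries in `indexes` are all assumptions chosen for illustration. It only shows how decoding with Num_values as the limit can leave the first page intact while shifting the values of the next one.

```python
# Toy model of the failure; names and page layout are illustrative only.
DICTIONARY = ["a", "b", "c"]

# num_values counts nulls too. "indexes" models the encoded index stream,
# which here holds two trailing padding entries on the first page.
PAGES = [
    {"num_values": 4, "defined": [1, 0, 1, 0], "indexes": [0, 1, 0, 0]},
    {"num_values": 2, "defined": [1, 1], "indexes": [2, 2]},
]


def read_column(pages, dictionary):
    """Decode each page with num_values as the limit, then place the
    decoded values into the non-null slots given by the defined flags."""
    indexes, defined = [], []
    for page in pages:
        # Behaviour being modelled: the limit ignores nulls, so the
        # padding entries are decoded as if they were real values.
        indexes.extend(page["indexes"][: page["num_values"]])
        defined.extend(page["defined"])
    values = iter(dictionary[i] for i in indexes)
    return [next(values) if d else None for d in defined]


print(read_column(PAGES, DICTIONARY))
# ['a', None, 'b', None, 'a', 'a']
```

The intended result is `['a', None, 'b', None, 'c', 'c']`. In this model the first four rows match, and only the rows of the second page are wrong.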

This bug is especially insidious because the first page of data is correctly loaded.

I'm attaching a tentative fix PR that uses statistics to calculate the number of valid items that should be read.
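The idea behind the fix can be sketched as follows, in Python rather than the library's C#. This is not the code from the PR: `valid_value_count` and its parameters are hypothetical names, the sketch assumes a flat (non-nested) column, and it assumes a null count is available from statistics for the page being decoded. The fallback that counts definition levels is an addition of this sketch for cases where no statistics exist.

```python
def valid_value_count(num_values, null_count=None,
                      definition_levels=None, max_definition_level=1):
    """Return how many dictionary indexes should be decoded for a page.

    Hypothetical helper: num_values includes nulls, so the decode limit
    is reduced by the number of nulls when that number is known.
    """
    if null_count is not None:
        # Statistics are present: subtract the nulls directly.
        return num_values - null_count
    if definition_levels is not None:
        # No statistics: for a flat column, a value is present only when
        # its definition level equals the maximum definition level.
        return sum(1 for level in definition_levels
                   if level == max_definition_level)
    # Nothing known about nulls: fall back to the original limit.
    return num_values


# A page with 4 slots, 2 of them null, should decode only 2 indexes.
limit = valid_value_count(4, null_count=2)
```

Passing this limit to the index decoder instead of Num_values would stop it before any trailing data is consumed.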

* Optimize page reading by not create array from bytes owner
* - fixes misaligned data from dictionary-encoded columns with null values (#57) - added test and test data file

Co-authored-by: Chirag Gupta (AZURE) <chgupt@microsoft.com>
Co-authored-by: Carlos Orrego <carlos.orrego@thetradedesk.com>

corrego pushed a commit to corrego/parquet-dotnet-1 that referenced this issue May 8, 2020

- fixes misaligned data from dictionary-encoded columns with null values (aloneguid#57) - added test and test data file

d6bc38d

corrego mentioned this issue May 8, 2020

Dictionary nulls multi page fix #58

Merged

3 tasks

aloneguid closed this as completed Dec 24, 2021
