corrego pushed a commit to corrego/parquet-dotnet-1 that referenced this issue on May 8, 2020:

* Optimize page reading by not creating an array from the bytes owner
* Fixes misaligned data from dictionary-encoded columns with null values (#57); added a test and a test data file

Co-authored-by: Chirag Gupta (AZURE) <chgupt@microsoft.com>
Co-authored-by: Carlos Orrego <carlos.orrego@thetradedesk.com>
Version: Parquet.Net v3.9.9 (at least)
Runtime Version: .NET Framework v4.7.2
OS: Windows
Expected behavior
Data should have the correct values across pages.
Actual behavior
When reading a dictionary-encoded column from a multi-page file with null values, there is a chance extra data will be read when decoding dictionary indexes. This is because the decoding function will read up to `Num_values` items for that page. However, in the presence of nulls, the total number of valid elements is smaller than `Num_values`; because the decoding function doesn't know this, it continues generating elements until it runs out of data, putting these extra elements in the lookup table and causing data misalignment issues for the pages that follow.

This bug is especially insidious because the first page of data is loaded correctly.
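For illustration, here is a minimal C# sketch of the failure mode; the decoder shape and the `ReadIndex` helper are hypothetical stand-ins, not Parquet.Net's actual code:

```csharp
using System.Collections.Generic;
using System.IO;

// A minimal sketch of the failure mode described above.
static class BuggyDecoder
{
    public static List<int> DecodeIndexes(BinaryReader reader, int numValues)
    {
        var indexes = new List<int>(numValues);

        // BUG: numValues comes from the page header and counts null slots
        // too, but the index stream only holds entries for non-null values.
        // With nulls present, this loop runs past the current page's indexes
        // and decodes bytes belonging to the next page, misaligning every
        // page that follows.
        while (indexes.Count < numValues &&
               reader.BaseStream.Position < reader.BaseStream.Length)
        {
            indexes.Add(ReadIndex(reader));
        }
        return indexes;
    }

    // Hypothetical index reader; the real stream is RLE/bit-packed.
    private static int ReadIndex(BinaryReader reader) => reader.ReadInt32();
}
```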
I'm attaching a tentative fix PR that uses statistics to calculate the number of valid items that should be read.
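As a rough sketch of that approach (the `PageInfo` type is hypothetical; its fields mirror the Parquet data page header's `num_values` and the statistics' `null_count`):

```csharp
// Sketch of the statistics-based fix, assuming the page's statistics
// carry a null count. PageInfo is a hypothetical type, not Parquet.Net's API.
class PageInfo
{
    public int NumValues { get; set; }    // from the page header; counts nulls too
    public long? NullCount { get; set; }  // from the page statistics, if present
}

static class DictionaryIndexBound
{
    // Only non-null slots carry a dictionary index, so cap decoding at
    // num_values - null_count instead of num_values.
    public static int ValidValueCount(PageInfo page) =>
        page.NullCount.HasValue
            ? page.NumValues - (int)page.NullCount.Value
            : page.NumValues;  // no statistics: fall back to the old bound
}
```

The fallback matters because statistics are optional in the Parquet format, so a decoder cannot rely on `null_count` being present in every file.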