Reading dictionary-encoded string columns with null values from multi-page parquet files yields misaligned data #57

Closed
corrego opened this issue May 8, 2020 · 0 comments

Comments

Contributor

corrego commented May 8, 2020

Version: Parquet.Net, from at least v3.9.9

Runtime Version: .NET Framework v4.7.2

OS: Windows

Expected behavior

Values should be read correctly and stay aligned with their rows across all pages.

Actual behavior

When reading a dictionary-encoded column with null values from a multi-page file, the reader can decode more dictionary indexes than a page actually contains. The decoding function reads up to Num_values items for each page, but Num_values counts nulls as well, so in the presence of nulls the number of valid (non-null) elements in the page is smaller than Num_values. Because the decoding function doesn't know this, it keeps generating elements until it runs out of data. These extra elements end up in the lookup table and misalign the data for the pages that follow.
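To make the misalignment concrete, here is a minimal Python simulation of the behavior described above. It is only a sketch and not Parquet.Net code: the page layout, the `read_column` and `decode_page` helpers, and the assumption that each page's encoded index run carries trailing padding are all hypothetical, chosen to show how reading `num_values` indexes instead of the non-null count shifts every page after the first.

```python
from typing import Dict, List, Optional

DICTIONARY = ["a", "b", "c", "d"]

# Hypothetical pages: def_levels has one entry per row (1 = value, 0 = null).
# "encoded" holds the real dictionary indexes followed by padding zeros,
# an assumption standing in for leftover data in the encoded run.
PAGES = [
    {"def_levels": [1, 0, 1, 0], "encoded": [0, 1, 0, 0, 0, 0, 0, 0]},
    {"def_levels": [1, 1, 0, 1], "encoded": [2, 3, 2, 0, 0, 0, 0, 0]},
]


def decode_page(encoded: List[int], limit: int) -> List[int]:
    """Read up to `limit` dictionary indexes from one page."""
    return encoded[:limit]


def read_column(pages: List[Dict[str, List[int]]],
                use_valid_count: bool) -> List[Optional[str]]:
    indexes: List[int] = []
    def_levels: List[int] = []
    for page in pages:
        num_values = len(page["def_levels"])  # includes nulls
        valid = sum(page["def_levels"])       # non-null values only
        limit = valid if use_valid_count else num_values
        indexes.extend(decode_page(page["encoded"], limit))
        def_levels.extend(page["def_levels"])
    # Assemble rows: consume one index per non-null row, across all pages.
    remaining = iter(indexes)
    return [DICTIONARY[next(remaining)] if d == 1 else None
            for d in def_levels]


buggy = read_column(PAGES, use_valid_count=False)
fixed = read_column(PAGES, use_valid_count=True)
print(buggy)
print(fixed)
```

In this sketch the first four rows are identical in both runs, because the surplus indexes decoded from page one are only consumed once the rows of page two are assembled.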

This bug is especially insidious because the first page of data is correctly loaded.

I'm attaching a tentative fix PR that uses statistics to calculate the number of valid items that should be read.
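The calculation behind that idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in the PR: `PageStats` and `valid_value_count` are hypothetical names, and the sketch assumes the statistics expose a null count (the Parquet format defines a `null_count` statistics field) that can be subtracted from the page's value count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical stand-in for the statistics attached to a page or chunk."""
    null_count: Optional[int]


def valid_value_count(num_values: int, stats: Optional[PageStats]) -> int:
    """Number of non-null dictionary indexes the decoder should read."""
    if stats is None or stats.null_count is None:
        # Assumption: with no statistics available, fall back to num_values,
        # which matches the behavior described in this issue.
        return num_values
    return max(0, num_values - stats.null_count)


print(valid_value_count(6, PageStats(null_count=2)))
```

The fallback branch means the sketch only helps for files whose writer recorded a null count.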

corrego pushed a commit to corrego/parquet-dotnet-1 that referenced this issue May 8, 2020
aloneguid pushed a commit that referenced this issue Jun 16, 2020
* Optimize page reading by not create array from bytes owner

* - fixes misaligned data from dictionary-encoded columns with null values (#57)

- added test and test data file

Co-authored-by: Chirag Gupta (AZURE) <chgupt@microsoft.com>
Co-authored-by: Carlos Orrego <carlos.orrego@thetradedesk.com>