ArrowArrayReader Reads Too Many Values From Bit-Packed Runs #1111
Comments
A somewhat strange quirk of the hybrid encoding is that bit-packed "runs" are always multiples of 8 in length. This means that if the final run of a page is bit-packed, as opposed to RLE, it will be zero-padded up to that length. Unfortunately the parquet designers opted not to store the actual length of a packed run, but only the length / 8, so the true length of the final packed run of a page is not actually knowable from the run header alone. This is where the issue arises: the padding gets decoded as if it were real values. The fix should be a case of making whatever calls the decoder aware of how many values actually remain in the page.
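To make the quirk concrete, here is a minimal sketch of how a run header in the RLE / bit-packed hybrid encoding is interpreted; the Run enum and function below are illustrative, not the arrow-rs decoder itself:

```rust
/// Illustrative model of a run in the parquet RLE / bit-packed hybrid encoding.
enum Run {
    /// `count` repetitions of a single value; the exact count is stored.
    Rle { count: usize },
    /// `count` values, always a multiple of 8; the trailing slots of the last
    /// group may be zero padding that is indistinguishable from real values.
    BitPacked { count: usize },
}

fn parse_run_header(header: u64) -> Run {
    if header & 1 == 0 {
        // RLE run: the remaining bits hold the exact number of values
        Run::Rle { count: (header >> 1) as usize }
    } else {
        // Bit-packed run: the remaining bits hold the number of 8-value groups,
        // so the value count is only known up to a multiple of 8
        Run::BitPacked { count: (header >> 1) as usize * 8 }
    }
}

fn main() {
    // 11 remaining values encoded as a bit-packed run need 2 groups (16 slots);
    // the last 5 slots are padding that the decoder cannot tell apart from data.
    match parse_run_header((2u64 << 1) | 1) {
        Run::BitPacked { count } => assert_eq!(count, 16),
        _ => unreachable!(),
    }
}
```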
So I'm not sure there is an easy way to fix this... Rather than spending time re-working ArrowArrayReader, it may make more sense to focus on the "old" reader. If someone else wishes to work on fixing this, though, please go ahead. FYI @yordan-pavlov @alamb. Edit: in the short term, switching back to the old array reader is an option.
@tustvold's work on making the "old" arrow reader faster looks promising, plus I've hit a wall in making the new ArrowArrayReader any faster.
One reason why I implemented ArrowArrayReader was ...
Here is what I've found so far:
It's getting pretty late now, but tomorrow I will try to write the missing test (one that doesn't rely on an external parquet file) to reproduce the issue with ArrowArrayReader.
A short-term fix would be nice so that we can get correct answers from the existing code, while @tustvold works on the longer-term / better fix.
I have been able to reproduce the issue where ArrowArrayReader reads too many values. Here is some sample output from the test:

running 1 test
---------- reading a batch of 50 values ----------
---------- reading a batch of 100 values ----------
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 11
UPDATE: for the short-term fix, the only option I can think of is (when def levels are present) to count the number of actual values here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L393 before creating the value reader, and to use this count instead of num_values. This then makes the new test (using dictionary-encoded pages) pass - see the test output below:

running 1 test
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 471 filtered out; finished in 0.01s

Tomorrow I will be checking the impact on performance and will possibly create a pull request for the new test plus the short-term fix.
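A minimal sketch of that idea, counting how many definition levels correspond to actual (non-null) values; the names below are illustrative, not the actual code at the linked line:

```rust
/// Count how many slots in a page carry an actual value, i.e. how many
/// definition levels equal the column's maximum definition level.
/// Null slots have a definition level below the maximum and consume no value.
fn count_actual_values(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}

fn main() {
    // 11 level slots but only 8 real values: the value decoder should be asked
    // for 8 values, not 11, otherwise it reads into the run's zero padding.
    let def_levels: &[i16] = &[1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1];
    assert_eq!(count_actual_values(def_levels, 1), 8);
}
```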
UPDATE: performance degradation after the fix is actually not so bad - only between 3% and 8%.
I will try to submit a PR for this later today.
Describe the bug
Originally reported in apache/datafusion#1441 and encountered again in #1110, ParquetFileArrowReader appears to read incorrect data for string columns that contain nulls. In particular, the conditions required are that the column is nullable, actually contains nulls, and the file has multiple row groups.
To Reproduce
Read simple_strings.parquet.zip with the following code:
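(The snippet itself did not survive this export; a minimal sketch of this kind of reproduction, assuming the attached file has been unzipped next to the code and using the ParquetFileArrowReader / ArrowReader API as it existed at the time, might look like the following.)

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes simple_strings.parquet has been extracted from the attached zip
    let file = File::open("simple_strings.parquet")?;
    let file_reader = Arc::new(SerializedFileReader::new(file)?);
    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);

    // Read the file back and print the string column of each batch
    for batch in arrow_reader.get_record_reader(1024)? {
        let batch = batch?;
        println!("{:?}", batch.column(0));
    }
    Ok(())
}
```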
Fails with
For comparison
The file consists of two row groups, each with 3 rows, and was generated using #1110.
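For reference, here is a sketch (not the actual code from #1110) of how a file with this shape could be produced with ArrowWriter; the column name, values, and the explicit max_row_group_size are assumptions:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single nullable string column, matching the conditions described above
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Utf8, true)]));

    // Cap row groups at 3 rows so the 6 rows below land in two row groups
    let props = WriterProperties::builder()
        .set_max_row_group_size(3)
        .build();

    let file = File::create("simple_strings.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Two batches of 3 rows each, both containing nulls
    for values in [
        vec![Some("a"), None, Some("c")],
        vec![None, Some("e"), Some("f")],
    ] {
        let array: ArrayRef = Arc::new(StringArray::from(values));
        let batch = RecordBatch::try_new(schema.clone(), vec![array])?;
        writer.write(&batch)?;
    }
    writer.close()?;
    Ok(())
}
```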
Expected behavior
The test should pass