ArrowArrayReader Reads Too Many Values From Bit-Packed Runs #1111

Closed
tustvold opened this issue Dec 30, 2021 · 8 comments · Fixed by #1130

@tustvold commented Dec 30, 2021

Describe the bug

Originally reported in apache/datafusion#1441 and encountered again in #1110, ParquetFileArrowReader appears to read incorrect data for string columns that contain nulls.

In particular the conditions required are for the column to be nullable, contain nulls, and multiple row groups.

To Reproduce

Read simple_strings.parquet.zip with the following code

// Imports needed to run this snippet outside the crate's test module:
use std::convert::TryFrom;
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::serialized_reader::SerializedFileReader;

#[test]
fn test_read_strings() {
    let testdata = arrow::util::test_util::parquet_test_data();
    let path = format!("{}/simple_strings.parquet", testdata);
    let parquet_file_reader =
        SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
    let record_batch_reader = arrow_reader
        .get_record_reader(60)
        .expect("Failed to read into array!");

    let batches = record_batch_reader
        .collect::<arrow::error::Result<Vec<_>>>()
        .unwrap();

    assert_eq!(batches.len(), 1);
    let batch = batches.into_iter().next().unwrap();
    assert_eq!(batch.num_rows(), 6);

    let strings = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();

    let strings: Vec<_> = strings.iter().collect();

    assert_eq!(
        &strings,
        &[
            None,
            Some("-1685637712"),
            Some("512814980"),
            Some("868743207"),
            None,
            Some("-1001940778")
        ]
    )
}

Fails with

thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
  left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
 right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9

For comparison

$ python
> import duckdb
> duckdb.query("select * from 'simple_strings.parquet'").fetchall()
[(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]

The file consists of two row groups, each with 3 rows, and was generated using #1110

Expected behavior

The test should pass

@tustvold tustvold added the bug label Dec 30, 2021
@tustvold commented Dec 30, 2021

Adding a print statement to VariableLenDictionaryDecoder::new shows that it is being created twice, with a num_values of 3, i.e. the number of rows in each row group.

This is the num_values field from Page, which confusingly is the number of values including nulls. This value is then used to determine how many values to read from RleDecoder for this page.

Now, a somewhat strange quirk of the hybrid encoding is that bit-packed "runs" are always multiples of 8 in length. This means that if the final run of a page is bit-packed, as opposed to RLE, it will be zero-padded to length. Unfortunately the parquet designers opted not to store the actual length for a packed run, but the length / 8. This means the length of the final packed run of a page is not actually knowable...
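For illustration, here is a minimal sketch of that header arithmetic (not the parquet crate's actual decoder, just the format as described above): a bit-packed run only records its group count, so a decoder must assume every group is full.

fn bit_packed_run_len(header: u64) -> Option<usize> {
    // LSB == 1 marks a bit-packed run; the remaining bits store the
    // number of 8-value groups, not the number of values.
    if header & 1 == 1 {
        let num_groups = (header >> 1) as usize;
        // The decoder can only assume all groups are full: a final run
        // holding, say, 3 real keys is still written as one group of 8,
        // and the zero padding is indistinguishable from genuine
        // occurrences of dictionary key 0.
        Some(num_groups * 8)
    } else {
        // RLE run: the length is stored exactly as header >> 1.
        None
    }
}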

This is where the issue arises. VariableLenDictionaryDecoder thinks it has more actual values than it does, as it is being fed the value count for the page, which includes nulls that aren't actually encoded. It therefore asks RleDecoder for more keys than are really present, and as RleDecoder contains a zero-padded final run, it returns too many values, which has the effect of "shifting" the string values in the final result.
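A hypothetical worked example of the shift (the keys and dictionary here are made up to mirror the failing test above):

fn main() {
    // Two real dictionary keys [0, 1], zero-padded to a full group of 8
    // by the bit-packed encoding:
    let padded_keys = [0usize, 1, 0, 0, 0, 0, 0, 0];
    let dict = ["-1685637712", "512814980"];

    // Requesting 3 keys (the row count, which includes a null) instead
    // of 2 (the real value count) pulls in one padding zero, which
    // decodes as a spurious repeat of dict[0]:
    let decoded: Vec<_> = padded_keys[..3].iter().map(|&k| dict[k]).collect();
    assert_eq!(decoded, ["-1685637712", "512814980", "-1685637712"]);
    // Every value decoded after this point is shifted by one slot.
}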

The fix should be a case of making whatever calls ValueDecoder::read_value_bytes only request the number of values that the page can actually be expected to yield. This is what ColumnReaderImpl handles for the non-ArrowArrayReader ArrayReader implementations. I need to do some digging to see how feasible this is with the design of ArrowArrayReader.

@tustvold commented Dec 30, 2021

So I'm not sure there is an easy way to fix this... ArrowArrayReader flattens all the pages from all the column chunks into iterators and then feeds these to CompositeValueDecoder, which decodes the levels and values independently. This makes it a non-trivial change to decode the levels and corresponding values from a given page in lock-step, which I believe is necessary in order to decode the correct number.

Rather than spending time re-working ArrowArrayReader in order to fix this bug, I'm personally going to focus on getting #1082 and the PRs it builds on polished up. This provides an alternative implementation for reading byte arrays, that builds on the existing ColumnReaderImpl and RecordReader logic and so, much like PrimitiveArrayReader, does not run into this bug. My hope is that by being both faster, and duplicating less code, it will make sense to swap out ArrowArrayReader and therefore fix this bug for anything not using ArrowArrayReader explicitly.

If someone else wishes to work on fixing ArrowArrayReader that would be brilliant, but I'm going to focus my efforts elsewhere.

FYI @yordan-pavlov @alamb

Edit: In the short term, switching back to ComplexObjectArrayReader does fix the bug, but it represents a non-trivial performance regression (up to 6x) and so I'm somewhat loath to suggest it.

@tustvold tustvold changed the title ArrowArrayReader Incorrect Data ArrowArrayReader Reads Too Many Values From Bit-Packed Runs Dec 30, 2021
@yordan-pavlov commented Dec 30, 2021

@tustvold's work on making the "old" arrow reader faster looks promising, plus I've hit a wall in making the ArrowArrayReader faster in some cases, so I agree that it makes more sense to focus on finishing the work in #1082 than to try to fix the ArrowArrayReader.

ArrowArrayReader does use def levels to calculate how many values to read, see https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L576, but reading def levels is done independently of reading values (one of the main characteristics of ArrowArrayReader), so if the num_values field from Page does indeed include NULL values, then the entire idea of ArrowArrayReader falls apart and it won't work correctly. It makes me wonder how the unit tests pass, though; I will have a look at this in the next few days.

One reason I implemented ArrowArrayReader as a new struct was so that it could easily be swapped out if necessary, and I guess that time has come. Yes, it will result in a significant reduction in the performance of reading string arrays, but correctness is more important than speed, and it would only be until #1082 is finished.

@yordan-pavlov commented:
Here is what I've found so far:

it's getting pretty late now, but tomorrow I will try to write the missing test (that doesn't rely on an external parquet file) to reproduce the issue with VariableLenDictionaryDecoder / RleDecoder and also think on a short-term fix

@alamb commented Dec 31, 2021

> it's getting pretty late now, but tomorrow I will try to write the missing test (that doesn't rely on an external parquet file) to reproduce the issue with VariableLenDictionaryDecoder / RleDecoder and also think on a short-term fix

A short term fix would be nice so that we can get correct answers from the existing code, while @tustvold works on the longer term / better fix

@yordan-pavlov commented:
I have been able to reproduce the issue where RleDecoder returns more keys than values (as explained by @tustvold above) by adding a test very similar to the existing test_arrow_array_reader_string, but using dictionary encoding instead of plain. Next, I will be looking for a short-term fix.

Here is some sample output from the test:

running 1 test
page num_values: 100, values.len(): 33
page num_values: 100, values.len(): 38
VariableLenPlainDecoder::new, num_values: 9

---------- reading a batch of 50 values ----------
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 14
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 86
// ok so far, 33 actual values - 14 values read = 19 values still left in first page

---------- reading a batch of 100 values ----------
VariableLenPlainDecoder::new, num_values: 10
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 86, num_values: 37
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 26, self.num_values: 0
// this is a problem - only 19 values were left in the first page, but 26 values have been read

VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 89
thread 'arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string' panicked at 'assertion failed: `(left == right)`
  left: `"H"`,
 right: `"He"`', parquet\src\arrow\arrow_array_reader.rs:1745:17

@yordan-pavlov commented:
UPDATE: for the short-term fix, the only option I can think of is (when def levels are present) to count the number of actual values here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L393 before creating the value reader, and to use this count instead of num_values.
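A rough sketch of that idea, assuming the decoded definition levels and the column's max definition level are in scope at that point (the names below are illustrative, not the actual variables in arrow_array_reader.rs):

// A definition level equal to the column's max def level marks a
// non-null leaf value; anything lower denotes a null.
fn non_null_count(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}
// Feeding this count, rather than Page's num_values (which includes
// nulls), to VariableLenDictionaryDecoder::new means it never requests
// padding keys from RleDecoder.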

This then makes the new test (using dictionary-encoded pages) pass. Notice how, in the test output below, the value of num_values in VariableLenDictionaryDecoder is the actual number of values, rather than a count that includes null values:

running 1 test
page num_values: 100, values.len(): 25
page num_values: 100, values.len(): 31
VariableLenPlainDecoder::new, num_values: 10
---------- reading a batch of 50 values ----------
VariableLenDictionaryDecoder::new, num_values: 25
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 25, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 14
---------- reading a batch of 100 values ----------
VariableLenPlainDecoder::new, num_values: 10
VariableLenDictionaryDecoder::new, num_values: 31
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 31
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 17
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 31, num_values: 17
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 17, self.num_values: 14
---------- reading a batch of 100 values ----------
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 14
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
test arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 471 filtered out; finished in 0.01s

Tomorrow I will be checking the impact on performance and possibly create a pull request for the new test plus short-term fix.

@yordan-pavlov commented:
UPDATE: performance degradation after the fix is actually not so bad - only between 3% and 8%:

arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old
                        time:   [1.4393 ms 1.4485 ms 1.4585 ms]
                        change: [-9.2499% -3.6008% +1.8053%] (p = 0.23 > 0.05)
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new
                        time:   [594.37 us 604.91 us 618.75 us]
                        change: [+5.1998% +8.7402% +12.498%] (p = 0.00 < 0.05)

arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old
                        time:   [1.1766 ms 1.1857 ms 1.1950 ms]
                        change: [-4.5337% -2.8033% -1.0591%] (p = 0.00 < 0.05)
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new
                        time:   [467.90 us 471.83 us 476.11 us]
                        change: [-0.8465% +3.1251% +7.0623%] (p = 0.12 > 0.05)

I will try to submit a PR for this later today.
