
parquet reading hangs when row_group contains more than 2048 rows of data #349

Closed
garyanaplan opened this issue May 25, 2021 · 13 comments · Fixed by #443
Comments

@garyanaplan
Contributor

garyanaplan commented May 25, 2021

Describe the bug
Reading an apparently valid parquet file (which can be read by Java tools such as parquet-tools) from any Rust program will hang. CPU load goes to 100%. Reproduced on both 4.0.0 and 4.1.0. rustc: 1.51.0

To Reproduce
Create a parquet file with at least one row group (e.g. 1). Each row group must have > 2048 rows (e.g. 2049). Run a (Rust) program to read the file and it will hang when visiting the 2048th row. A Java program (parquet-tools) reads the file with no issue.

This test program can be used to produce a file that can then be read using parquet-read to reproduce:

    // Imports assumed for the parquet 4.x API:
    use std::{convert::TryInto, fs, path::Path, sync::Arc};
    use parquet::{
        column::writer::ColumnWriter,
        data_type::ByteArray,
        file::{
            properties::WriterProperties,
            writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
        },
        schema::parser::parse_message_type,
    };

    #[test]
    fn it_writes_data() {
        let path = Path::new("sample.parquet");

        let message_type = "
  message ItHangs {
    REQUIRED INT64 DIM0;
    REQUIRED DOUBLE DIM1;
    REQUIRED BYTE_ARRAY DIM2;
    REQUIRED BOOLEAN DIM3;
  }
";
        let schema = Arc::new(parse_message_type(message_type).unwrap());
        let props = Arc::new(WriterProperties::builder().build());
        let file = fs::File::create(&path).unwrap();
        let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
        for _group in 0..1 {
            let mut row_group_writer = writer.next_row_group().unwrap();
            let values: Vec<i64> = vec![0; 2049];
            let my_values: Vec<i64> = values
                .iter()
                .enumerate()
                .map(|(count, _x)| count.try_into().unwrap())
                .collect();
            let my_double_values: Vec<f64> = values
                .iter()
                .enumerate()
                .map(|(count, _x)| count as f64)
                .collect();
            let my_bool_values: Vec<bool> = values
                .iter()
                .enumerate()
                .map(|(count, _x)| count % 2 == 0)
                .collect();
            let my_ba_values: Vec<ByteArray> = values
                .iter()
                .enumerate()
                .map(|(count, _x)| {
                    let s = format!("{}", count);
                    ByteArray::from(s.as_ref())
                })
                .collect();
            while let Some(mut col_writer) = row_group_writer.next_column().expect("next column") {
                match col_writer {
                    // ... write values to a column writer
                    // You can also use `get_typed_column_writer` method to extract typed writer.
                    ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
                        typed_writer
                            .write_batch(&my_values, None, None)
                            .expect("writing int column");
                    }
                    ColumnWriter::DoubleColumnWriter(ref mut typed_writer) => {
                        typed_writer
                            .write_batch(&my_double_values, None, None)
                            .expect("writing double column");
                    }
                    ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
                        typed_writer
                            .write_batch(&my_bool_values, None, None)
                            .expect("writing bool column");
                    }
                    ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
                        typed_writer
                            .write_batch(&my_ba_values, None, None)
                            .expect("writing bytes column");
                    }
                    _ => {
                        println!("huh:!");
                    }
                }
                row_group_writer
                    .close_column(col_writer)
                    .expect("close column");
            }
            let rg_md = row_group_writer.close().expect("close row group");
            println!("total rows written: {}", rg_md.num_rows());
            writer
                .close_row_group(row_group_writer)
                .expect("close row groups");
        }
        writer.close().expect("close writer");

        let bytes = fs::read(&path).unwrap();
        assert_eq!(&bytes[0..4], &[b'P', b'A', b'R', b'1']);
    }

Expected behavior
The read will complete without hanging.

Additional context
My development system is Mac OS X, so I have only tested on OS X.

rustup reports:
active toolchain

1.51.0-x86_64-apple-darwin (default)
rustc 1.51.0 (2fd73fabe 2021-03-23)

@alamb
Contributor

alamb commented May 25, 2021

Thanks for the report @garyanaplan !

@garyanaplan
Contributor Author

You're welcome.

Extra info: it happens with both debug and release builds, and I reproduced it with 1.51.0 on a Linux system.

@k-stanislawek
Contributor

k-stanislawek commented Jun 9, 2021

I've also just encountered this. The common element with this reproduction is the BOOLEAN field; without the BOOLEAN field it worked fine.

After a quick look at the looping code I found something suspect, but it's just about naming, so I'm not sure it's the actual bug.

This function returns a value that is initialized to the input's length and is called values_to_read.

Meanwhile, the calling site (which I can't find on GitHub because, admittedly, I'm using an older version; I'll add it later) assigns the return value to values_read.

By the way, it loops because after reading 2048 values, this returned value is 0.
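
To illustrate why a zero return turns into a hang rather than an error, here is a minimal, hypothetical sketch (not the actual parquet-rs reader code) of a loop with that shape:

    // Hypothetical illustration: if each call reports 0 values read while rows
    // remain, the row counter never advances and the loop never exits.
    fn read_all_rows(mut read_chunk: impl FnMut() -> usize, total_rows: usize) {
        let mut rows_seen = 0;
        while rows_seen < total_rows {
            let values_read = read_chunk(); // returns 0 once 2048 values have been read
            rows_seen += values_read;       // stays stuck at 2048, so the loop spins at 100% CPU
        }
    }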

@garyanaplan
Contributor Author

Yep. If I update my test to remove BOOLEAN from the schema, the problem goes away. I've done some digging today and it looks like the problem might lie in the generation of the file.

I previously reported that parquet-tools dump would happily process the file. However, when I trimmed the example down to just the BOOLEAN field in the schema and increased the number of rows in the group, I noted the following when dumping:

    value 2039: R:0 D:0 V:true
    value 2040: R:0 D:0 V:false
    value 2041: R:0 D:0 V:true
    value 2042: R:0 D:0 V:false
    value 2043: R:0 D:0 V:true
    value 2044: R:0 D:0 V:false
    value 2045: R:0 D:0 V:true
    value 2046: R:0 D:0 V:false
    value 2047: R:0 D:0 V:true
    value 2048: R:0 D:0 V:false
    value 2049: R:0 D:0 V:false
    value 2050: R:0 D:0 V:false
    value 2051: R:0 D:0 V:false
    value 2052: R:0 D:0 V:false
    value 2053: R:0 D:0 V:false
    value 2054: R:0 D:0 V:false
    value 2055: R:0 D:0 V:false
All the values after 2048 are false, and they stay false until the end of the file.
It looks like the generated file is invalid, so I'm going to poke around there a little next.

@garyanaplan
Contributor Author

More poking reveals that PlainEncoder has a bit_writer with a hard-coded size of 256 bytes (big enough to hold exactly 2048 bits...).
src/encodings/encoding.rs: bit_writer: BitWriter::new(256),
If you adjust that value up or down, you trip the error at different points, so that looks like a contributing factor. I'm now trying to understand the logic around buffer flushing and re-use. I feel like I'm getting close to the root cause.
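
A quick arithmetic check lines up with the symptom: 256 bytes × 8 bits per byte = 2048 one-bit boolean values, so the 2049th value no longer fits. A minimal sketch of the failure mode, assuming the put_value method on parquet's BitWriter (which returns false when the buffer is full); the module path is assumed from the parquet 4.x layout:

    // Illustrative only; simplified from the real PlainEncoder internals.
    use parquet::util::bit_util::BitWriter; // module path assumed for parquet 4.x

    fn demo_silent_drop() {
        let mut bit_writer = BitWriter::new(256); // 256 bytes = 2048 bits of capacity
        for i in 0..2049u64 {
            // One bit per boolean value; the 2049th call returns false.
            if !bit_writer.put_value(i % 2, 1) {
                // The unfixed encoder ignored this case, silently losing the value.
                eprintln!("value {} could not be written", i);
            }
        }
    }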

@garyanaplan
Contributor Author

garyanaplan commented Jun 10, 2021

Looks like that hard-coded value (256) in the bit-writer is the root cause. When writing, if we try to put more than 2048 boolean values, the writer just "ignores" the writes, because the bool encoder silently ignores calls to put_value that return false.

I have a fix for this which works by extending the size of the BitWriter in 256-byte increments; it also checks the return value of put_value in BoolType::encode() and raises an error if the call fails.

Can anyone comment on this approach?

(diff attached)

a.diff.txt
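
Roughly, the idea is the following (a sketch only; the function name, the extend call, and the error handling are illustrative stand-ins, not the actual code in the attached diff):

    use parquet::util::bit_util::BitWriter; // module path assumed, as above

    // Check put_value's return; when the buffer is full, grow it by another
    // 256 bytes and retry instead of silently dropping the value.
    fn put_bool(bit_writer: &mut BitWriter, value: bool) -> Result<(), String> {
        if !bit_writer.put_value(value as u64, 1) {
            bit_writer.extend(256); // hypothetical grow-by-256-bytes step
            if !bit_writer.put_value(value as u64, 1) {
                return Err("unable to put boolean value".to_string());
            }
        }
        Ok(())
    }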

@alamb
Contributor

alamb commented Jun 10, 2021

@garyanaplan -- I think the best way to get feedback on the approach would be to open a pull request

@garyanaplan
Contributor Author

Yeah. I'm not really happy with it, because I don't love the special handling for Booleans via the BitWriter. Just growing the buffer indefinitely seems "wrong", but I think any other kind of fix would be much more extensive/intrusive.

I'll file the PR and see what feedback I get.

garyanaplan added a commit to garyanaplan/arrow-rs that referenced this issue Jun 10, 2021
When writing BOOLEAN data, writing more than 2048 rows of data will
overflow the hard-coded 256-byte buffer set for the bit-writer in the
PlainEncoder. Once this occurs, further attempts to write to the encoder
fail because capacity is exceeded, but the errors are silently ignored.

This fix improves the error detection and reporting at the point of
encoding and modifies the logic for bit_writing (BOOLEANS). The
bit_writer is initially allocated 256 bytes (as at present), then each
time the capacity is exceeded the capacity is incremented by another
256 bytes.

This certainly resolves the current problem, but it's not exactly a
great fix because the capacity of the bit_writer could now grow
substantially.

Other data types seem to have a more sophisticated mechanism for writing
data which doesn't involve growing or having a fixed size buffer. It
would be desirable to make the BOOLEAN type use this same mechanism if
possible, but that level of change is more intrusive and probably
requires greater knowledge of the implementation than I possess.

resolves: apache#349
alamb pushed a commit that referenced this issue Jun 16, 2021
improve BOOLEAN writing logic and report error on encoding fail (#443)

* improve BOOLEAN writing logic and report error on encoding fail

When writing BOOLEAN data, writing more than 2048 rows of data will
overflow the hard-coded 256-byte buffer set for the bit-writer in the
PlainEncoder. Once this occurs, further attempts to write to the encoder
fail because capacity is exceeded, but the errors are silently ignored.

This fix improves the error detection and reporting at the point of
encoding and modifies the logic for bit_writing (BOOLEANS). The
bit_writer is initially allocated 256 bytes (as at present), then each
time the capacity is exceeded the capacity is incremented by another
256 bytes.

This certainly resolves the current problem, but it's not exactly a
great fix because the capacity of the bit_writer could now grow
substantially.

Other data types seem to have a more sophisticated mechanism for writing
data which doesn't involve growing or having a fixed size buffer. It
would be desirable to make the BOOLEAN type use this same mechanism if
possible, but that level of change is more intrusive and probably
requires greater knowledge of the implementation than I possess.

resolves: #349

* only manipulate the bit_writer for BOOLEAN data

Tacky, but I can't think of better way to do this without
specialization.

* better isolation of changes

Remove the byte tracking from the PlainEncoder and use the existing
bytes_written() method in BitWriter.

This is neater.

* add test for boolean writer

The test ensures that we can write > 2048 rows to a parquet file and
that when we read the data back, it finishes without hanging (defined as
taking < 5 seconds); a sketch of this timeout pattern follows below.

If we don't want that extra complexity, we could remove the
thread/channel stuff and just try to read the file and let the test
runner terminate hanging tests.

* fix capacity calculation error in bool encoding

values.len() reports the number of values to be encoded, so it must be
divided by 8 (bits in a byte) to determine the effect on the byte
capacity of the bit_writer.
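
A minimal sketch of the timeout-guarded read described in the "add test for boolean writer" bullet above (the read_sample_file helper is hypothetical; the real test in #443 reads back the file it just wrote):

    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    #[test]
    fn reading_finishes_within_five_seconds() {
        let (sender, receiver) = mpsc::channel();
        thread::spawn(move || {
            // Read back every row of the file written earlier in the test;
            // a hung reader never reaches this send.
            let rows_read = read_sample_file("sample.parquet"); // hypothetical helper
            sender.send(rows_read).unwrap();
        });
        match receiver.recv_timeout(Duration::from_secs(5)) {
            Ok(rows) => assert_eq!(rows, 2049),
            Err(_) => panic!("reading sample.parquet took longer than 5 seconds (hang?)"),
        }
    }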
nevi-me pushed a commit that referenced this issue Jul 5, 2021
* simplify interactions with arrow flight APIs

Initial work to implement some basic traits

* more polishing and introduction of a couple of wrapper types

Some more polishing of the basic code I provided last week.

* More polishing

Add support for representing tickets as base64 encoded strings.

Also: more polishing of Display, etc...

* make BasicAuth accessible

Following merge with master, make sure this is exposed so that
integration tests work.

also: there has been a release since I last looked at this so update the
deprecation warnings.

* fix documentation for ipc_message_from_arrow_schema

TryFrom, not From

* replace deprecated functions in integrations tests with traits

clippy complains about using deprecated functions, so replace them with
the new trait support.

also: fix the trait documentation

* address review comments

 - update deprecated warnings
 - improve TryFrom for DescriptorType
@MichaelBitard

MichaelBitard commented Jul 13, 2021

This still happens with parquet 4.4.0. It may be related to another type; I'll try to reproduce it with a minimal example, but right now it always hangs after reading 2046 rows.

EDIT: I just saw this was fixed on master but not released yet. Is there a way to get this fix into a 4.4.1, or does it rely too much on the 5.0.0 SNAPSHOT? I can try to make a PR based on 4.4.0 if you think you could release it.

@MichaelBitard

MichaelBitard commented Jul 13, 2021

I just updated pqrs to use the latest version of parquet-rs and arrow (commit 6698eed) and the issue still happens with the example you provided, @garyanaplan. It is stuck at 2046 rows read.

To reproduce:

    CurrentRow 2046 0
    {DIM0: 2046, DIM1: 2046.0, DIM2: [50, 48, 52, 54], DIM3: true}

It is stuck in the print_rows function: https://github.com/MichaelBitard/pqrs/blob/master/src/utils.rs#L55

@garyanaplan
Contributor Author

Hi @MichaelBitard,

Unfortunately, the problem was caused when writing the parquet file. I imagine you created your sample.parquet file with the unfixed version; that would mean you would still hit the problem when reading it.

Can you confirm that sample.parquet was created with the fixed code and then verify that it reads OK?

Gary

@MichaelBitard

Oops, you are right, sorry.

If I generate the sample.parquet with the latest version, it no longer hangs during reading.

Thanks for noticing and sorry again!

@alamb
Contributor

alamb commented Jul 14, 2021

Thank you @MichaelBitard for taking the time to report it!
