parquet reading hangs when row_group contains more than 2048 rows of data #349
Comments
Thanks for the report @garyanaplan!
You're welcome. Extra info: it happens with both debug and release builds, and I reproduced it with Rust 1.51.0 on a Linux system.
I've also just encountered it. The common element with this reproduction is a BOOLEAN field; without the BOOLEAN field it worked. After a quick investigation of the looping code I've found something suspect, but it's just about naming, so I'm not sure it's an actual bug. The function at arrow-rs, parquet/src/util/bit_util.rs, line 588 (commit 0f55b82) returns a value that is initialized to the input's length.
Meanwhile, the calling site (which I can't find on GitHub because, admittedly, I'm using an older version; I'll add it later) assigns the return value to … By the way, it loops because after reading 2048 values this returned value is 0.
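As an illustration (hypothetical names, not the actual arrow-rs code), a read loop of that shape spins forever once the per-call count is stuck at 0:

```rust
// Sketch only: `BoolColumnReader` and `read_batch` are stand-ins for the real
// column-reader API. If `read_batch` keeps returning 0 instead of an error or
// an end-of-data signal, `total` never reaches `num_rows` and the loop never
// exits, which matches the 100% CPU hang reported above.
trait BoolColumnReader {
    /// Returns the number of values decoded into `out`.
    fn read_batch(&mut self, out: &mut [bool]) -> usize;
}

fn read_all(reader: &mut impl BoolColumnReader, num_rows: usize) -> Vec<bool> {
    let mut values = vec![false; num_rows];
    let mut total = 0;
    while total < num_rows {
        let read = reader.read_batch(&mut values[total..]);
        total += read; // a permanent 0 here means an infinite loop
    }
    values
}
```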
Yep. If I update my test to remove BOOLEAN from the schema, the problem goes away. I've done some digging today and it looks like the problem might lie in the generation of the file. I previously reported that parquet-tools dump would happily process the file; however, when I trimmed the example down to just a BOOLEAN field in the schema and increased the number of rows in the group, the parquet-tools dump output for that file looked wrong.
More poking reveals that PlainEncoder has a bit_writer with a hard-coded size of 256 bytes, which is only big enough to hold 2048 bits (i.e. 2048 boolean values).
It looks like that hard-coded value (256) in the bit-writer is the root cause. When writing, if we try to put more than 2048 boolean values, the writer simply "ignores" the extra writes, because the bool encoder silently discards calls to put_value that return false. I have a fix for this which works by extending the size of the BitWriter in 256-byte increments, and also by checking the return value of put_value in BoolType::encode() and raising an error if the call fails. Can anyone comment on this approach? (diff attached)
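A rough sketch of that approach, using simplified stand-in types rather than the real BitWriter/PlainEncoder: check each boolean put, grow by 256 bytes when the buffer is full, and report an error instead of silently dropping values.

```rust
// Hypothetical buffer type, not the actual arrow-rs BitWriter.
struct BitBuffer {
    bytes: Vec<u8>,
    bit_len: usize,
}

impl BitBuffer {
    fn new() -> Self {
        // Same initial capacity as the current PlainEncoder: 256 bytes = 2048 bits.
        BitBuffer { bytes: vec![0u8; 256], bit_len: 0 }
    }

    /// Returns false when the buffer is full (the condition the original
    /// encoder ignored).
    fn put_bool(&mut self, v: bool) -> bool {
        if self.bit_len >= self.bytes.len() * 8 {
            return false;
        }
        if v {
            self.bytes[self.bit_len / 8] |= 1 << (self.bit_len % 8);
        }
        self.bit_len += 1;
        true
    }

    /// Grow the backing storage by a fixed increment.
    fn grow(&mut self, extra_bytes: usize) {
        self.bytes.resize(self.bytes.len() + extra_bytes, 0);
    }
}

fn encode_bools(buf: &mut BitBuffer, values: &[bool]) -> Result<(), String> {
    for &v in values {
        if !buf.put_bool(v) {
            // Instead of losing the value, grow by another 256 bytes and retry;
            // a second failure is reported as an error rather than ignored.
            buf.grow(256);
            if !buf.put_bool(v) {
                return Err("unable to encode boolean value".to_string());
            }
        }
    }
    Ok(())
}
```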
@garyanaplan -- I think the best way to get feedback on the approach would be to open a pull request.
Yeah. I'm not really happy with it, because I don't love the special handling for Booleans via the BitWriter. Just growing the buffer indefinitely seems "wrong", but I think any other kind of fix would be much more extensive/intrusive. I'll file the PR and see what feedback I get.
When writing BOOLEAN data, writing more than 2048 rows will overflow the hard-coded 256-byte buffer set for the bit-writer in the PlainEncoder. Once this occurs, further attempts to write to the encoder fail because capacity is exceeded, but the errors are silently ignored.

This fix improves the error detection and reporting at the point of encoding and modifies the logic for bit-writing (BOOLEANs). The bit_writer is initially allocated 256 bytes (as at present); then, each time the capacity is exceeded, the capacity is incremented by another 256 bytes. This certainly resolves the current problem, but it's not exactly a great fix because the capacity of the bit_writer could now grow substantially. Other data types seem to have a more sophisticated mechanism for writing data which doesn't involve growing or having a fixed-size buffer. It would be desirable to make the BOOLEAN type use this same mechanism if possible, but that level of change is more intrusive and probably requires greater knowledge of the implementation than I possess.

resolves: apache#349
improve BOOLEAN writing logic and report error on encoding fail (#443)

* improve BOOLEAN writing logic and report error on encoding fail

  When writing BOOLEAN data, writing more than 2048 rows will overflow the hard-coded 256-byte buffer set for the bit-writer in the PlainEncoder. Once this occurs, further attempts to write to the encoder fail because capacity is exceeded, but the errors are silently ignored. This fix improves the error detection and reporting at the point of encoding and modifies the logic for bit-writing (BOOLEANs). The bit_writer is initially allocated 256 bytes (as at present); then, each time the capacity is exceeded, the capacity is incremented by another 256 bytes. This certainly resolves the current problem, but it's not exactly a great fix because the capacity of the bit_writer could now grow substantially. Other data types seem to have a more sophisticated mechanism for writing data which doesn't involve growing or having a fixed-size buffer. It would be desirable to make the BOOLEAN type use this same mechanism if possible, but that level of change is more intrusive and probably requires greater knowledge of the implementation than I possess. resolves: #349

* only manipulate the bit_writer for BOOLEAN data

  Tacky, but I can't think of a better way to do this without specialization.

* better isolation of changes

  Remove the byte tracking from the PlainEncoder and use the existing bytes_written() method in BitWriter. This is neater.

* add test for boolean writer

  The test ensures that we can write > 2048 rows to a parquet file and that, when we read the data back, it finishes without hanging (defined as taking < 5 seconds). If we don't want that extra complexity, we could remove the thread/channel stuff and just try to read the file and let the test runner terminate hanging tests.

* fix capacity calculation error in bool encoding

  values.len() reports the number of values to be encoded and so must be divided by 8 (bits in a byte) to determine the effect on the byte capacity of the bit_writer.
simplify interactions with arrow flight APIs

* simplify interactions with arrow flight APIs

  Initial work to implement some basic traits.

* more polishing and introduction of a couple of wrapper types

  Some more polishing of the basic code I provided last week.

* More polishing

  Add support for representing tickets as base64 encoded strings. Also: more polishing of Display, etc.

* improve BOOLEAN writing logic and report error on encoding fail

  resolves: #349

* only manipulate the bit_writer for BOOLEAN data

  Tacky, but I can't think of a better way to do this without specialization.

* better isolation of changes

  Remove the byte tracking from the PlainEncoder and use the existing bytes_written() method in BitWriter. This is neater.

* add test for boolean writer

  The test ensures that we can write > 2048 rows to a parquet file and that, when we read the data back, it finishes without hanging (defined as taking < 5 seconds).

* fix capacity calculation error in bool encoding

  values.len() reports the number of values to be encoded and so must be divided by 8 (bits in a byte) to determine the effect on the byte capacity of the bit_writer.

* make BasicAuth accessible

  Following the merge with master, make sure this is exposed so that integration tests work. Also: there has been a release since I last looked at this, so update the deprecation warnings.

* fix documentation for ipc_message_from_arrow_schema

  TryFrom, not From.

* replace deprecated functions in integration tests with traits

  Clippy complains about using deprecated functions, so replace them with the new trait support. Also: fix the trait documentation.

* address review comments

  - update deprecated warnings
  - improve TryFrom for DescriptorType
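The boolean-writer test mentioned in those commit messages reads the file back on a separate thread and treats anything longer than five seconds as a hang. A minimal sketch of that thread/channel timeout pattern (with a hypothetical read helper, not the actual test code) is:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the real read path exercised by the test.
fn read_all_rows(path: &str) -> usize {
    let _ = path;
    unimplemented!("open the parquet file and count the rows read")
}

fn assert_file_reads_within_timeout(path: &'static str) {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Send the row count back once (and if) the read finishes.
        let _ = tx.send(read_all_rows(path));
    });
    match rx.recv_timeout(Duration::from_secs(5)) {
        Ok(rows) => assert!(rows > 2048, "expected more than 2048 rows, got {}", rows),
        // A timeout means the reader hung, which is the regression being guarded against.
        Err(_) => panic!("reading {} did not finish within 5 seconds", path),
    }
}
```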
This still happens with parquet 4.4.0. It may be related to another type; I'll try to reproduce it with a minimal example, but right now it always hangs after reading 2046 rows. EDIT: I just saw this fix is on master but not released yet. Is there a way to get it into a 4.4.1, or does it rely too heavily on the 5.0.0 SNAPSHOT? I can try to make a PR based on 4.4.0 if you think you could release it.
I just updated pqrs to use the latest version of parquet-rs and arrow (commit 6698eed) and the issue still happens with the example you provided, @garyanaplan. It is stuck at 2046 rows read. To reproduce:
It is stuck in the print_rows function: https://github.com/MichaelBitard/pqrs/blob/master/src/utils.rs#L55
Hi @MichaelBitard. Unfortunately, the problem was on the write side of the parquet file. I imagine you created your sample.parquet file with the unfixed version, which would mean you still hit the problem when reading it. Can you confirm that sample.parquet was created with the fixed code and then verify that it reads OK? Gary
Oops, you are right, sorry. If I generate sample.parquet with the latest version, it no longer hangs during reading. Thanks for noticing, and sorry again!
Thank you @MichaelBitard for taking the time to report it!
Describe the bug
Reading an apparently valid parquet file (one that can be read by Java tools such as parquet-tools) from any Rust program hangs, and CPU load goes to 100%. Reproduced on both 4.0.0 and 4.1.0. rustc: 1.51.0.
To Reproduce
Create a parquet file with at least 1 row group (e.g. 1), where each row group has more than 2048 rows (e.g. 2049). Run a Rust program to read the file and it will hang when visiting the 2048th row. A Java program (parquet-tools) reads the same file with no issue.
This test program can be used to produce a file that can then be read using parquet-read to reproduce:
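As a rough sketch (assuming the parquet 4.x writer API of the time, not necessarily the exact program used), a writer that produces such a file could look like this: a single row group containing 2049 BOOLEAN values, which is enough to trigger the hang when the file is read back with parquet-read.

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::column::writer::ColumnWriter;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single required BOOLEAN column, matching the schema that reproduces the bug.
    let schema = Arc::new(parse_message_type(
        "message schema { REQUIRED BOOLEAN b; }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let file = File::create("sample.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    // One row group with more than 2048 rows (2049 here) to trigger the hang on read.
    let values: Vec<bool> = (0..2049).map(|i| i % 2 == 0).collect();

    let mut row_group_writer = writer.next_row_group()?;
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        if let ColumnWriter::BoolColumnWriter(ref mut typed) = col_writer {
            typed.write_batch(&values, None, None)?;
        }
        row_group_writer.close_column(col_writer)?;
    }
    writer.close_row_group(row_group_writer)?;
    writer.close()?;
    Ok(())
}
```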
Expected behavior
The read will complete without hanging.
Additional context
My development system is Mac OS X, so this has only been tested on OS X.
rustup reports:
active toolchain
1.51.0-x86_64-apple-darwin (default)
rustc 1.51.0 (2fd73fabe 2021-03-23)