parquet-fromcsv with writer version v2 does not stop #3408
Comments
Thank you for the report @XinyuZeng !
I'll pick this up. The issue is with the byte-array column encoder in arrow-rs/parquet/src/arrow/arrow_writer/byte_array.rs, lines 282 to 284 at 08a976f: the buffer is not cleared when it is flushed. The effect of this is an ever-growing buffer, which is written to the output on every mini batch (1000 rows). Hence the program's output consumes all the disk space. The fix would be to clear (take) the buffer when flushing it. I see the same issue in the other fallback encodings.
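The flush bug described above can be shown with a minimal, self-contained sketch. This is plain Rust, not the actual arrow-rs code; `Encoder`, `flush_buggy`, and `flush_fixed` are hypothetical names standing in for the real encoder:

```rust
// Sketch of the bug class: an encoder that accumulates bytes and is
// supposed to emit them once per data page.
struct Encoder {
    buffer: Vec<u8>,
}

impl Encoder {
    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
    }

    // Buggy flush: the buffer is never cleared, so every page re-emits
    // all bytes written so far and the output grows without bound.
    fn flush_buggy(&mut self) -> Vec<u8> {
        self.buffer.clone()
    }

    // Fixed flush: take the buffer, leaving an empty one behind, so each
    // page contains only its own bytes.
    fn flush_fixed(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buffer)
    }
}

fn main() {
    let mut e = Encoder { buffer: Vec::new() };
    e.write(b"page1");
    assert_eq!(e.flush_buggy(), b"page1".to_vec());
    e.write(b"page2");
    // Page 2 still contains page 1's bytes: the ever-growing output.
    assert_eq!(e.flush_buggy(), b"page1page2".to_vec());

    let mut f = Encoder { buffer: Vec::new() };
    f.write(b"page1");
    assert_eq!(f.flush_fixed(), b"page1".to_vec());
    f.write(b"page2");
    assert_eq!(f.flush_fixed(), b"page2".to_vec());
}
```

`std::mem::take` replaces the `Vec` with an empty default in place, so the fixed variant needs no extra allocation bookkeeping compared to `clone` followed by `clear`.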
cc @tustvold
This will be used if the user has manually specified an encoding to use for the column, in addition to the default dictionary encoding
FWIW calling
Eek, that would definitely cause issues. It is somewhat concerning that there is no test coverage of this; I guess the fuzz tests don't run into it because they use a lower-level writer API.
Describe the bug
When using the parquet-fromcsv executable to convert CSV into Parquet with the writer version set to 2, the program won't stop. The resulting Parquet file grows indefinitely until it fills up the disk.
To Reproduce
I've run it with two different schemas and CSV files; both failed. One is the TPCH lineitem table at SF10. The command is:
parquet-fromcsv -s test_schema.txt -i core_test.csv -o core_test.parquet -w 2
Without -w 2 it works fine. The TPCH lineitem schema file is:
CSV file can be generated using TPCH tools.
Expected behavior
Additional context