
parquet-fromcsv with writer version v2 does not stop #3408

Closed
XinyuZeng opened this issue Dec 29, 2022 · 4 comments · Fixed by #3447
Labels: bug, parquet (Changes to the parquet crate)

Comments


XinyuZeng commented Dec 29, 2022

Describe the bug

When using the parquet-fromcsv executable to convert a CSV file into Parquet with the writer version set to 2, the program does not stop. The resulting Parquet file grows indefinitely until it fills up the disk.

To Reproduce

I have run it with two different schemas and CSV files, and both fail. One is the TPC-H lineitem table at SF10. The command is:
parquet-fromcsv -s test_schema.txt -i core_test.csv -o core_test.parquet -w 2

Without -w 2 it works fine.

The TPC-H lineitem schema file is:

message schema {
  optional int64 l_orderkey;
  optional int64 l_partkey;
  optional int64 l_suppkey;
  optional int64 l_linenumber;
  optional int64 l_quantity;
  optional double l_extendedprice;
  optional double l_discount;
  optional double l_tax;
  optional binary l_returnflag (String);
  optional binary l_linestatus (String);
  optional binary l_shipdate (String);
  optional binary l_commitdate (String);
  optional binary l_receiptdate (String);
  optional binary l_shipinstruct (String);
  optional binary l_shipmode (String);
  optional binary l_comment (String);
}

The CSV file can be generated using the TPC-H tools.
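For reference, the same writer-version setting can also be exercised programmatically. Below is a minimal sketch (assuming the parquet crate's ArrowWriter and WriterProperties APIs, with a single illustrative string column standing in for l_comment rather than the full lineitem schema). Note that the fallback encoding path discussed below only kicks in once the dictionary encoder exceeds its size limit, so a real reproduction needs many distinct values, not the two rows shown here.

use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single nullable string column standing in for l_comment.
    let schema = Arc::new(Schema::new(vec![Field::new("l_comment", DataType::Utf8, true)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(StringArray::from(vec!["some comment", "another comment"]))],
    )?;

    // Writer version 2.0 corresponds to the CLI's `-w 2`.
    let props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    let file = File::create("core_test.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}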

Expected behavior

The conversion should finish and the program should exit, producing a Parquet file of a reasonable size.

Additional context

XinyuZeng added the bug label on Dec 29, 2022

alamb commented Dec 31, 2022

Thank you for the report, @XinyuZeng!


askoa commented Jan 3, 2023

I'll pick this up. The issue is with the l_comment column: it exhausts the dictionary encoder quickly and starts using the FallbackEncoder.

In V1, the FallbackEncoderImpl::Plain encoder is used, and its buffer is cleared in flush_data_page:

FallbackEncoderImpl::Plain { buffer } => {
    (std::mem::take(buffer), Encoding::PLAIN)
}

In V2, the FallbackEncoderImpl::Delta encoder is used, and its buffer is not cleared in flush_data_page:

out.extend_from_slice(buffer);

The effect is an ever-growing buffer, which is written to the output on every mini batch (1,000 rows), so the program's output consumes all the disk space. The fix would be to use std::mem::take(buffer) in FallbackEncoderImpl::Delta.
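A standalone sketch of the failure mode and the proposed fix (DeltaLikeEncoder and its fields are illustrative names, not the crate's actual types):

struct DeltaLikeEncoder {
    // Bytes encoded since the last data page was flushed.
    buffer: Vec<u8>,
}

impl DeltaLikeEncoder {
    // Buggy flush: copies the buffer into the page but never clears it,
    // so every subsequent flush re-emits all previously written bytes
    // and the output grows without bound.
    fn flush_data_page_buggy(&mut self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.buffer);
    }

    // Fixed flush: std::mem::take moves the bytes out and leaves an
    // empty buffer behind, so each page only contains new data.
    fn flush_data_page_fixed(&mut self, out: &mut Vec<u8>) {
        out.append(&mut std::mem::take(&mut self.buffer));
    }
}

fn main() {
    let mut enc = DeltaLikeEncoder { buffer: vec![1, 2, 3] };
    let mut page = Vec::new();

    enc.flush_data_page_fixed(&mut page);
    enc.buffer.extend_from_slice(&[4, 5]);
    enc.flush_data_page_fixed(&mut page);

    // Each byte appears exactly once; with the buggy version the first
    // three bytes would be duplicated on the second flush.
    assert_eq!(page, vec![1, 2, 3, 4, 5]);
}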

I see the same issue in FallbackEncoderImpl::DeltaLength, so I am assuming that should be fixed too. However, I am not sure where it is used.


alamb commented Jan 3, 2023

cc @tustvold

alamb added the parquet (Changes to the parquet crate) label on Jan 3, 2023

tustvold commented Jan 3, 2023

FallbackEncoderImpl::DeltaLength

This will be used if the user has manually specified an encoding to use for the column, in addition to the default dictionary encoding

The fix would be to use std::mem::take(buffer) in FallbackEncoderImpl::Delta.

FWIW calling buffer.clear() might be better as it would allow reusing the backing allocation
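A small illustration of that trade-off (a standalone example, not the crate's code): clear() keeps the Vec's capacity for the next page, while mem::take hands the whole allocation to the caller and leaves a fresh, empty Vec behind.

fn main() {
    // mem::take: the allocation moves out with the returned Vec,
    // and the encoder's buffer is left with zero capacity.
    let mut buffer: Vec<u8> = Vec::with_capacity(1024);
    buffer.extend_from_slice(b"page data");
    let taken = std::mem::take(&mut buffer);
    assert_eq!(taken.len(), 9);
    assert_eq!(buffer.capacity(), 0);

    // clear(): the bytes must be copied into the output first, but the
    // backing allocation is retained and reused for the next page.
    let mut out = Vec::new();
    let mut buffer2: Vec<u8> = Vec::with_capacity(1024);
    buffer2.extend_from_slice(b"page data");
    out.extend_from_slice(&buffer2);
    buffer2.clear();
    assert!(buffer2.capacity() >= 1024);
}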

This ever growing buffer is written to output every mini batch (1000 rows).

Eek, that would definitely cause issues. It is somewhat concerning that there is no test coverage of this; I guess the fuzz tests don't run into it because they use a lower-level writer API.
