
parquet-fromcsv with writer version v2 does not stop #3408

Closed
XinyuZeng opened this issue Dec 29, 2022 · 4 comments · Fixed by #3447
Labels: bug, parquet (Changes to the parquet crate)

Comments


XinyuZeng commented Dec 29, 2022

Describe the bug

When using the parquet-fromcsv executable to convert a CSV file into Parquet with the writer version set to 2, the program does not stop. The resulting Parquet file grows indefinitely until it fills up the disk.

To Reproduce

I have run it with two different schemas and CSV files, and both fail. One is the TPC-H lineitem table at SF10. The command is:
parquet-fromcsv -s test_schema.txt -i core_test.csv -o core_test.parquet -w 2

Without -w 2 it works fine.

The TPC-H lineitem schema file is:

message schema {
  optional int64 l_orderkey;
  optional int64 l_partkey;
  optional int64 l_suppkey;
  optional int64 l_linenumber;
  optional int64 l_quantity;
  optional double l_extendedprice;
  optional double l_discount;
  optional double l_tax;
  optional binary l_returnflag (String);
  optional binary l_linestatus (String);
  optional binary l_shipdate (String);
  optional binary l_commitdate (String);
  optional binary l_receiptdate (String);
  optional binary l_shipinstruct (String);
  optional binary l_shipmode (String);
  optional binary l_comment (String);
}

The CSV file can be generated using the TPC-H tools.
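For reference, the same writer-version setting can also be exercised programmatically. Below is a minimal sketch (assuming the parquet crate's ArrowWriter and WriterProperties APIs, with a single illustrative string column standing in for l_comment rather than the full lineitem schema). Note that the fallback encoding path discussed below only kicks in once the dictionary encoder exceeds its size limit, so a real reproduction needs many distinct values, not the two rows shown here.

use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single nullable string column standing in for l_comment.
    let schema = Arc::new(Schema::new(vec![Field::new("l_comment", DataType::Utf8, true)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(StringArray::from(vec!["some comment", "another comment"]))],
    )?;

    // Writer version 2.0 corresponds to the CLI's `-w 2`.
    let props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    let file = File::create("core_test.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}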

Expected behavior

The conversion should finish and the program should exit, producing a Parquet file of a reasonable size.

Additional context

XinyuZeng added the bug label on Dec 29, 2022

alamb commented Dec 31, 2022

Thank you for the report, @XinyuZeng!


askoa commented Jan 3, 2023

I'll pick this up. The issue is with the l_comment column: it exhausts the dictionary encoder quickly and starts using the FallbackEncoder.

In V1, the FallbackEncoderImpl::Plain encoder is used, and its buffer is cleared in flush_data_page:

FallbackEncoderImpl::Plain { buffer } => {
    (std::mem::take(buffer), Encoding::PLAIN)
}

In V2, the FallbackEncoderImpl::Delta encoder is used, and its buffer is not cleared in flush_data_page:

out.extend_from_slice(buffer);

The effect is an ever-growing buffer, which is written to the output on every mini batch (1,000 rows), so the program's output consumes all the disk space. The fix would be to use std::mem::take(buffer) in FallbackEncoderImpl::Delta.
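A standalone sketch of the failure mode and the proposed fix (DeltaLikeEncoder and its fields are illustrative names, not the crate's actual types):

struct DeltaLikeEncoder {
    // Bytes encoded since the last data page was flushed.
    buffer: Vec<u8>,
}

impl DeltaLikeEncoder {
    // Buggy flush: copies the buffer into the page but never clears it,
    // so every subsequent flush re-emits all previously written bytes
    // and the output grows without bound.
    fn flush_data_page_buggy(&mut self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.buffer);
    }

    // Fixed flush: std::mem::take moves the bytes out and leaves an
    // empty buffer behind, so each page only contains new data.
    fn flush_data_page_fixed(&mut self, out: &mut Vec<u8>) {
        out.append(&mut std::mem::take(&mut self.buffer));
    }
}

fn main() {
    let mut enc = DeltaLikeEncoder { buffer: vec![1, 2, 3] };
    let mut page = Vec::new();

    enc.flush_data_page_fixed(&mut page);
    enc.buffer.extend_from_slice(&[4, 5]);
    enc.flush_data_page_fixed(&mut page);

    // Each byte appears exactly once; with the buggy version the first
    // three bytes would be duplicated on the second flush.
    assert_eq!(page, vec![1, 2, 3, 4, 5]);
}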

I see the same issue in FallbackEncoderImpl::DeltaLength, so I am assuming that should be fixed too. However, I am not sure where it is used.


alamb commented Jan 3, 2023

cc @tustvold

alamb added the parquet (Changes to the parquet crate) label on Jan 3, 2023

tustvold commented Jan 3, 2023

FallbackEncoderImpl::DeltaLength

This will be used if the user has manually specified an encoding to use for the column, in addition to the default dictionary encoding

The fix would be to use std::mem::take(buffer) in FallbackEncoderImpl::Delta.

FWIW calling buffer.clear() might be better as it would allow reusing the backing allocation
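A small illustration of that trade-off (a standalone example, not the crate's code): clear() keeps the Vec's capacity for the next page, while mem::take hands the whole allocation to the caller and leaves a fresh, empty Vec behind.

fn main() {
    // mem::take: the allocation moves out with the returned Vec,
    // and the encoder's buffer is left with zero capacity.
    let mut buffer: Vec<u8> = Vec::with_capacity(1024);
    buffer.extend_from_slice(b"page data");
    let taken = std::mem::take(&mut buffer);
    assert_eq!(taken.len(), 9);
    assert_eq!(buffer.capacity(), 0);

    // clear(): the bytes must be copied into the output first, but the
    // backing allocation is retained and reused for the next page.
    let mut out = Vec::new();
    let mut buffer2: Vec<u8> = Vec::with_capacity(1024);
    buffer2.extend_from_slice(b"page data");
    out.extend_from_slice(&buffer2);
    buffer2.clear();
    assert!(buffer2.capacity() >= 1024);
}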

This ever growing buffer is written to output every mini batch (1000 rows).

Eek, that would definitely cause issues. It is somewhat concerning that there is no test coverage of this; I guess the fuzz tests don't run into it because they use a lower-level writer API.
