
Feature request: extend DataColumn API to read column values directly into provided Span/Memory/Array #507

Open
i-sinister opened this issue May 2, 2024 · 10 comments

@i-sinister

Issue description

I have a use case where I need to read rather large Parquet files: 5–50 GB, 100 to 10,000 row groups with 1,000,000–20,000,000 rows per group. Group sizes are bounded and known beforehand, and groups can be processed independently. So I would like to preallocate the column value arrays once (or actually twice) and then read values from each column directly into the preallocated array/Span/Memory while iterating over the groups.
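The allocation pattern being requested can be sketched in plain Rust, with no Parquet library involved (`read_group_into` is a hypothetical stand-in for a column reader that fills a caller-provided buffer): preallocate one buffer sized for the largest group, then clear and refill it per group without ever reallocating.

```rust
// Hypothetical stand-in for a column reader that fills a caller-provided buffer.
fn read_group_into(group_index: usize, rows: usize, values: &mut Vec<i64>) {
    values.clear(); // keeps the allocated capacity, drops length to 0
    values.extend((0..rows as i64).map(|v| v + group_index as i64));
}

fn main() {
    // Capacity for the largest group (scaled down for the example).
    let max_group_rows = 20_000;
    let mut values: Vec<i64> = Vec::with_capacity(max_group_rows);
    let initial_capacity = values.capacity();
    for group_index in 0..100 {
        read_group_into(group_index, max_group_rows, &mut values);
    }
    // No reallocation happened: capacity is unchanged after 100 groups.
    assert_eq!(values.capacity(), initial_capacity);
    println!("capacity stayed at {}", initial_capacity);
}
```

The point is that the reader API takes `&mut` to caller-owned storage, so the steady-state read loop produces no garbage at all, which is exactly what a `Span<T>`/`Memory<T>` overload would enable on the .NET side.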

@Pragmateek

Interesting, I have the exact same need to concatenate files: #515

@aloneguid
Owner

It should be possible soon but needs some refactoring and possibly breaking changes. This library was created before Span existed ;)

@i-sinister
Author

i-sinister commented Jun 6, 2024

The Rust crate for working with Parquet files has a really nice API for this (loading data is also about 2 times faster (-:):
https://docs.rs/parquet/51.0.0/parquet/column/reader/struct.GenericColumnReader.html#method.read_records

let mut values = vec![];
...
for group_index in 0..group_count {
    let group_reader = file_reader.get_row_group(group_index).unwrap();
    let group_metadata = metadata.row_group(group_index);
    let group_row_count = group_metadata.num_rows() as u64;
    if let Ok(ColumnReader::Int96ColumnReader(ref mut column_reader)) = group_reader.get_column_reader(0) {
        // clear() keeps the allocated capacity, so the buffer is reused across groups
        values.clear();
        column_reader.read_records(group_row_count as usize, None, None, &mut values).unwrap();
    }
}

@aloneguid
Owner

Usually when someone says "x times faster" it's clickbait ;) I'd need to see the performance measurement methodology and the actual numbers.

@Pragmateek

Usually when someone says "x times faster" it's clickbait ;) I'd need to see the performance measurement methodology and the actual numbers.

You mean like this one? ;)

@aloneguid
Owner

Yeah, that's wrong. ParquetSharp is actually slower, despite being a C++ wrapper. Notice the lack of any reference to benchmark code, data set size, platform etc. Maybe I should publish detailed numbers and also put them on the front page :)

@aloneguid
Owner

Writing 1 million rows on Linux x64 with Parquet.Net vs ParquetSharp:

| Method | DataType | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| ParquetNet | Double | 17.423 ms | 10.6128 ms | 0.5817 ms | 187.5000 | 187.5000 | 187.5000 | 19463.68 KB |
| ParquetSharp | Double | 31.025 ms | 15.2193 ms | 0.8342 ms | 937.5000 | 937.5000 | 937.5000 | 19615.25 KB |
| ParquetNet | Int32 | 6.098 ms | 2.4644 ms | 0.1351 ms | 187.5000 | - | - | 774.65 KB |
| ParquetSharp | Int32 | 22.634 ms | 4.2477 ms | 0.2328 ms | 1000.0000 | 1000.0000 | 1000.0000 | 19900.72 KB |
| ParquetNet | Double? | 1.375 ms | 1.3348 ms | 0.0732 ms | 48.8281 | 1.9531 | - | 198.42 KB |
| ParquetSharp | Double? | 4.003 ms | 1.0387 ms | 0.0569 ms | 54.6875 | 7.8125 | - | 249.16 KB |
| ParquetNet | Int32? | 1.071 ms | 0.6018 ms | 0.0330 ms | 1.9531 | - | - | 14.11 KB |
| ParquetSharp | Int32? | 3.434 ms | 0.6685 ms | 0.0366 ms | 54.6875 | 15.6250 | - | 237.34 KB |

Basically Parquet.Net is on average 3 times faster and uses less RAM, often considerably less.
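As a sanity check on the "on average 3 times faster" summary, the per-type speedups implied by the Mean column above can be computed directly (a quick sketch; the numbers are copied from the table, nothing else is assumed):

```rust
fn main() {
    // (DataType, ParquetNet mean ms, ParquetSharp mean ms), from the table above.
    let means = [
        ("Double", 17.423, 31.025),
        ("Int32", 6.098, 22.634),
        ("Double?", 1.375, 4.003),
        ("Int32?", 1.071, 3.434),
    ];
    let mut sum = 0.0;
    for (name, net, sharp) in means {
        // Speedup = how many times longer ParquetSharp took than Parquet.Net.
        let speedup = sharp / net;
        println!("{name}: {speedup:.2}x");
        sum += speedup;
    }
    println!("average speedup: {:.2}x", sum / means.len() as f64);
}
```

The individual ratios range from about 1.8x to 3.7x, and the mean lands at roughly 2.9x, which is consistent with the rounded "3 times faster" claim.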

@Pragmateek

Pragmateek commented Jun 7, 2024

Yeah, that's wrong. ParquetSharp is actually slower, despite being a C++ wrapper. Notice the lack of any reference to benchmark code, data set size, platform etc. Maybe I should publish detailed numbers and also put them on the front page :)

I was just kidding, no need to show off on the front page. 😅
For me the real killer feature is the bidirectional serialization, kind of Object Parquet Mapping.

@Pragmateek

Writing 1 million rows on Linux x64 with Parquet.Net vs ParquetSharp:

| Method | DataType | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| ParquetNet | Double | 17.423 ms | 10.6128 ms | 0.5817 ms | 187.5000 | 187.5000 | 187.5000 | 19463.68 KB |
| ParquetSharp | Double | 31.025 ms | 15.2193 ms | 0.8342 ms | 937.5000 | 937.5000 | 937.5000 | 19615.25 KB |
| ParquetNet | Int32 | 6.098 ms | 2.4644 ms | 0.1351 ms | 187.5000 | - | - | 774.65 KB |
| ParquetSharp | Int32 | 22.634 ms | 4.2477 ms | 0.2328 ms | 1000.0000 | 1000.0000 | 1000.0000 | 19900.72 KB |
| ParquetNet | Double? | 1.375 ms | 1.3348 ms | 0.0732 ms | 48.8281 | 1.9531 | - | 198.42 KB |
| ParquetSharp | Double? | 4.003 ms | 1.0387 ms | 0.0569 ms | 54.6875 | 7.8125 | - | 249.16 KB |
| ParquetNet | Int32? | 1.071 ms | 0.6018 ms | 0.0330 ms | 1.9531 | - | - | 14.11 KB |
| ParquetSharp | Int32? | 3.434 ms | 0.6685 ms | 0.0366 ms | 54.6875 | 15.6250 | - | 237.34 KB |

Basically Parquet.Net is on average 3 times faster and uses less RAM, often considerably less.

Impressive results, keep it up! 👏

@i-sinister
Author

I was trying to say that the Rust version is faster, not a different C# implementation. One of the reasons is that the API allows reading into preallocated arrays and does not put pressure on the GC.

Here are the numbers I get when reading 4 columns (412M rows) from a 38 GB file with 34 columns, in 83 row groups of 5M rows each, on Windows:

| version | duration |
|---|---|
| rust | read 412395458 rows in 83 groups in 12.44s |
| net8.0 | read 412395458 rows in 83 groups in 00:00:19.1050095 |
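A quick back-of-the-envelope calculation of what those two runs imply (only the two durations and the row count above are used; the 19.105 s figure truncates 00:00:19.1050095):

```rust
fn main() {
    let rows = 412_395_458_f64;
    let rust_secs = 12.44;
    let net_secs = 19.105; // 00:00:19.1050095, truncated to ms

    // Throughput in millions of rows per second for each version.
    println!("rust:   {:.1} Mrows/s", rows / rust_secs / 1e6);
    println!("net8.0: {:.1} Mrows/s", rows / net_secs / 1e6);

    // How many times longer the .NET run took than the Rust run.
    println!("ratio: {:.2}x", net_secs / rust_secs);
}
```

So on this particular file the gap is roughly 1.5x rather than 2x, with Rust sustaining around 33 Mrows/s against about 22 Mrows/s for net8.0.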
