Feature request: extend DataColumn API to read column values directly into provided Span/Memory/Array #507
Comments
Interesting, I have the exact same need to concatenate files: #515
It should be possible soon, but it needs some refactoring and possibly breaking changes. This library was created before Span<T> existed ;)
The Rust crate for working with Parquet files has a really nice API (loading data also works about 2 times faster :-)):

```rust
use std::fs::File;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};

let file_reader = SerializedFileReader::new(File::open("data.parquet").unwrap()).unwrap();
let metadata = file_reader.metadata();
let group_count = metadata.num_row_groups();

// One buffer, reused across all row groups -- no per-group allocation.
let mut values = vec![];
for group_index in 0..group_count {
    let group_reader = file_reader.get_row_group(group_index).unwrap();
    let group_row_count = metadata.row_group(group_index).num_rows() as usize;
    if let Ok(ColumnReader::Int96ColumnReader(ref mut column_reader)) =
        group_reader.get_column_reader(0)
    {
        values.clear();
        // Reads up to group_row_count records into the preallocated Vec.
        column_reader.read_records(group_row_count, None, None, &mut values).unwrap();
    }
}
```
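For comparison, here is a rough sketch of the same per-group read with Parquet.Net's current API (assuming the v4-style async names `ParquetReader.CreateAsync` / `ReadColumnAsync`; exact namespaces and signatures differ between versions). The key contrast with the Rust snippet above is that every `ReadColumnAsync` call returns a `DataColumn` backed by a freshly allocated array:

```csharp
// Sketch only: assumes Parquet.Net's v4-style async API; names may vary by version.
using Parquet;
using Parquet.Data;
using Parquet.Schema;

using var stream = System.IO.File.OpenRead("data.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(stream);
DataField field = reader.Schema.GetDataFields()[0];

for (int g = 0; g < reader.RowGroupCount; g++)
{
    using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);

    // Each call materializes a new DataColumn with a freshly allocated
    // backing array; there is no way to hand it a preallocated buffer.
    DataColumn column = await rowGroup.ReadColumnAsync(field);
    Array data = column.Data;
    // ... process data ...
}
```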
Usually when someone says "x times faster" it's clickbait ;) I'd need to see the performance measurement methodology and actual numbers.
You mean like this one? ;)
Yeah, that's wrong. ParquetSharp is actually slower, despite being a C++ wrapper. Notice the lack of any reference to benchmarking code, data set size, platform, etc. Maybe I should publish detailed numbers and also put them on the front page :)
Writing 1 million rows on Linux x64 with Parquet.Net vs ParquetSharp: basically, Parquet.Net is on average 3 times faster and uses less RAM, often considerably less.
I was just kidding, no need to show off on the front page. 😅 |
Impressive results, keep it up! 👏 |
I was trying to say that the Rust version is faster, not a different implementation for C#. One of the reasons is that its API allows reading into preallocated arrays and does not involve the GC. Here are the numbers I get when reading 4 columns (412M rows) from a 38 GB file with 34 columns in 83 row groups of 5M rows each, on Windows:
[benchmark table omitted]
Issue description
I have a use case where I need to read rather large Parquet files: 5-50 GB, 100 to 10,000 row groups, with 1,000,000-20,000,000 rows per group. Group sizes are bounded and known beforehand, and groups can be processed independently. So I would like to preallocate the column value arrays once (or actually twice) and then read column values directly into a preallocated array/Span/Memory while iterating over the groups.
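A hypothetical sketch of what such an extension could look like, reusing `reader` and `field` from the C# sketch earlier in the thread; the `ReadColumnAsync(field, Memory<T>)` overload, `maxRowsPerGroup`, and `ProcessGroup` are illustrative names that do not exist in the library today:

```csharp
// Hypothetical -- no such overload exists in Parquet.Net today.
// The caller owns the buffer, sized once for the largest row group,
// and the reader fills it in place instead of allocating per group.
long[] buffer = new long[maxRowsPerGroup];

for (int g = 0; g < reader.RowGroupCount; g++)
{
    using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);

    // Desired shape: read directly into the caller's Memory<T> and
    // return how many values were actually written for this group.
    int rowsRead = await rowGroup.ReadColumnAsync(field, buffer.AsMemory());

    ProcessGroup(buffer.AsSpan(0, rowsRead));
}
```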