Feature request: extend DataColumn API to read column values directly into provided Span/Memory/Array #507
Comments
Interesting, I have the exact same need to concatenate files: #515
It should be possible soon, but it needs some refactoring and possibly breaking changes. This library was created before Span<T> existed ;)
The Rust crate for working with Parquet files has a really nice API (loading data also works about 2 times faster :-)):

```rust
use std::fs::File;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};

let file_reader = SerializedFileReader::new(File::open("data.parquet").unwrap()).unwrap();
let metadata = file_reader.metadata();
let group_count = metadata.num_row_groups();

// One buffer, reused across all row groups -- no per-group allocation.
let mut values = vec![];
for group_index in 0..group_count {
    let group_reader = file_reader.get_row_group(group_index).unwrap();
    let group_row_count = metadata.row_group(group_index).num_rows() as usize;
    if let Ok(ColumnReader::Int96ColumnReader(ref mut column_reader)) =
        group_reader.get_column_reader(0)
    {
        values.clear();
        // Reads up to group_row_count records into the preallocated Vec.
        column_reader.read_records(group_row_count, None, None, &mut values).unwrap();
    }
}
```
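For comparison, here is a rough sketch of the same per-group read with Parquet.Net's current API (assuming the v4-style async names `ParquetReader.CreateAsync` / `ReadColumnAsync`; exact namespaces and signatures differ between versions). The key contrast with the Rust snippet above is that every `ReadColumnAsync` call returns a `DataColumn` backed by a freshly allocated array:

```csharp
// Sketch only: assumes Parquet.Net's v4-style async API; names may vary by version.
using Parquet;
using Parquet.Data;
using Parquet.Schema;

using var stream = System.IO.File.OpenRead("data.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(stream);
DataField field = reader.Schema.GetDataFields()[0];

for (int g = 0; g < reader.RowGroupCount; g++)
{
    using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);

    // Each call materializes a new DataColumn with a freshly allocated
    // backing array; there is no way to hand it a preallocated buffer.
    DataColumn column = await rowGroup.ReadColumnAsync(field);
    Array data = column.Data;
    // ... process data ...
}
```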
Usually when someone says "x times faster" it's clickbait ;) I'd need to see the performance measurement methodology and actual numbers.
You mean like this one? ;)
Yeah, that's wrong. ParquetSharp is actually slower, despite being a C++ wrapper. Notice the lack of any reference to benchmarking code, data set size, platform, etc. Maybe I should publish detailed numbers and also put them on the front page :)
Writing 1 million rows on Linux x64 with Parquet.Net vs ParquetSharp: basically, Parquet.Net is on average 3 times faster and uses less RAM, often considerably less.
I was just kidding, no need to show off on the front page. 😅 |
Impressive results, keep it up! 👏 |
I was trying to say that the Rust version is faster, not a different implementation for C#. One of the reasons is that its API allows reading into preallocated arrays and does not involve the GC. Here are the numbers I get when reading 4 columns (412M rows) from a 38 GB file with 34 columns in 83 row groups of 5M rows each, on Windows:
[benchmark table omitted]
Issue description
I have a use case where I need to read rather large Parquet files: 5-50 GB, 100 to 10,000 row groups, with 1,000,000-20,000,000 rows per group. Group sizes are bounded and known beforehand, and groups can be processed independently. So I would like to preallocate the column value arrays once (or actually twice) and then read column values directly into a preallocated array/Span/Memory while iterating over the groups.
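A hypothetical sketch of what such an extension could look like, reusing `reader` and `field` from the C# sketch earlier in the thread; the `ReadColumnAsync(field, Memory<T>)` overload, `maxRowsPerGroup`, and `ProcessGroup` are illustrative names that do not exist in the library today:

```csharp
// Hypothetical -- no such overload exists in Parquet.Net today.
// The caller owns the buffer, sized once for the largest row group,
// and the reader fills it in place instead of allocating per group.
long[] buffer = new long[maxRowsPerGroup];

for (int g = 0; g < reader.RowGroupCount; g++)
{
    using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);

    // Desired shape: read directly into the caller's Memory<T> and
    // return how many values were actually written for this group.
    int rowsRead = await rowGroup.ReadColumnAsync(field, buffer.AsMemory());

    ProcessGroup(buffer.AsSpan(0, rowsRead));
}
```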