Potentially buffer multiple RecordBatches before writing a parquet row group in ArrowWriter #1214

Conversation
```diff
     let offsets = offsets
         .to_vec()
         .into_iter()
-        .skip(offset)
+        .skip(array.offset() + offset)
```
This was a pre-existing bug that people would have run into if they wrote a sliced RecordBatch
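For context, a minimal sketch of the failure mode (values and column names illustrative, not from this PR): slicing a RecordBatch is zero-copy, so the sliced column carries a non-zero `offset()` into the shared buffers, which the old `.skip(offset)` ignored.

```rust
use std::sync::Arc;
use arrow::array::{Array, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5]))],
    )?;

    // Slicing is zero-copy: the sliced column shares the original buffers
    // and records where it starts via Array::offset().
    let sliced = batch.slice(1, 3);
    assert_eq!(sliced.column(0).offset(), 1);

    // The old code skipped elements starting from index 0; for a sliced
    // array the skip must start at `array.offset() + offset` instead, as
    // the diff above does.
    Ok(())
}
```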
Filed #1226 to track (so that we can document this in the release notes)
Will try to review this tomorrow.
I opted to update the default max row group size and clarify the docs, as discussed in #1213.
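For reference, the row group size is configured through the parquet crate's `WriterProperties` builder; a minimal sketch (the threshold value here is illustrative, not the new default):

```rust
use parquet::file::properties::WriterProperties;

fn writer_props() -> WriterProperties {
    // Row group size is a row-count threshold, not a byte threshold,
    // so an appropriate value depends on how wide the rows are.
    WriterProperties::builder()
        .set_max_row_group_size(1024 * 1024) // rows, not bytes
        .build()
    // The result is passed to ArrowWriter::try_new(sink, schema, Some(props)).
}
```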
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1214      +/-   ##
==========================================
+ Coverage   82.96%   83.18%   +0.22%
==========================================
  Files         178      179       +1
  Lines       51522    51950     +428
==========================================
+ Hits        42744    43216     +472
+ Misses       8778     8734      -44
```
Continue to review full report at Codecov.
Looks good to me. Thank you @tustvold

I am a little worried about the performance hit of the `take` invocation -- I wonder if it is possible to avoid in the common case?

Otherwise, looks good to me 👍 thank you
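For anyone unfamiliar with the kernel under discussion, a minimal sketch of a `take` invocation (illustrative values; assumes the `&dyn Array` signature of `arrow::compute::take` from this era of the crate):

```rust
use arrow::array::{Array, Int32Array, UInt32Array};
use arrow::compute::take;

fn main() {
    // `take` materializes a new array by gathering the requested indices,
    // so it pays a copy even when the indices are contiguous -- hence the
    // performance concern raised above.
    let values = Int32Array::from(vec![10, 20, 30, 40]);
    let indices = UInt32Array::from(vec![1u32, 2]);
    let taken = take(&values, &indices, None).unwrap();
    assert_eq!(taken.len(), 2);
}
```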
Moving back to draft as I'm currently working on other things; will come back to this shortly.
I've pushed code that eliminates the use of `concat`. I'm fairly confident it is correct, but I will write a few more tests before I mark this ready for review again.
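For contrast, a minimal sketch of the approach being eliminated (illustrative arrays): `concat` copies every input buffer into one freshly allocated array, so removing it avoids that per-flush copy.

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::compute::concat;

fn main() {
    // `concat` allocates a fresh array and copies every input buffer into
    // it; avoiding it removes this extra copy from each row group flush.
    let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2]));
    let b: ArrayRef = Arc::new(Int32Array::from(vec![3, 4]));
    let merged = concat(&[a.as_ref(), b.as_ref()]).unwrap();
    assert_eq!(merged.len(), 4);
}
```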
```diff
 @@ -74,7 +74,7 @@ fn create_table(results: &[RecordBatch]) -> Result<Table> {
     let mut cells = Vec::new();
     for col in 0..batch.num_columns() {
         let column = batch.column(col);
-        cells.push(Cell::new(&array_value_to_string(&column, row)?));
+        cells.push(Cell::new(&array_value_to_string(column, row)?));
```
These changes are needed because clippy now checks this file, as the prettyprint feature is enabled by parquet.
I took a good look through this and it looks good to me 👍

cc @nevi-me should he want to review this as well
Which issue does this PR close?

Closes #1211.
Closes #1226.

Rationale for this change

See ticket.
What changes are included in this PR?

Changes ArrowWriter to produce row groups with max_row_group_size rows, except for the final row group in the file.

Are there any user-facing changes?

Yes, ArrowWriter will now buffer up data prior to flush, producing larger row groups in the process. This could be made an opt-in change, but I think this is probably what a lot of people, myself included, thought the writer did.

On a related note, I think the default max row group size is a tad high given it is used as a row-count threshold and not a byte threshold - I've created #1213 to track this.
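To illustrate the new behavior end to end, a minimal sketch (file name, schema, and sizes all illustrative): many small batches written through the same ArrowWriter should now coalesce into row groups of up to max_row_group_size rows, rather than one tiny row group per batch.

```rust
use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let props = WriterProperties::builder()
        .set_max_row_group_size(100) // illustrative row-count threshold
        .build();
    let file = std::fs::File::create("buffered.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Ten 25-row batches: with buffering these coalesce into row groups
    // of up to 100 rows instead of ten 25-row row groups.
    for _ in 0..10 {
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from(vec![0; 25]))],
        )?;
        writer.write(&batch)?;
    }
    writer.close()?;
    Ok(())
}
```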