Parquet: derive boundary order when writing #5110
Conversation
I worry a bit about the performance impact of this, especially as I think for strings it will perform multiple allocations per-value.
Perhaps we might do something simpler, whereby the user can specify the boundary order via WriterProperties or something? This would also avoid potential issues where the resulting boundary order becomes data-dependent, which users might find surprising.
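For illustration, a minimal self-contained sketch of what that alternative could look like; every name below is invented for this example, and none of it is the real parquet crate API:

#[derive(Clone, Copy)]
enum BoundaryOrder {
    Unordered,
    Ascending,
    Descending,
}

// Hypothetical property: None = keep writing UNORDERED as today,
// Some = trust the caller's declaration instead of deriving it from the data.
struct BoundaryOrderProperty {
    declared: Option<BoundaryOrder>,
}

impl BoundaryOrderProperty {
    fn unordered() -> Self {
        Self { declared: None }
    }

    // Hypothetical setter: the caller asserts that pages will arrive sorted.
    fn set_boundary_order(mut self, order: BoundaryOrder) -> Self {
        self.declared = Some(order);
        self
    }
}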
The casting back to […]. Could try investigating pushing this check logic earlier up the chain, such as to here: arrow-rs/parquet/src/column/writer/mod.rs, lines 612 to 667 in df69ef5.
Since I think […]. Alternatively, could try to find a way to get […]. Could also try to make […].
I'm not sure this should be writer configurable, as if I'm understanding correctly then […]
My understanding of the use-case for this feature is where you do know this, e.g. because you've configured the query producing the data to ensure this.
I see your point, though the implementations in arrow c++ and Impala seem to point towards this property being dynamically derived, at least from the code snippets I've seen.
Aah, I see this is being done per-page, not per-value; that is a lot less problematic 😄
Might be worth a try; if it turns into a mess, let me know and we can proceed as written here.
Tried pushing the boundary check logic up out of ColumnIndexBuilder.
If this approach is preferable then I will proceed with it and add the tests (or I can revert to the previous approach).
parquet/src/column/writer/mod.rs
Outdated
let null_page = (self.page_metrics.num_buffered_rows as u64)
    == self.page_metrics.num_page_nulls;
if !null_page {
    if let Some((latest_min, latest_max)) = &self.latest_non_null_data_page_min_max
    {
        if self.data_page_boundary_ascending {
            // If latest min/max are greater than new min/max then not ascending anymore
            let not_ascending = compare_greater(&self.descr, latest_min, &min)
                || compare_greater(&self.descr, latest_max, &max);
            if not_ascending {
                self.data_page_boundary_ascending = false;
            }
        }

        if self.data_page_boundary_descending {
            // If new min/max are greater than latest min/max then not descending anymore
            let not_descending = compare_greater(&self.descr, &min, latest_min)
                || compare_greater(&self.descr, &max, latest_max);
            if not_descending {
                self.data_page_boundary_descending = false;
            }
        }
    }
    self.latest_non_null_data_page_min_max = Some((min.clone(), max.clone()));
}
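As a side note, here is a minimal sketch (assuming the two flags tracked above; this is not the exact arrow-rs code) of how they could collapse into the final thrift BoundaryOrder once all pages are written:

enum BoundaryOrder {
    Unordered,
    Ascending,
    Descending,
}

// A column whose non-null pages never violate either order (e.g. a single
// data page) leaves both flags set; picking Ascending there is an arbitrary
// but valid choice, since either order would be spec-compliant.
fn derive_boundary_order(ascending: bool, descending: bool) -> BoundaryOrder {
    match (ascending, descending) {
        (true, _) => BoundaryOrder::Ascending,
        (false, true) => BoundaryOrder::Descending,
        (false, false) => BoundaryOrder::Unordered,
    }
}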
Tried putting the incremental check here instead of inside update_column_offset_index(..), as I couldn't figure out an easy way to get the T: ParquetValueType out of a Statistics enum.
One caveat about putting this check here is that it compares the min/maxes before truncation occurs, though I think this should still be ok.
Whilst I think this is correct, perhaps you could just change update_column_offset_index to instead take Option<&ValueStatistics<T>>? This would likely make the existing logic faster, and would make this logic perhaps slightly easier to follow; it is a little surprising that boundary_order is being updated outside of update_column_offset_index.
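For context, a self-contained sketch (with simplified stand-in types, not the real arrow-rs ones) of why the typed form is easier to work with than the untyped enum:

// Simplified stand-ins for the parquet types discussed above.
struct ValueStatistics<T> {
    min: T,
    max: T,
}

enum Statistics {
    Int32(ValueStatistics<i32>),
    Double(ValueStatistics<f64>),
}

// With the untyped enum, every caller has to match to recover the value type:
fn min_as_i32(stats: &Statistics) -> Option<&i32> {
    match stats {
        Statistics::Int32(v) => Some(&v.min),
        _ => None,
    }
}

// With Option<&ValueStatistics<T>>, T flows through directly and the
// min/max can be compared without any downcasting:
fn update_index<T: PartialOrd>(stats: Option<&ValueStatistics<T>>) -> bool {
    stats.map_or(true, |s| s.min <= s.max)
}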
I like where this is headed; sorry for taking so long to review.
I've refactored as per the suggestion, and it worked out quite well.
This looks good; my only question concerns the correctness of the null handling. I couldn't find information on how nulls are supposed to be handled...
&[
    &[Some(-10), Some(10)],
    &[Some(-5), Some(11)],
    &[None],
Is there some documentation of this behaviour in the Parquet spec, or failing that, an example implementation doing something similar? Normally I would have expected a null to break the ordering, as it would break the ability to do a binary search.
There is this example test in arrow c++ that shows null pages don't have an effect on ordering: […]
Same here: […]
Though I do agree that the Parquet spec is lacking explicit documentation on how null pages are handled.
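To make the discussed behaviour concrete, here is a small self-contained sketch (an invented helper, not the arrow-rs implementation) mirroring the test data quoted above; the all-null page is simply skipped when checking the ordering:

// Invented helper, not the arrow-rs implementation: checks whether the
// per-page min/max values are ascending, skipping all-null pages.
fn is_ascending(pages: &[&[Option<i32>]]) -> bool {
    let mut prev: Option<(i32, i32)> = None; // (min, max) of last non-null page
    for page in pages {
        let values: Vec<i32> = page.iter().flatten().copied().collect();
        if values.is_empty() {
            continue; // all-null page: does not participate in the ordering
        }
        let (min, max) = (
            *values.iter().min().unwrap(),
            *values.iter().max().unwrap(),
        );
        if let Some((prev_min, prev_max)) = prev {
            if prev_min > min || prev_max > max {
                return false;
            }
        }
        prev = Some((min, max));
    }
    true
}

fn main() {
    // Mirrors the quoted test data: the trailing null page is ignored,
    // so the column still counts as ascending.
    assert!(is_ascending(&[
        &[Some(-10), Some(10)],
        &[Some(-5), Some(11)],
        &[None],
    ]));
}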
Which issue does this PR close?
Closes #5074
Rationale for this change
What changes are included in this PR?
Changes ColumnIndexBuilder to add a new append method which incrementally keeps track of the sort state of the data pages' min/max values, in order to eventually derive the boundary_order thrift field.
Are there any user-facing changes?