ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values #7546
bkietz commented Jun 25, 2020
jorisvandenbossche left a comment
Thanks, looks perfect!
The shape is fine, I think: a dict of {col: {min: val, max: val}} seems the most logical and general structure for storing it. And in the end, Dask will need to massage it into whatever structure it needs anyway (combining it across multiple fragments). cc @rjzamora
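To illustrate the shape discussed above, here is a minimal sketch in plain Python of how per-fragment statistics dicts of the form {col: {min: val, max: val}} might be combined across several fragments. The `combine_statistics` helper and the sample values are hypothetical: they are not part of this PR, pyarrow, or Dask, and the `"min"`/`"max"` keys are assumed from the description above.

```python
def combine_statistics(per_fragment_stats):
    """Merge a list of {col: {"min": v, "max": v}} dicts into one dict.

    Hypothetical helper for illustration only; not part of pyarrow or Dask.
    A column missing from one fragment's dict is skipped for that fragment.
    """
    combined = {}
    for stats in per_fragment_stats:
        for col, col_stats in stats.items():
            if col not in combined:
                # First time we see this column: copy its bounds.
                combined[col] = {"min": col_stats["min"], "max": col_stats["max"]}
            else:
                # Widen the running bounds with this fragment's bounds.
                combined[col]["min"] = min(combined[col]["min"], col_stats["min"])
                combined[col]["max"] = max(combined[col]["max"], col_stats["max"])
    return combined


# Made-up statistics for two fragments, in the shape described above.
fragment_stats = [
    {"x": {"min": 0, "max": 9}, "y": {"min": "a", "max": "f"}},
    {"x": {"min": 10, "max": 19}, "y": {"min": "c", "max": "z"}},
]

overall = combine_statistics(fragment_stats)
print(overall)
# {'x': {'min': 0, 'max': 19}, 'y': {'min': 'a', 'max': 'z'}}
```

Keeping the per-fragment result as a plain nested dict means a consumer can fold it into whatever structure it needs without depending on any Arrow-specific type.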
    }

    return statistics
Do we want to expose the statistics_expression as well? (not fully sure if it would have a use case, so maybe we should only do that if we have one)
If that's desired, it can wait for a follow-up.
fsaintjacques left a comment
LGTM, clean solution.
Thanks for the great work here @bkietz! This is wonderful - Dask uses the min/max statistics to calculate
A note on other (less-critical, but useful) statistics:
@rjzamora arrow/python/pyarrow/_dataset.pyx Lines 845 to 847 in cd3ed60
For the
@rjzamora I opened https://issues.apache.org/jira/browse/ARROW-9346 to track the