Conversation

@bkietz (Member) commented Jun 25, 2020

stats = parquet_fragment.row_groups[0].statistics
assert stats == {
  'normal_column': {'min': 1, 'max': 2},
  'all_null_column': {'min': None, 'max': None},
  'column_without_stats': None,
}

@jorisvandenbossche (Member) left a comment:

Thanks, looks perfect!

The shape is fine, I think; a dict of {col: {min: val, max: val}} seems the most logical and general structure to store it. In the end, Dask will need to massage it into whatever structure it needs anyway (combining statistics for multiple fragments). cc @rjzamora
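As a hypothetical illustration of that massaging step (plain Python, not Dask's actual code): given one such statistics dict per fragment, a consumer like Dask could derive sorted divisions for an index column roughly like this:

```python
# Hypothetical per-fragment statistics, in the {col: {'min': ..., 'max': ...}} shape above.
fragment_stats = [
    {'index': {'min': 0, 'max': 9}},
    {'index': {'min': 10, 'max': 19}},
    {'index': {'min': 20, 'max': 29}},
]

def divisions_from_stats(stats_per_fragment, column):
    """Combine per-fragment min/max into Dask-style divisions:
    the min of each fragment, followed by the max of the last one."""
    mins = [s[column]['min'] for s in stats_per_fragment]
    maxs = [s[column]['max'] for s in stats_per_fragment]
    return tuple(mins) + (maxs[-1],)

print(divisions_from_stats(fragment_stats, 'index'))  # (0, 10, 20, 29)
```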


Review comment (Member):
Do we want to expose the statistics_expression as well? (not fully sure if it would have a use case, so maybe we should only do that if we have one)

@bkietz (Member, Author) replied:

If that's desired, it can wait for a follow-up.

@fsaintjacques (Contributor) left a comment:

LGTM, clean solution.

@bkietz closed this in 83fac7a on Jun 26, 2020
@rjzamora (Contributor) commented:

Thanks for the great work here @bkietz !

This is wonderful - Dask uses the min/max statistics to calculate divisions, so this functionality is definitely necessary.

A note on other (less critical, but useful) statistics:
Dask also uses the total_byte_size statistic (for the full row group, not each column) to aggregate partitions before reading any data. There is also a plan to use the num_rows statistic when the user executes len(ddf) (to avoid loading any data). How difficult would it be to add/expose these additional row-group statistics? Again, this is much less of a blocker for initial integration with Dask, but these are likely things we will want to add eventually. cc @jorisvandenbossche
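The partition-aggregation use mentioned above can be sketched in plain Python (a hypothetical illustration, not Dask's actual code; the byte sizes and the 128 MB threshold are made up, and total_byte_size itself was not yet exposed at this point, see the JIRA below):

```python
# Hypothetical (row_group_id, total_byte_size) pairs for one file.
row_groups = [(0, 40_000_000), (1, 40_000_000), (2, 90_000_000), (3, 10_000_000)]

def aggregate_row_groups(groups, max_partition_bytes=128_000_000):
    """Greedily pack consecutive row groups into partitions that stay
    under max_partition_bytes, without reading any column data."""
    partitions, current, current_bytes = [], [], 0
    for rg_id, nbytes in groups:
        if current and current_bytes + nbytes > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(rg_id)
        current_bytes += nbytes
    if current:
        partitions.append(current)
    return partitions

print(aggregate_row_groups(row_groups))  # [[0, 1], [2, 3]]
```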

@jorisvandenbossche (Member) commented:

@rjzamora num_rows is already available on the RowGroupInfo object:

    @property
    def num_rows(self):
        return self.info.num_rows()

For the total_byte_size, can you open a JIRA for this? (It should be similar to num_rows to get / cache from the parquet row group, I think.)

@jorisvandenbossche (Member) commented:

@rjzamora I opened https://issues.apache.org/jira/browse/ARROW-9346 to track the total_byte_size suggestion
