Conversation

@bkietz (Member) commented Jun 25, 2020

stats = parquet_fragment.row_groups[0].statistics
assert stats == {
  'normal_column': {'min': 1, 'max': 2},
  'all_null_column': {'min': None, 'max': None},
  'column_without_stats': None,
}

@jorisvandenbossche (Member) left a comment:

Thanks, looks perfect!

The shape is fine, I think; a dict of {col: {min: val, max: val}} seems the most logical and general structure to store it. In the end, Dask will need to massage it into whatever structure it needs anyway (combining statistics for multiple fragments). cc @rjzamora
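As a hypothetical illustration of that massaging step (plain Python, not Dask's actual code): given one such statistics dict per fragment, a consumer like Dask could derive sorted divisions for an index column roughly like this:

```python
# Hypothetical per-fragment statistics, in the {col: {'min': ..., 'max': ...}} shape above.
fragment_stats = [
    {'index': {'min': 0, 'max': 9}},
    {'index': {'min': 10, 'max': 19}},
    {'index': {'min': 20, 'max': 29}},
]

def divisions_from_stats(stats_per_fragment, column):
    """Combine per-fragment min/max into Dask-style divisions:
    the min of each fragment, followed by the max of the last one."""
    mins = [s[column]['min'] for s in stats_per_fragment]
    maxs = [s[column]['max'] for s in stats_per_fragment]
    return tuple(mins) + (maxs[-1],)

print(divisions_from_stats(fragment_stats, 'index'))  # (0, 10, 20, 29)
```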


Review comment (Member):
Do we want to expose the statistics_expression as well? (not fully sure if it would have a use case, so maybe we should only do that if we have one)

@bkietz (Member, Author) replied:

If that's desired, it can wait for a follow-up.

@fsaintjacques (Contributor) left a comment:

LGTM, clean solution.

@bkietz closed this in 83fac7a on Jun 26, 2020
@rjzamora (Contributor) commented:

Thanks for the great work here @bkietz !

This is wonderful - Dask uses the min/max statistics to calculate divisions, so this functionality is definitely necessary.

A note on other (less critical, but useful) statistics:
Dask also uses the total_byte_size statistic (for the full row group, not each column) to aggregate partitions before reading any data. There is also a plan to use the num_rows statistic when the user executes len(ddf) (to avoid loading any data). How difficult would it be to add/expose these additional row-group statistics? Again, this is much less of a blocker for initial integration with Dask, but these are likely things we will want to add eventually. cc @jorisvandenbossche
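The partition-aggregation use mentioned above can be sketched in plain Python (a hypothetical illustration, not Dask's actual code; the byte sizes and the 128 MB threshold are made up, and total_byte_size itself was not yet exposed at this point, see the JIRA below):

```python
# Hypothetical (row_group_id, total_byte_size) pairs for one file.
row_groups = [(0, 40_000_000), (1, 40_000_000), (2, 90_000_000), (3, 10_000_000)]

def aggregate_row_groups(groups, max_partition_bytes=128_000_000):
    """Greedily pack consecutive row groups into partitions that stay
    under max_partition_bytes, without reading any column data."""
    partitions, current, current_bytes = [], [], 0
    for rg_id, nbytes in groups:
        if current and current_bytes + nbytes > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(rg_id)
        current_bytes += nbytes
    if current:
        partitions.append(current)
    return partitions

print(aggregate_row_groups(row_groups))  # [[0, 1], [2, 3]]
```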

@jorisvandenbossche (Member) commented:

@rjzamora num_rows is already available on the RowGroupInfo object:

    @property
    def num_rows(self):
        return self.info.num_rows()

For the total_byte_size, can you open a JIRA for this? (It should be similar to num_rows to get / cache from the parquet row group, I think.)

@jorisvandenbossche (Member) commented:

@rjzamora I opened https://issues.apache.org/jira/browse/ARROW-9346 to track the total_byte_size suggestion
