
Consolidate statistics aggregation #8229

@alamb

Is your feature request related to a problem or challenge?

There are at least three places in DataFusion where multiple Statistics objects are aggregated together, and they do so inconsistently:

  1. get_statistics_with_limit: https://github.com/apache/arrow-datafusion/blob/e54894c39202815b14d9e7eae58f64d3a269c165/datafusion/core/src/datasource/statistics.rs#L34-L33
  2. Parquet::infer_stats: https://github.com/apache/arrow-datafusion/blob/a892300a5a56c97b5b4ddc9aa4a421aaf412d0fe/datafusion/core/src/datasource/file_format/parquet.rs#L503-L581
  3. Union::statistics: https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/physical-plan/src/union.rs#L612-L611

(and we actually have another version of this in IOx)

Describe the solution you'd like

I would like to consolidate the three implementations into a single StatisticsAggregator, both documented and well tested, that knows how to aggregate multiple Statistics objects.
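To make the idea concrete, here is a rough sketch of what such an aggregator could look like. It is not DataFusion's actual API: the `Precision` and `Statistics` types below are simplified, hypothetical stand-ins (only row count, byte size, and per-column null counts), and the method names `sum_with`, `push`, and `finish` are assumptions made for illustration. The rule that a sum is exact only when both inputs are exact, and absent when either input is absent, is likewise an assumed policy that the real implementation would need to document and test.

```rust
/// Simplified, hypothetical stand-in for a statistics value with known precision.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Sum two values. Assumed policy: exact only if both inputs are exact,
    /// absent if either input is absent, inexact otherwise.
    pub fn sum_with(self, other: Precision) -> Precision {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(a.saturating_add(b)),
            (Exact(a) | Inexact(a), Exact(b) | Inexact(b)) => Inexact(a.saturating_add(b)),
            _ => Absent,
        }
    }
}

/// Simplified, hypothetical stand-in for per-plan or per-file statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub num_rows: Precision,
    pub total_byte_size: Precision,
    /// One null count per column.
    pub null_counts: Vec<Precision>,
}

/// Accumulates any number of `Statistics` into one combined value.
#[derive(Debug, Default)]
pub struct StatisticsAggregator {
    acc: Option<Statistics>,
}

impl StatisticsAggregator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fold one more `Statistics` into the running total.
    /// Panics if the number of columns differs from earlier inputs.
    pub fn push(&mut self, stats: &Statistics) {
        self.acc = Some(match self.acc.take() {
            // The first input becomes the initial accumulator.
            None => stats.clone(),
            Some(acc) => {
                assert_eq!(
                    acc.null_counts.len(),
                    stats.null_counts.len(),
                    "column count mismatch"
                );
                Statistics {
                    num_rows: acc.num_rows.sum_with(stats.num_rows),
                    total_byte_size: acc.total_byte_size.sum_with(stats.total_byte_size),
                    null_counts: acc
                        .null_counts
                        .iter()
                        .zip(&stats.null_counts)
                        .map(|(&a, &b)| a.sum_with(b))
                        .collect(),
                }
            }
        });
    }

    /// Return the combined statistics, or `None` if nothing was pushed.
    pub fn finish(self) -> Option<Statistics> {
        self.acc
    }
}
```

With something shaped like this, each of the three call sites would create an aggregator, `push` the statistics of every file or child plan, and call `finish`, so the precision-handling rules live in one place. Min/max values and limit handling are left out of the sketch and would need their own rules.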

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)
