[enh]: More permissive group_by().agg(...) for pyarrow #2385

@dangotbanned

Related

Description

While working on zen-xu/pyarrow-stubs#197, I learned some new things about pyarrow.Table.group_by(...).aggregate(...).

We could allow a wider range of functions

Currently we allow 10 different Expr aggregation methods:

"sum", "mean", "median", "max", "min", "std", "var", "len", "n_unique", "count"

For pyarrow, these map onto only 9 of the 21 native functions that are allowed in this context.

9 native used

How are they mapped over?

class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr"]):
    _REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Any]] = {
        "sum": "sum",
        "mean": "mean",
        "median": "approximate_median",
        "max": "max",
        "min": "min",
        "std": "stddev",
        "var": "variance",
        "len": "count",
        "n_unique": "count_distinct",
        "count": "count",
    }
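To make the mapping concrete, here is a minimal sketch of how the names could be translated into the (column, function) tuples that pyarrow expects. `to_pyarrow_aggs` and the `"price"` column are hypothetical names for illustration only, not the actual Narwhals implementation.

```python
from collections.abc import Mapping

# Copy of the mapping quoted above (Narwhals name -> pyarrow function name).
REMAP_AGGS: Mapping[str, str] = {
    "sum": "sum",
    "mean": "mean",
    "median": "approximate_median",
    "max": "max",
    "min": "min",
    "std": "stddev",
    "var": "variance",
    "len": "count",
    "n_unique": "count_distinct",
    "count": "count",
}


def to_pyarrow_aggs(column: str, narwhals_aggs: list[str]) -> list[tuple[str, str]]:
    """Hypothetical helper: build (column, function) tuples for one column."""
    return [(column, REMAP_AGGS[name]) for name in narwhals_aggs]


aggs = to_pyarrow_aggs("price", ["median", "std", "n_unique"])
print(aggs)
# [('price', 'approximate_median'), ('price', 'stddev'), ('price', 'count_distinct')]

# Distinct native names actually reached by the mapping.
print(len(set(REMAP_AGGS.values())))
# 9
```

Note that "len" and "count" both land on the native "count", so something else (presumably the options) has to tell them apart.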

Some also accept options for a small amount of flexibility.
We seem to be using pc.CountOptions and pc.VarianceOptions which is good 🙂.
But pc.ScalarAggregateOptions and pc.TDigestOptions may also be useful.

Projection

Now this was where it got interesting for me.
The docs provide an "annotation":

aggregations type?

class TableGroupBy:
    def aggregate(self, aggregations):
        """
        Perform an aggregation over the grouped columns of the table.

        Parameters
        ----------
        aggregations : list[tuple(str, str)] or \
list[tuple(str, str, FunctionOptions)]
        """

Which we have been following in:

aggs: list[tuple[str, str, Any]] = []

However, the description in the docs immediately contradicts that by stating:

The column name can be a string, an empty list or a list of column names

Meaning instead of a single str column, we can do:

Updated stubs

UnarySelector: TypeAlias = str # <---------------- We only use this at the moment
NullarySelector: TypeAlias = tuple[()]
NarySelector: TypeAlias = list[str] | tuple[str, ...]

ColumnSelector: TypeAlias = UnarySelector | NullarySelector | NarySelector

Summary

Taken together, I'm hoping this could mean:

  1. We don't need to consider as many cases (for pyarrow) as complex
  2. Expressions that expand to multiple outputs can be performed more efficiently (natively)
