Related
- Focused on introducing expressions in group_by(...).
- General
- feat: Fully spec TableGroupBy.aggregate (zen-xu/pyarrow-stubs#197)
- feat: Fully spec TableGroupBy.aggregate (zen-xu/pyarrow-stubs#197 (comment))
Description
While working on (zen-xu/pyarrow-stubs#197), I learned some new things about pyarrow.Table.group_by.aggregate.
We could allow a wider range of functions
Currently we allow 10 different Expr aggregation methods:
narwhals/narwhals/_compliant/group_by.py
Line 52 in 1642744
"sum", "mean", "median", "max", "min", "std", "var", "len", "n_unique", "count"
For pyarrow, these map onto only 9 of the 21 native functions that are allowed in the native group-by context (both "len" and "count" map to the native "count").
9 native used
How are they mapped over?
narwhals/narwhals/_arrow/group_by.py
Lines 29 to 41 in 1642744
class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr"]):
    _REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Any]] = {
        "sum": "sum",
        "mean": "mean",
        "median": "approximate_median",
        "max": "max",
        "min": "min",
        "std": "stddev",
        "var": "variance",
        "len": "count",
        "n_unique": "count_distinct",
        "count": "count",
    }
Some also accept options for a small amount of flexibility.
We already seem to be using pc.CountOptions and pc.VarianceOptions, which is good 🙂.
But pc.ScalarAggregateOptions and pc.TDigestOptions may also be useful.
Projection
Now this was where it got interesting for me.
The docs provide an "annotation":
aggregations type?
class TableGroupBy:
    def aggregate(self, aggregations):
        """
        Perform an aggregation over the grouped columns of the table.

        Parameters
        ----------
        aggregations : list[tuple(str, str)] or \
            list[tuple(str, str, FunctionOptions)]
        """

Which we have been following in:
narwhals/narwhals/_arrow/group_by.py
Line 60 in 1642744
aggs: list[tuple[str, str, Any]] = []
However, the description in the docs immediately contradicts that by stating:
The column name can be a string, an empty list or a list of column names
Meaning instead of a single str column, we can do:
UnarySelector: TypeAlias = str # <---------------- We only use this at the moment
NullarySelector: TypeAlias = tuple[()]
NarySelector: TypeAlias = list[str] | tuple[str, ...]
ColumnSelector: TypeAlias = UnarySelector | NullarySelector | NarySelector
Summary
Taken together, I'm hoping this could mean:
- We don't need to consider as many cases (for pyarrow) as complex
- Expressions that expand to multiple outputs can be performed more efficiently (natively)