Listed below are the reductions over numerical types defined in pandas. These can be applied:

- To a `Series`
- To N columns of a `DataFrame`
- To group by operations
- As window functions (window, rolling, expanding or ewm)
- In resample operations
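
As a point of reference, the sketch below (using current pandas, with purely illustrative data) invokes the same reduction, `sum`, from each of these contexts:

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "x": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

df["x"].sum()                 # Series reduction -> scalar
df[["x"]].sum()               # reduction over the columns of a DataFrame -> Series
df.groupby("key")["x"].sum()  # group by reduction -> one value per group
df["x"].rolling(2).sum()      # window (rolling) reduction -> Series
df["x"].expanding().sum()     # expanding window reduction -> Series
df["x"].resample("2D").sum()  # resample reduction -> Series
```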
pandas is not consistent in letting every reduction be applied to each of the above. Each method is independent (`Series.sum`, `GroupBy.sum`, `Window.sum`, ...). Some reductions are not implemented for some of the classes, and the signatures can differ (e.g. `Series.var(ddof)` vs `EWM.var(bias)`).
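
For example, in current pandas:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

s.var(ddof=0)                    # Series.var exposes delta degrees of freedom
s.ewm(alpha=0.5).var(bias=True)  # the EWM variance uses a different parameter, bias
```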
I propose to have standard signatures for the reductions, and to make all reductions available to all classes.
## Reductions for numerical data types and proposed signatures
```python
all()
any()
count()
nunique()  # maybe the name could be count_unique, count_distinct...?
mode()     # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar
min()
max()
median()
quantile(q, interpolation='linear')  # in pandas q defaults to 0.5, but I think it's better to require it; interpolation can be {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
sum()
prod()
mean()
var(ddof=1)  # delta degrees of freedom (for some classes bias is used)
std(ddof=1)
skew()
kurt()       # pandas also has the alias kurtosis
sem(ddof=1)  # standard error of the mean
mad()        # mean absolute deviation
autocorr(lag=1)
is_unique()                # in pandas this is a property
is_monotonic()             # in pandas this is a property
is_monotonic_decreasing()  # in pandas this is a property
is_monotonic_increasing()  # in pandas this is a property
```
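
Regarding the `mode()` comment above, current pandas already returns a Series rather than a scalar when several values tie:

```python
import pandas as pd

# 1 and 2 both appear twice, so mode() returns both of them
pd.Series([1, 1, 2, 2, 3]).mode()
# 0    1
# 1    2
# dtype: int64
```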
Reductions that may depend on row labels (and could potentially return a list, like `mode`):

- `idxmax()` / `argmax()`
- `idxmin()` / `argmin()`
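
A quick illustration with current pandas of why these depend on row labels:

```python
import pandas as pd

s = pd.Series([10, 30, 20], index=["a", "b", "c"])

s.idxmax()  # 'b' -- the row label of the maximum
s.argmax()  # 1   -- the integer position, independent of labels
```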
These need an extra column `other`:

```python
cov(other, ddof=1)
corr(other, method='pearson')  # method can be {'pearson', 'kendall', 'spearman'}
```
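
Usage with current pandas looks like this:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0])

x.cov(y)                      # covariance with another column/Series
x.corr(y, method="spearman")  # correlation; method selects the estimator
```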
## Questions

- Allow reductions over rows, or only over columns?
- What to do with NA?
- pandas has parameters (`bool_only`, `numeric_only`) to apply the operation only over columns of certain types. Do we want them?
  - I think something like `df.select_columns_by_dtype(int).sum()` would be preferable to a parameter on all or some reductions (see the sketch after this list)
- pandas has a `level` parameter in many reductions, for MultiIndex. If indexing/MultiIndexing is part of the API, do we want to have it?
- pandas has a `min_count`/`min_periods` parameter in some reductions (e.g. `sum`, `min`), to return `NA` if fewer than `min_count` values are present. Do we want to keep it?
- How should reductions be applied?
  - In the top-level namespace, as in pandas (e.g. `df[col].sum()`)
  - Using an accessor (e.g. `df[col].reduce.sum()`)
  - Having a `reduce` function, and passing the specific functions as a parameter (e.g. `df[col].reduce(sum)`)
  - Other ideas
- Would it make sense to have a third-party package implementing reductions that can be reused by projects?
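
For reference, a small sketch of how two of the points above look in current pandas (`select_columns_by_dtype` is only a proposed name and does not exist there; `select_dtypes` is the closest existing method):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, None], "n": [1, 2, 3], "name": ["a", "b", "c"]})

# restricting to certain dtypes: a parameter on the reduction vs. selecting first
df.sum(numeric_only=True)                 # parameter on the reduction itself
df.select_dtypes(include="number").sum()  # explicit selection, then a plain reduction

# min_count: an all-NA (or empty) sum is 0 by default, and NA when min_count is not met
pd.Series([float("nan")]).sum()             # 0.0
pd.Series([float("nan")]).sum(min_count=1)  # nan
```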
## Frequency of usage