Reductions #11

Open
@datapythonista

Listed below are the reductions over numerical types defined in pandas. They can be applied:

  • To Series
  • To N columns of a DataFrame
  • To group by operations
  • As window functions (window, rolling, expanding or ewm)
  • In resample operations
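For a reduction that happens to be available everywhere, such as sum, the contexts listed above look roughly like this in current pandas. This is only an illustrative sketch: the frame, the column names (key, x, y) and the values are made up for the example.

```python
import pandas as pd

# Made-up data: six daily observations with a grouping key.
index = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b", "a", "b"],
        "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    },
    index=index,
)

# To a Series
series_total = df["x"].sum()                      # 21.0

# To N columns of a DataFrame
column_totals = df[["x", "y"]].sum()              # x: 21.0, y: 210.0

# To a group by operation
group_totals = df.groupby("key")["x"].sum()       # a: 8.0, b: 13.0

# As window functions
rolling_totals = df["x"].rolling(window=2).sum()  # NaN, 3, 5, 7, 9, 11
expanding_totals = df["x"].expanding().sum()      # 1, 3, 6, 10, 15, 21

# In a resample operation (two-day bins)
resampled_totals = df["x"].resample("2D").sum()   # 3, 7, 11
```

Each call goes through a different class with its own sum method, which is where the inconsistencies can creep in.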

pandas is not consistent about which reductions can be applied in each of the contexts above. Each method is
implemented independently (Series.sum, GroupBy.sum, Window.sum...), some reductions are not implemented for
some of the classes, and the signatures can differ (e.g. Series.var(ddof) vs EWM.var(bias)).
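The var example can be made concrete with a short sketch. The data and the alpha value are arbitrary choices for illustration; the point is only that the same statistical choice (biased vs unbiased estimate) is spelled with two different parameters.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 7.0])

# Series.var takes `ddof` (delta degrees of freedom).
sample_var = s.var(ddof=1)      # unbiased estimate: 7.0
population_var = s.var(ddof=0)  # biased estimate: 5.25

# The exponentially weighted variance takes a boolean `bias` instead.
ewm_unbiased = s.ewm(alpha=0.5).var(bias=False)
ewm_biased = s.ewm(alpha=0.5).var(bias=True)
```

With a single observation the unbiased EWM variance is undefined (NaN) while the biased one is 0, mirroring the difference between ddof=1 and ddof=0.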

I propose to have standard signatures for the reductions, and have all reductions available to all classes.

Reductions for numerical data types and proposed signatures

  • all()
  • any()
  • count()
  • nunique() # maybe the name could be count_unique, count_distinct...?
  • mode() # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar
  • min()
  • max()
  • median()
  • quantile(q, interpolation='linear') # in pandas q is by default 0.5, but I think it's better to require it; interpolation can be {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}
  • sum()
  • prod()
  • mean()
  • var(ddof=1) # delta degrees of freedom (for some classes bias is used)
  • std(ddof=1)
  • skew()
  • kurt() # pandas also has the alias kurtosis
  • sem(ddof=1) # standard error of the mean
  • mad() # mean absolute deviation
  • autocorr(lag=1)
  • is_unique() # a property in pandas
  • is_monotonic() # a property in pandas
  • is_monotonic_decreasing() # a property in pandas
  • is_monotonic_increasing() # a property in pandas
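A few of the comments in the list above can be illustrated with current pandas behaviour. This is a small sketch with made-up values: it shows mode returning more than one value, quantile with an explicit q and different interpolation settings, and the is_* reductions being properties rather than methods.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])

# mode() returns a Series, because there can be several modes.
modes = s.mode().tolist()                 # [1, 2]

# In pandas these are properties (no call parentheses).
unique_flag = s.is_unique                 # False, values repeat
increasing_flag = s.is_monotonic_increasing  # True, non-strict ordering

# quantile with q passed explicitly; the median of [1, 2, 3, 4]
# falls between 2 and 3, so interpolation decides the result.
t = pd.Series([1, 2, 3, 4])
q_linear = t.quantile(0.5, interpolation="linear")      # 2.5
q_lower = t.quantile(0.5, interpolation="lower")        # 2
q_higher = t.quantile(0.5, interpolation="higher")      # 3
q_midpoint = t.quantile(0.5, interpolation="midpoint")  # 2.5
```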

Reductions that may depend on row labels (and could potentially return a list, like mode):

  • idxmax() / argmax()
  • idxmin() / argmin()

These need an extra column, other:

  • cov(other, ddof=1)
  • corr(other, method='pearson') # method can be {‘pearson’, ‘kendall’, ‘spearman’}
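A minimal sketch of the two-column reductions as they exist on Series today, with made-up data where the second column is exactly twice the first. It assumes a pandas version in which Series.cov accepts ddof.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
other = pd.Series([2.0, 4.0, 6.0, 8.0])  # other = 2 * x

# Sample covariance: sum of deviation products (10) divided by n - 1 (3).
covariance = x.cov(other, ddof=1)

# Pearson correlation is 1 for a perfectly linear relationship.
correlation = x.corr(other, method="pearson")
```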

Questions

  • Allow reductions over rows, or only over columns?
  • What to do with NA?
  • pandas has parameters (bool_only, numeric_only) that restrict the operation to columns of certain types. Do we want them?
    • I think something like df.select_columns_by_dtype(int).sum() would be preferable to adding a parameter to all or some reductions
  • pandas has a level parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?
  • pandas has a min_count/min_periods parameter in some reductions (e.g. sum, min), to return NA if fewer than min_count values are present. Do we want to keep it?
  • How should reductions be applied?
    • In the top-level namespace, as in pandas (e.g. df[col].sum())
    • Using an accessor (e.g. df[col].reduce.sum())
    • Having a reduce function, and passing the specific functions as a parameter (e.g. df[col].reduce(sum))
    • Other ideas
  • Would it make sense to have a third-party package implementing reductions that can be reused by projects?
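To ground some of the questions above, here is a sketch of how current pandas handles NA and min_count, how select_dtypes (the existing pandas method closest to the hypothetical select_columns_by_dtype) can replace numeric_only, and what the accessor and reduce-function styles could look like. The ReduceAccessor class and its registration under the name reduce are assumptions made purely for illustration; they are not part of pandas or of any proposal text.

```python
import numpy as np
import pandas as pd

# NA handling: NA is skipped by default, or propagated with skipna=False.
s = pd.Series([1.0, np.nan, 3.0])
total_skipping_na = s.sum()                  # 4.0
total_propagating = s.sum(skipna=False)      # nan

# min_count: an all-NA sum is 0.0 unless a minimum of valid values is required.
all_na = pd.Series([np.nan])
empty_total = all_na.sum()                   # 0.0
guarded_total = all_na.sum(min_count=1)      # nan

# Selecting columns by dtype first, instead of a numeric_only parameter.
df = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5], "s": ["a", "b"]})
numeric_totals = df.select_dtypes(include="number").sum()


# Hypothetical accessor, registered only for this example.
@pd.api.extensions.register_series_accessor("reduce")
class ReduceAccessor:
    def __init__(self, series):
        self._series = series

    def sum(self):
        # Accessor style: df[col].reduce.sum()
        return self._series.sum()

    def __call__(self, func):
        # Function style: df[col].reduce(sum)
        return func(self._series)


top_level_total = df["f"].sum()         # top-level namespace, as in pandas
accessor_total = df["f"].reduce.sum()   # accessor
function_total = df["f"].reduce(sum)    # reduce function with a callable
```

All three invocation styles give the same result here; the difference is only in where the reductions live in the namespace and how easy they are to extend.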

Frequency of usage

[Image: pandas_reductions]
