Proposal to standardize basic statistical functions #10

Closed
kgryte opened this issue Jul 9, 2020 · 4 comments

Comments

@kgryte
Contributor

kgryte commented Jul 9, 2020

Based on the analysis of array library APIs, we know that basic statistical functions are both universally implemented and commonly used. Accordingly, this issue proposes to standardize the following functions:

Functions

  • mean
  • prod
  • std
  • sum
  • var
  • min (minimum)
  • max (maximum)

Criteria

  1. Commonly implemented across array libraries.
  2. Commonly used by array library consumers.
  3. Operates on one array.

Questions

  1. Are there any APIs listed above which should not be standardized?
  2. Are there basic statistical functions not listed above which should be standardized? Preferably, any additions should be supported by usage data.
  3. Should the standard recommend algorithms for increased portability? For reductions, mandating minimum precision requirements is more fraught than for the evaluation of elementary mathematical functions.
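
To illustrate why the choice of algorithm matters for reductions (question 3), here is a minimal pure-Python sketch comparing naive left-to-right accumulation with `math.fsum`, which tracks intermediate partial sums to avoid rounding loss. The input values and the `naive_sum` helper are made up for illustration; this is not a proposed algorithm for the standard.

```python
import math

# Hypothetical input: one large value followed by several small ones.
values = [1e16] + [1.0] * 10


def naive_sum(xs):
    """Accumulate left to right in double precision."""
    total = 0.0
    for x in xs:
        total += x
    return total


# Near 1e16 adjacent doubles are 2 apart, so each "+ 1.0" is rounded away.
naive = naive_sum(values)

# math.fsum keeps exact partial sums and rounds only once at the end.
accurate = math.fsum(values)

print(naive, accurate, accurate - naive)
```

Two conforming libraries could return either result for the same input unless the standard says something about accumulation strategy or precision.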
@rgommers
Member

rgommers commented Jul 9, 2020

I think it would be useful for this proposal format to also list the functions that are left out; that helps a lot in assessing the proposal. I think you have that info at hand via the categorization you used, right?

For this one, an obvious omission is median. I think it's probably good to leave it out, because Dask does not (and cannot efficiently) implement it. I think explicitly documenting such "we left out function X because of reason Y" is very useful.

@kgryte
Contributor Author

kgryte commented Jul 9, 2020

@rgommers Re: median. You're right. I left the median out of this OP because I believe it may be more difficult to standardize and to ensure universal implementation. Like other order statistics (quartiles, percentiles, and quantiles), it needs at least partial sorting, which may not be possible or desirable in a distributed context.

The above APIs are those found in the intersection of all analyzed array libraries. There may be some other statistical reductions which are relatively common and can be additional candidates for standardization, possibly discussed in a separate issue thread.
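
As a rough illustration of the distributed-context point, the sketch below (pure Python, with made-up chunks standing in for data held by separate workers) shows that a mean can be assembled from small per-chunk partial results, while combining per-chunk medians does not generally give the true median.

```python
from statistics import median

# Hypothetical data split across three workers.
chunks = [[1, 2, 9], [3, 4], [5, 6, 7, 8, 100]]

# mean: each chunk only needs to report (sum, count).
partials = [(sum(c), len(c)) for c in chunks]
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
combined_mean = total / count

# median: the exact answer needs the globally ordered data.
flat = [v for c in chunks for v in c]
true_median = median(flat)

# Combining per-chunk medians is not equivalent in general.
median_of_medians = median([median(c) for c in chunks])

print(combined_mean, true_median, median_of_medians)
```

The same (partial result, combine) pattern works for `sum`, `prod`, `min`, and `max`, and with a little more bookkeeping for `var` and `std`.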

@kgryte
Contributor Author

kgryte commented Jul 20, 2020

I compiled generalized signatures for the basic statistical functions listed above, one per library; the raw signature data can be found here.

NumPy

numpy.<name>(a, axis=None, dtype=None, out=None, keepdims=<no value>) → ndarray

CuPy

cupy.<name>(a, axis=None, dtype=None, out=None, keepdims=False) → ndarray

dask.array

dask.array.<name>(a, axis=None, dtype=None, keepdims=False, split_every=None, out=None) → ndarray

JAX

jax.numpy.<name>(a, axis=None, dtype=None, out=None, keepdims=False) → ndarray

MXNet

np.<name>(a, axis=None, dtype=None, out=None, keepdims=False) → ndarray

PyTorch

torch.<name>(input, dim, keepdim=False, out=None) → Tensor

TensorFlow

tf.math.reduce_<name>(input_tensor, axis=None, keepdims=False, name=None) → Tensor

The minimum common API across most libraries is

<name>(x, axis=None, keepdims=False, out=None)

For example,

mean(x, axis=None, keepdims=False, out=None)

Proposal

Signature of the form:

<name>(x, *, axis=None, keepdims=False, out=None)

APIs:

mean(x, *, axis=None, keepdims=False, out=None)
var(x, *, axis=None, keepdims=False, out=None)
std(x, *, axis=None, keepdims=False, out=None)
sum(x, *, axis=None, keepdims=False, out=None)
prod(x, *, axis=None, keepdims=False, out=None)
min(x, *, axis=None, keepdims=False, out=None)
max(x, *, axis=None, keepdims=False, out=None)
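
A minimal sketch of what the proposed signature could look like in practice, assuming NumPy as the backing library. The wrapper below is hypothetical and only delegates to `numpy.mean`; it is not part of any library.

```python
import numpy as np


def mean(x, *, axis=None, keepdims=False, out=None):
    """Hypothetical wrapper with the proposed keyword-only signature."""
    return np.mean(x, axis=axis, keepdims=keepdims, out=out)


x = np.array([[1.0, 2.0], [3.0, 4.0]])

col_means = mean(x, axis=0)            # reduce over rows, shape (2,)
kept = mean(x, axis=1, keepdims=True)  # reduced axis kept, shape (2, 1)

# Passing an optional argument positionally is rejected by the `*` marker.
try:
    mean(x, 0)
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False

print(col_means, kept.shape, positional_rejected)
```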

Notes

  • A single generic signature does not fit every statistical function, as certain functions may benefit from additional parameters. For example,

    • an initial value when computing the sum/product

      numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>) → ndarray
      
    • a correction factor when computing the variance

      numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>) → ndarray
      

    Accordingly, we'll likely want to investigate slightly bespoke interfaces for each function, but the above should provide a common starting point.

  • Optional arguments are keyword-only for the following reasons:

    1. Avoid potential positional variation amongst library implementations.
    2. Favor explicit interfaces and minimize readers' need to intuit an optional positional argument's meaning.
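
To make the function-specific parameters in the notes above concrete, here is a short sketch using NumPy's existing `initial` and `ddof` keywords. The array values are made up for illustration, and nothing here implies how a standardized API would spell these options.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# `initial` seeds the accumulator, which also defines the result for empty input.
total = np.sum(a, initial=10.0)
empty_total = np.sum(np.array([]), initial=5.0)

# `ddof` is the correction subtracted from N in the variance denominator.
population_var = np.var(a)       # divides by N
sample_var = np.var(a, ddof=1)   # divides by N - 1

print(total, empty_total, population_var, sample_var)
```

Neither keyword makes sense for every reduction (`ddof` is meaningless for `sum`, `initial` for `var`), which is the argument for per-function interfaces on top of the common signature.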

Questions

  1. Other keyword arguments?
  2. Function-specific keyword arguments?

@kgryte kgryte changed the title Proposal to standardize basic statistical reductions Proposal to standardize basic statistical functions Aug 17, 2020
@rgommers
Member

This was done, closing.
