
ENH: a simple cov #10

Merged: 1 commit merged into data-apis:main on Oct 26, 2024
Conversation

lucascolley (Collaborator):
This would replace a private utility in SciPy.

@lucascolley lucascolley changed the title ENH: a simple cov ENH: a simple cov and complex mean Sep 25, 2024
@lucascolley lucascolley added enhancement New feature or request new function labels Sep 25, 2024
@@ -46,3 +47,48 @@ def atleast_nd(x: Array, *, ndim: int, xp: ModuleType) -> Array:
x = xp.expand_dims(x, axis=0)
x = atleast_nd(x, ndim=ndim, xp=xp)
return x


def cov(x: Array, *, xp: ModuleType) -> Array:
kgryte commented on `def cov`:
While this implementation and API are constrained enough, I would exercise caution when adding such APIs, to ensure that the design of whatever is included in this library is consistent with Array API design principles and does not impose limitations on future standardization.

While not applicable in this specific instance, I would like to avoid a scenario where array_api_extra provides a backdoor to de facto standardized APIs which have design choices that we wouldn't want to endorse were we standardizing directly in the standard. Meaning, I'd like to avoid getting backed into a corner.

In this case, the particular design aspect which comes to mind is the behavior of a correction argument, similar to what we have in std and var.

In general, my current thinking is that, so long as this is under the Consortium umbrella, potential APIs need to go through a somewhat similar "standardization" proposal and inclusion process. That means comparative library analysis, discussion of design choices, and whether inclusion presents an opportunity to address limitations in prior art.

Reply (Member):

This is a very good point, and we need to think more about API stability here. At the same time, this could to some degree defeat the purpose of the library if we have to do a full standardization process for every function. I wonder if there's a way we could organize things here so that API stability isn't as big of a concern. For instance, we could make it so that array-api-extra is only supported as a vendored library (or require people to just copy-paste the implementations). Or we could organize the code in a way so that different versions of APIs are possible.

lucascolley (Collaborator, Author) commented Sep 25, 2024:

Thanks for the feedback @kgryte ! Agreed that we want to avoid creating de facto standardized APIs here with less care than would be put into the standard. I can see how it would be bad to guide consumer libraries towards whatever APIs are here, just to tell them that they need to change again when a more considered API is standardized.

I do think that this library should still be able to move more quickly than the standard though - the primary motivation being providing functionality to consumer libraries via some reasonable API, rather than convincing array libraries that this is the "correct" API (there is no need since everything is implemented in terms of the standard). Having a non-ideal API here is (arguably) still better than making consumer library maintainers search through private modules of other libraries like SciPy or implement the function themselves, provided that we can sufficiently communicate that an API being a part of this library does not mean that it is standardized.

If we were to go through a very similar process to the standard, and the result was consensus on an API, then it should probably just be added to the standard itself!

I think one way we could try to avoid creating de facto standardized APIs is being quite liberal in adding kwargs to vary behaviour when there are competing ideas (i.e. competing use-cases in consumer libraries). Then, while functions live in this library, consumer libraries can pick and choose the functionality they need without any particular mode being "standard". If at some point some "extra" functionality is considered for standardization, then the normal process would apply, but with the additional information of how the API from this library has been used.

> That means comparative library analysis, discussion of design choices, and whether inclusion presents an opportunity to address limitations in prior art.

This would be fantastic to see, but given that I'm heading back to university soon, I am unlikely to be able to contribute the necessary time to carry out that process rigorously for the foreseeable future. So development here may be quite slow. Hopefully you can get others to fill that role while I'm too busy! I do think that this library will be an important part of making adoption of the standard easier for consumer libraries.

EDIT: The same point about time applies to other potential contributors. I think we may be able to create some nice collaboration where consumer libraries take from and give back to xpx by writing a function or two, which in turn will help other consumer libraries. I think we are less likely to convince consumer library maintainers to enter discussions about an ideal API (they may well not really care about the API at all, they just need something that works).

Reply:

> I think one way we could try to avoid creating de facto standardized APIs is being quite liberal in adding kwargs to vary behaviour when there are competing ideas (i.e. competing use-cases in consumer libraries).

I'd go the opposite way: namely, being very conservative in terms of what behavior is supported.

For some background on Array API "extras", this idea has been discussed during workgroup meetings at various points for different reasons. One of those reasons is a potential place for APIs which have broad, but not ubiquitous, consensus among ecosystem libraries. E.g., a number of ecosystem libraries have other special functions, such as gamma, erf, beta, etc. Some of these are unlikely to be standardized directly in the standard due to the NumPy/SciPy split. Other libraries have APIs associated with deep learning applications (e.g., various loss functions), but these are unlikely to ever be directly added to more general purpose libraries such as NumPy. "Extras" was batted around as a way to provide consensus APIs outside of the standard, provided such APIs could either be implemented in terms of the standard or delegated to libraries exposing a compliant API.

With this in mind, similar to the standard, we'd be best served to be biased toward parsimony, not abundance.


This stated, this isn't to say that "extras" cannot have non-consensus APIs (i.e., APIs which have no precedent in array libraries). Those can be in scope. However, for APIs which do have precedent in array libraries, IMO, we should tread a bit more carefully to ensure what's implemented is aligned behavior.

lucascolley (Collaborator, Author) commented Sep 28, 2024:

cov

This PR implements a small subset of `numpy.cov`, namely `cov(m)`, to replace a private helper in SciPy.

  • NumPy, JAX and PyTorch all agree on the same default behaviour for single-argument cov (unbiased estimate)
  • No particular design choices for this API.
  • As mentioned above by @kgryte, if this function is extended to support more arguments of the existing implementations, there is potential to address some prior limitations of the NumPy API by using a `correction` argument, as PyTorch and the standard's other statistical functions do. The need for that hasn't arisen yet, though, so it isn't attempted here.
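The single-argument subset described above can be sketched directly in terms of standard operations. This is an editor's illustration, not the PR's actual implementation; it assumes NumPy as the backing namespace and 2-D (or 1-D) input with rows as variables and columns as observations:

```python
import numpy as np


def cov_sketch(x: np.ndarray, *, xp=np) -> np.ndarray:
    """Single-argument covariance matching the NumPy/JAX/PyTorch default.

    Rows of ``x`` are variables, columns are observations; the unbiased
    N - 1 normalization is applied.
    """
    if x.ndim == 1:
        # One variable observed N times -> promote to shape (1, N).
        x = x[None, :]
    n_obs = x.shape[1]
    # Center each variable, then form the outer product of deviations.
    centered = x - xp.mean(x, axis=1, keepdims=True)
    return (centered @ xp.conj(centered).T) / (n_obs - 1)
```

Note that for 1-D input this sketch returns a 1x1 matrix where `np.cov` returns a 0-d array; the real `xpx.cov` may differ in such details.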

Good to proceed? EDIT: edited to make mean private.

rgommers (Member):
Why add mean? It seems from data-apis/array-api#846 that it's only an oversight that complex support wasn't added, and that can be done. I would just finish that discussion, and then likely nothing else has to change (except perhaps deleting a check in array-api-compat). Moving usages of mean over to this package and then back a little later sounds like quite a bit of churn for no real reason.

lucascolley (Collaborator, Author):
We could keep complex mean private here until there is a reported use case for it, and remove it once it is supported in array-api-compat - I think that sounds like a better idea than exposing it just because we can.

rgommers (Member):

That sounds good to me.

@lucascolley lucascolley changed the title ENH: a simple cov and complex mean ENH: a simple cov Sep 29, 2024
@lucascolley lucascolley marked this pull request as ready for review September 29, 2024 14:48
asmeurer (Member):

> and then likely nothing else has to change (except perhaps deleting a check in array-api-compat)

I guess you mean array-api-strict. Complex mean already works with the upstream libraries (at least NumPy and PyTorch, I didn't check the others), and will work with array-api-compat.

Also, since array-api-strict now supports multiple versions, we can easily add a preliminary 2024.12 version and disable the complex check in mean for that version, even before that version of the standard is released.
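As a quick spot-check of the claim that complex mean already works upstream (NumPy shown here; other libraries would need their own verification):

```python
import numpy as np

# Complex mean needs no special handling in NumPy: the real and
# imaginary parts are averaged component-wise.
z = np.array([1 + 1j, 3 - 1j])
print(np.mean(z))  # (2+0j)
```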

TST: cov: add another test

DOC: cov, mean: WIP docstrings

API: make `mean` private

DOC: cov: finish docstring
lucascolley (Collaborator, Author) left a comment:

this is ready to go in, I think

@lucascolley lucascolley merged commit 548103c into data-apis:main Oct 26, 2024
7 checks passed
kgryte commented Oct 26, 2024:

> NumPy, JAX and PyTorch all agree on the same default behaviour for single-argument cov (unbiased estimate)

FWIW, TensorFlow disagrees here.

This aside, it is somewhat unfortunate that NumPy's default behavior for cov (`bias=False`) diverges from the default behavior of var and std. The `bias` argument appears to be legacy, having subsequently been supplanted by `ddof`. If we were designing from first principles and considering adding to the standard, I'd argue for being consistent with existing precedent in the standard rather than having diverging conventions, especially as N - 1, while common, is by no means the universally appropriate normalization factor; my inclination would be to force users to be explicit about the normalization factor they believe suitable.
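The divergence in defaults described here is easy to demonstrate with plain NumPy (an editor's illustration, not part of the PR):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# np.cov defaults to the unbiased N-1 normalization (the legacy bias=False)...
print(float(np.cov(x)))   # 1.666... (divides by N - 1 = 3)

# ...while np.var (and np.std) default to the biased N normalization (ddof=0).
print(np.var(x))          # 1.25 (divides by N = 4)

# The modern spelling makes the choice explicit and brings the two in line:
print(np.var(x, ddof=1))  # 1.666...
```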

rgommers (Member):

> If we were designing from first principles and considering adding to the standard, I'd argue for being consistent with existing precedent in the standard

But we aren't here, so I think matching what NumPy, JAX and PyTorch do was fine for this PR. Making changes just makes it harder to use, so let's not cross that bridge until we have to for standardization (which may well be never).

kgryte commented Oct 27, 2024:

> But we aren't here... Making changes just makes it harder to use

I'm not sure that is the case. To my knowledge, array-api-extra is not restricted to being a compat layer; as an independent library, it is free to make design choices which (mildly, in this case) diverge from existing libraries such as NumPy that carry historical context.

The notion of "harder to use" is somewhat in the eye of the beholder here. Yes, following existing NumPy practice (based on a `bias` boolean kwarg which would be highly unlikely to be standardized, given the standardized var/std convention) would mean that downstream libraries such as SciPy could simply swap np.cov for xpx.cov. But it also means persisting the cognitive burden on future code readers of recalling that default conventions differ between the univariate and multivariate cases. Personally, I place greater value on the latter than the former, and more so in this case, as array-api-extra has some freedom to evolve the story a bit.

@lucascolley lucascolley deleted the cov branch October 27, 2024 07:57
rgommers commented Nov 3, 2024:

> but this also means persisting the additional cognitive burden among future code readers in recalling that default conventions differ between the univariate and multivariate cases

I don't think this is an extra burden; I'm reasonably confident that matching cov in other libraries is, on balance, more important and less confusing than matching std/var would be. You'd be right for a reader who knows the math, doesn't know or use any other array library, and comes at this fresh using only the standard and array-api-extra, but that will be rare.

@lucascolley lucascolley mentioned this pull request Nov 5, 2024