-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Add weights argument to sum #33310
Conversation
This will be consistent with the `weights` argument to be added to similar functions in Statistics. Since that code is needed notably to computed the weighted `mean` in Statistics, I figured it is better to make it public rather than only defining internal helpers for `mean`. This is based on code from StatsBase, of which one part needs to live in LinearAlgebra since it depends on BLAS and `dot`.
If `dims` is provided, return an array of sums over these dimensions. | ||
If `weights` is provided, return the weighted sum(s). `weights` must be | ||
either an array of the same size as `A` if `dims` is omitted, | ||
or a vector with the same length as `size(A, dims)` if `dims` is provided. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should weights
have to be an array, as opposed to just an iterable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how things work currently in StatsBase. I guess we could make this more general at least for the simple case where dims=:
. But that would require adding a separate method as @simd
only works when using indices.
I wonder whether this should be an iterator instead? i.e. you could imagine: sum(Iterators.multiply(x, weights)) |
s = add_sum(s0, s0) | ||
@inbounds @simd for i in eachindex(A, w) | ||
s += A[i] * w[i] | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be much less accurate than the non-weighted algorithm based on pairwise summation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, probably (I haven't considered this, that's just a port of the StatsBase code with some cleanup). We could improve this, but note that for BlasReal
the BLAS-optimized dot
is called instead, which has different properties anyway (not sure whether it's going to be more accurate or less).
If we want to preserve accuracy, that would be an argument in favor of this PR, since a naive broadcast
call won't use pairwise summation.
In particular, I could imagine an algorithm in which
Both of these could then be passed to the existing (It could also be made even more generic by passing a binary Seems like it would take less code than what you are doing now, would be more accurate (for arrays), and would be more composable. |
Or we could just wait for |
But with regards to adding this functionality to |
@stevengj Wouldn't |
As I wrote in the description, the main point of this PR is that we're going to add the Another reason is that the current behavior (implemented in StatsBase) differs from A final reason can be that the API from this PR allows using efficient and/or accurate algorithms under the hood, e.g. pairwise summation or BLAS-powered If we decide that the weighted sum/mean over dimensions use case isn't worth all that code, then yes, this PR isn't very useful. |
This would be handled if we can use
In principle, we can handle |
Ah, yeah, I guess we could add special methods for that. Not sure whether that's OK given that it would give different results from the general algorithm, but since increased accuracy is a good thing maybe that's not a problem. The same applies to using |
We should have this. If one wants a weighted tollerance for a cerain amount of missing data then one can right now get weighted relative confidance like:
But if you want to do this with an absolute count,
This generalizes to over dimentions.
Currently harder to write the absolute tolerance version |
My worry is that this opens the door to adding weights support everywhere. Why can't this be done externally in say a |
Well keyword arguments cannot be handled in packages unfortunately... The current design in StatsBase relies on dispatch, with methods like Also, people have often complained that they expected e.g. |
I also support the kwarg approach. |
Maybe it was already considered, but why not use |
|
theoretically, |
I disagree: it's often convenient to associate values and weights together a single time and thereafter use them as though they were just values (implicitly weighted). Using a lazy construct like |
As noted before, broadcasting wouldn't work as the weights vector is broadcast to the dimension over which reduction in performed, rather than always treated as a column vector. But |
Is this getting merged? I support using keyword arguments instead of types, in which case this would need to live here as opposed to |
For the record, something I hadn't realized when I wrote my comments above: So the wrapper approach would only be used for dispatch, and the resulting object would implement almost no methods. Each function would have to opt-in to support it as appropriate in its particular case. |
Given that we want to move Statistics out (#46501), perhaps we should close this and adopt what we need in Statistics.jl? |
This will be consistent with the
weights
argument to be added to similar functions in Statistics. Since that code is needed notably to computed the weightedmean
in Statistics, I figured it is better to make it public rather than only defining internal helpers formean
.This is based on code from StatsBase, of which one part needs to live in LinearAlgebra since it depends on BLAS and
dot
.Extracted from #31395.