sum does not use K-B-N summation for >1D arrays #1258
I believe we decided to use naive summation by default. Reasons:
|
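For readers less familiar with the terminology, naive summation here just means a plain left-to-right accumulation. A minimal sketch in Julia (the name `naive_sum` is illustrative, not Base's implementation):

```julia
# Naive left-to-right summation: one rounding error per addition, so the
# worst-case error grows roughly linearly with the number of elements.
function naive_sum(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end
```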
Using naive summation by default is fine, although it would still be nice to have a way to get more accurate summation. My main real interest in this is that I often work with probability vectors with 20,000 values, and with naive summation, those may not sum to 1.0.

```julia
julia> a = rand(20000,1);

julia> s = 0; for i = 1:length(a)
           s = s + a[i]
       end

julia> s
9954.844294400755

julia> sum(a)
9954.84429440082

julia> a1 = a/s;

julia> a2 = a/sum(a);

julia> sum(a1)
1.0000000000000064

julia> sum(a2)
0.9999999999999999
```

I guess this shows that KBN summation is better, but not perfect either... There are usually ways around the issue, if you know the problem exists. It might be worthwhile to specifically use KBN summation when normalizing probability vectors in multinomial and categorical distributions, especially if it is going away for the normal summation algorithm. |
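For reference, KBN (Kahan-Babuška-Neumaier) summation carries a separate compensation term that recovers most of the rounding error. A minimal sketch, not Base's actual `sum_kbn` (which in later Julia versions lives in the KahanSummation.jl package rather than Base):

```julia
# Kahan-Babuška-Neumaier compensated summation: `c` accumulates the
# low-order bits that are lost when `s` and `v` differ greatly in magnitude.
function kbn_sum(x)
    s = zero(eltype(x))   # running sum
    c = zero(eltype(x))   # running compensation
    for v in x
        t = s + v
        if abs(s) >= abs(v)
            c += (s - t) + v   # low-order bits of v were lost
        else
            c += (v - t) + s   # low-order bits of s were lost
        end
        s = t
    end
    return s + c
end
```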
Even precise floating-point summation may not help here. It's entirely possible that the normalized values simply don't add up to 1. |
True, but in my example, the KBN normalized sum was closer.
|
I don't think I have anything intelligent to add here. It seems like it would be nice to have the option to produce more accurate behavior, but it's hard to say how important these differences are in any given application. The numbers being cited are differences in probabilities that I would guess could rarely be reliably measured in most data collection environments. In the multinomial draws case, it seems like you'd need trillions of samples to tell the difference between them, but I may be overlooking some obvious case where things break down. |
I think the issue here is less that the probabilities need to be accurate and more that it's problematic when they don't appear to sum to 1. But yes, we probably want a way to switch this; the question is how, and limited to what scope. |
Right, at some point in the past, I had some issues sampling from discrete distributions with very small probabilities when those probabilities did not add up to 1.0, so I was (am) concerned about that here. I tested this by drawing samples from an exponential distribution and normalizing, for various sizes (https://gist.github.com/3856163). Note that this tries to show the worst case by sorting from high to low before summing. Generally speaking, up to 100,000 values, the normalized summation error (difference from 1.0) is a couple of orders of magnitude (or more) lower than the smallest probability, so there's probably not much effect on the accuracy. If this were used, e.g., as a Multinomial distribution, there may be some small issues with the last element getting more or less probability mass than it should, with the exact impact being problem dependent.

So I guess it's worth closing this issue and perhaps (re)opening a separate issue to provide the possibility of replacing standard summation with more accurate summing. This may be slightly challenging to provide universally, though, since, e.g., the Multinomial distribution does its own normalization because it checks for negative values. |
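A rough sketch of the kind of experiment described above (this is not the linked gist; the function name and the sizes are illustrative):

```julia
# Draw exponential samples, normalize them into a probability vector,
# sort high-to-low to provoke the worst case for naive summation, and
# compare the normalization error against the smallest probability.
function normalization_error(n)
    x = -log.(rand(n))               # Exponential(1) samples
    p = sort(x / sum(x), rev=true)   # normalized, largest first
    return abs(sum(p) - 1.0), minimum(p)
end

for n in (10^3, 10^4, 10^5)
    err, pmin = normalization_error(n)
    println("n = $n: |sum(p) - 1| = $err, smallest probability = $pmin")
end
```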
PS. For interested parties on this thread, note that we now use pairwise summation (#4039), which is often surprisingly close to Kahan summation for large arrays, but without the performance penalty. But we still use naive summation for sums along particular dimensions of multidimensional arrays. |
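For readers following along, pairwise summation looks roughly like this. A sketch only, not Base's implementation; the cutoff of 128 is an assumed value:

```julia
# Pairwise (cascade) summation: split the range in half recursively, so
# rounding errors grow like O(log n) rather than O(n), at essentially the
# speed of a plain loop. Below the cutoff we fall back to naive summation,
# which is why pathological cancellation cases can still fail (see below).
function pairwise_sum(x, lo=1, hi=length(x))
    if hi - lo < 128
        s = zero(eltype(x))
        for i in lo:hi
            s += x[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1
        return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
    end
end
```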
Python has had a specialized floating-point sum implementation, math.fsum, since version 2.6. Here are some recipes for various implementations with different characteristics. I think Julia ought to have a specialized floating-point summation algorithm given its emphasis on numerical computing. It is definitely something I would appreciate seeing added. Then, given documentation of the trade-offs, users could pick either the naive or the correct version. |
See @stevengj's comment right above – we use pairwise summation. Near-optimal performance, near-optimal accuracy, so it side-steps the trade-off altogether. However, since we're using a base case of iterative summation in that algorithm, we do fail the test case given in that article:

```julia
julia> v = [1, 1e100, 1, -1e100] * 10000
4-element Array{Float64,1}:
  10000.0
      1.0e104
  10000.0
     -1.0e104

julia> sum(v)
0.0
```
|
And we also have:

```julia
julia> sum_kbn(v)
20000.0
```
|
Yes, I read issue #4039, but I was still concerned because the guarantees for pairwise summation might not be good enough for a particular application. KBN does better and Python's fsum does even better than that. KBN would be fine for my uses but I didn't see it documented (the suffix "kbn" is not intuitive to me, especially compared to something like "float_sum" or "fsum", but I found it now through search) and so did not know it was part of base rather than just part of someone's code. In particular, it might be helpful to mention sum_kbn under the documentation for sum. |
The documentation for |
We could really use a systematic approach for "See also:" in our documentation system. |
@timholy So that it renders as links on platforms that support it. |