mapreduce: use simpler reduction order in some cases #45000
Comments
See the discussion in #4039 for example. For linear traversal just use
It's about accuracy, TIL...
I have noticed this seems to be used for Integers as well, where the order of operations should not affect accuracy.
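For concreteness, a quick REPL check (not from the thread) of why reduction order matters for floats but not for integers:

```julia
julia> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # Float64: rounding depends on order
false

julia> (1 + 2) + 3 == 1 + (2 + 3)               # Int: any order gives the same result
true
```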
I've made an experiment trying to measure the speed and accuracy of

The main thing I'm not sure about is how I'm supposed to measure the accuracy. I've chosen the median of absolute relative errors, but perhaps other measurements are more adequate. The distribution of values can probably influence the results as well, as can the data type (I've used Float32), not to mention the machine, etc. Another caveat is that vectorization might have a positive effect on accuracy that should be taken into account: we are not necessarily doing a linear sum here, but a "for-loop with compiler optimizations on".

Tentative conclusions:
I would suppose this experiment, if there's no major mistake, should incentivize at least trying to tune this block size parameter, or something. It's bigger than that, though, because this function appears to work as the "de facto" mapreduce used in Julia, e.g. #43501. As a result, anyone using
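The definitions of `simpsum` and `pwsum` used in the benchmarks below aren't reproduced in this thread; a plausible sketch of what they could look like (a plain `@simd` loop versus a blocked pairwise recursion, with `bs` as a hypothetical block-size parameter) is:

```julia
# Simple left-to-right loop; @simd lets LLVM reassociate and vectorize it.
function simpsum(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end

# Pairwise summation: recurse until the range has at most ~bs elements,
# then fall back to the simple loop on that block.
function pwsum(a, bs, lo = firstindex(a), hi = lastindex(a))
    if hi - lo < bs
        s = zero(eltype(a))
        @simd for i in lo:hi
            @inbounds s += a[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1
        return pwsum(a, bs, lo, mid) + pwsum(a, bs, mid + 1, hi)
    end
end
```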
IIUC, PWSUM has better accuracy when the input's length exceeds 2048. As for the high "performance cost" for small arrays, it should not be caused by the pairwise reduction, but by the initialization.

```julia
julia> @btime simpsum($a); @btime pwsum($a, $1024);
  18.337 ns (0 allocations: 0 bytes)
  22.267 ns (0 allocations: 0 bytes)

julia> a = randn(Float32,64);

julia> @btime simpsum($a); @btime pwsum($a, $1024);
  5.400 ns (0 allocations: 0 bytes)
  25.978 ns (0 allocations: 0 bytes)

julia> a = randn(Float32,66);

julia> @btime simpsum($a); @btime pwsum($a, $1024);
  4.500 ns (0 allocations: 0 bytes)
  6.800 ns (0 allocations: 0 bytes)
```

LLVM doesn't handle unvectorized remainders well; this strongly affects the performance for inputs with small bitsize. As an easy improvement, we can turn off pairwise reduction if we know
@nlw0 Do you have evidence showing that integer sum could be significantly faster if we switched to a simple loop instead of using pairwise summation? I imagine this kind of change could be accepted (contrary to dropping pairwise summation for floating point).
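The kind of evidence being asked for could be gathered with something like the sketch below (assuming BenchmarkTools; `simdsum` is a hypothetical name for the plain loop), comparing Base's `sum` against a straightforward `@simd` loop on a `Vector{Int}`:

```julia
using BenchmarkTools

# Plain loop reduction; for Int the result is identical to sum(v)
# regardless of reduction order, so only the speed can differ.
function simdsum(v::AbstractVector{Int})
    s = 0
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    return s
end

v = rand(Int, 10_000)
@assert simdsum(v) == sum(v)   # same answer for integers, any order
@btime sum($v)
@btime simdsum($v)
```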
@nalimilan Do you have any power to e.g. revert this closed issue back to open? It's a pretty simple experiment, but I'm not going to spend more time on this issue unless someone with the power to do something concrete is listening. I've provided enough evidence and arguments, and it's easy to see from the theory why this pairwise reduction is bound to introduce obstacles to compiler optimizations. As much as I like discussing these things, I don't really see why I should spend more time on this issue. I have provided evidence and arguments; you could try disproving them already.

As for not doing pairwise summation for floats, it's in the end a design decision. I happen to think it's a bad one, trying to save users from themselves, but in the end it's up to the core designers. By the way, I have shown better precision and time for the non-pairwise summation, so even if we believe it's important to save users from themselves, pairwise summation is not doing it for some input lengths. A case of "victim of your own hubris", in my humble opinion. If anybody really does care about floating-point summation accuracy and/or running time, this sticks out like a sore thumb.
I think there's a number of reasons why this issue hasn't gotten any traction. The title suggests that you want to always do linear traversal in mapreduce (that's the title, after all), which sounds like a guarantee, whereas having the option to do other reduction orders is pretty crucial for both the speed and accuracy of floating-point reduction, among other cases. Even in the integer case, you really don't want to guarantee strict left-to-right reduction, since that precludes SIMD reordering, which is very important for performance. So let me assume that the actual intention of this issue is much milder: to point out that there are some specific cases where it might be better to use a different algorithm than the one we are currently using for

Another reason that I think you're having trouble getting traction is that there are too many different things being discussed here, and some of them are convincing and some of them aren't. One fairly simple thing that seems very plausible is that it would be faster specifically for integers to use a straightforward SIMD reduction with no pairwise accumulation logic. Since integer

The floating-point accuracy thing is going to be a harder case to make. What I do know is that simple left-to-right accumulation (as the title and experiments suggest you're advocating for) is the worst possible algorithm in terms of both speed and accuracy, because it has the most skewed tree of operations. The more balanced the tree, the better the error behavior, and that tends to be better for speed as well. Whether the current implementation is optimal or not is an entirely different matter, but I think much of the rejection of the issue comes from people (imo correctly) disbelieving that left-to-right reduction would be better. So, in summary:
A suggestion for improving the summation comparison: use the Xsum package; it computes the exact sum of a collection of floats efficiently, and even more precisely than using BigFloats.
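A hedged sketch of how that could slot into the accuracy measurement, assuming the registered Xsum.jl package and its exported `xsum` function: use `xsum` as the reference value and compare each candidate sum against it.

```julia
using Xsum   # provides xsum(): exactly rounded summation

a = randn(Float32, 10_000)
ref = xsum(Float64.(a))                     # effectively exact reference value

relerr(s) = abs(Float64(s) - ref) / abs(ref)
relerr(sum(a))        # Base's pairwise sum
relerr(foldl(+, a))   # strict left-to-right sum
```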
@StefanKarpinski Thanks so much for looking into this. Indeed, my point is not to use strict left-to-right reduction (or fold), but "modulo SIMD" as you said. Another way to put it is this: the default reduce (or mapreduce) implementation should be what you get from writing the reduce as a "naive for-loop". That will probably look like a left-to-right fold, but will then be subject to compiler optimizations such as SIMD vectorization.

My initial idea was just that we shouldn't use pairwise for Int or Bool, and maybe we can just write specialized implementations for these. My later thought, though, is that the simple loop really should be the default case, and pairwise for Floats should be the exception, with specialized implementations. And I would still argue it's better if it were an entirely separate, specialized function, for the sake of keeping the implementation of core functions simple; programmers dealing with floating point should just know to pay attention to accuracy when it's relevant.

Regarding the better accuracy, I believe the SIMD vectorization in my case may have distributed the summation over more branches than the pairwise version does. Pairwise limits the error growth, but it doesn't achieve the best accuracy for the type, and in that range the "implicit pairwise" caused by vectorization happened to divide the load further and therefore achieved better accuracy.
Since the (original) question has been answered, shouldn't this issue be closed, after fixing the typo "linar" in the title? Also, since it's less accurate to do so for floats, this will never happen. Alternatively, change the title to ".. linear (modulo SIMD) ..":
Julia needs to be concerned with accuracy, and with large arrays. Small arrays are slower only by a (small) constant factor. I'm not sure the same algorithm can be optimal for both. Isn't the default algorithm also used for small (fixed-size) arrays, or should those (from a package) implement another sum? A runtime check could be made to see whether a large or small array is used; I'm not sure how costly that is (e.g. too large an overhead for small arrays?). I suppose sum here is also relevant for small slices, so potentially an optimization needs to be here, and not only in a package?
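A minimal sketch of the kind of runtime check being suggested, with a hypothetical cutoff and hypothetical helper names (`simdsum` and `pairwisesum`, standing for the simple-loop and pairwise variants sketched earlier):

```julia
# Hypothetical adaptive wrapper: a single O(1) length check picks the algorithm,
# which is negligible next to the O(n) reduction except perhaps for tiny arrays.
function adaptivesum(v::AbstractVector{<:AbstractFloat}; cutoff::Int = 1024)
    return length(v) < cutoff ? simdsum(v) : pairwisesum(v)
end
```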
It's perhaps important to emphasize that my main concern is that

What's a good target for a PR? Can we refactor mapreduce as long as pairwise summation is still retained for
Done.
I would suggest that the default behavior should in general be to use a very simple algorithm, and potentially reap the benefits of compiler optimizations. Runtime checking for summation of a vector is a great topic; it touches the whole amazing topic of compiling matrix operations, for instance. Although it sounds like something that belongs at a higher level of abstraction, in my opinion.
That's interesting and we should get to the bottom of this. I haven't dug into the code in a long time, but ideally the pairwise summation should have a base case that's about the size of the L1 cache that just does left-to-right SIMD summation and then does pairwise on top of that, which is why I was confused about the difference. Doing pairwise all the way down is worse for performance and doesn't really help accuracy either.
I'm not sure whether this is related to #43501 or not. I just noticed this code for the first time:
julia/base/reduce.jl, line 258 (at commit 6106e6c)
Are there any references about this implementation? When could we expect this doubly-recursive approach to be preferable to a linear traversal?
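For readers without the permalink handy, the referenced code implements a blocked pairwise reduction; a rough sketch of its shape (not the verbatim Base source) is:

```julia
# Roughly the shape of a blocked pairwise mapreduce over A[lo:hi]: below a
# block-size threshold do a plain @simd loop; above it, split the range in
# half, recurse on both halves, and combine them with op.
function mapreduce_pairwise(f, op, A, lo, hi, blksize)
    if lo == hi
        return f(A[lo])
    elseif hi - lo < blksize
        v = op(f(A[lo]), f(A[lo + 1]))
        @simd for i in lo+2:hi
            @inbounds v = op(v, f(A[i]))
        end
        return v
    else
        mid = lo + ((hi - lo) >> 1)
        return op(mapreduce_pairwise(f, op, A, lo, mid, blksize),
                  mapreduce_pairwise(f, op, A, mid + 1, hi, blksize))
    end
end
```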