mapreduce: use simpler reduction order in some cases #45000

Open
nlw0 opened this issue Apr 16, 2022 · 12 comments

@nlw0
Contributor

nlw0 commented Apr 16, 2022

I'm not sure whether this is related to #43501 or not. I just noticed this code for the first time:

v1 = mapreduce_impl(f, op, A, ifirst, imid, blksize)

Are there any references about this implementation? When could we expect this doubly-recursive approach to be preferable to a linear traversal?
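
For context, here is a minimal sketch of the shape of that doubly-recursive strategy; the names and the 1024 block size are illustrative, not Base's actual code:

```julia
# Simplified sketch of the divide-and-conquer ("pairwise") reduction used by
# Base.mapreduce_impl; illustrative only, not the real implementation.
function pairwise_mapreduce(f, op, A, ifirst, ilast, blksize=1024)
    if ilast - ifirst < blksize
        # Base case: plain left-to-right accumulation over a small block.
        v = f(A[ifirst])
        for i in ifirst+1:ilast
            v = op(v, f(A[i]))
        end
        return v
    else
        # Recursive case: split the range in half and combine the halves,
        # which keeps the reduction tree balanced.
        imid = ifirst + (ilast - ifirst) >> 1
        v1 = pairwise_mapreduce(f, op, A, ifirst, imid, blksize)
        v2 = pairwise_mapreduce(f, op, A, imid + 1, ilast, blksize)
        return op(v1, v2)
    end
end

pairwise_mapreduce(abs2, +, randn(Float32, 10^6), 1, 10^6)
```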

@simeonschaub
Member

See the discussion in #4039 for example. For linear traversal just use mapfoldl.

@nlw0
Contributor Author

nlw0 commented Apr 16, 2022

It's about accuracy, TIL...

@nlw0
Contributor Author

nlw0 commented Apr 16, 2022

I have noticed this seems to be used for Integers as well, where the order of operations should not affect accuracy. sum(::Vector{Int64}) appears to use this method, giving it a penalty over a simple for-loop for smaller arrays.

sum over Int64 values is 50% to 150% slower than a for-loop for arrays of up to 4k items. I understand the desire to offer an accurate sum for large float arrays, but I find this surprising. Would you consider this a worthwhile issue, @simeonschaub?

[benchmark plot: sum vs. for-loop timings for Vector{Int64}]
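
For reference, a rough sketch of the kind of comparison behind that plot, assuming BenchmarkTools is installed; loop_sum is an illustrative helper, not anything in Base:

```julia
using BenchmarkTools

# Plain accumulation loop; the compiler is free to vectorize it because
# integer addition is associative.
function loop_sum(A)
    s = zero(eltype(A))
    for x in A
        s += x
    end
    return s
end

a = rand(Int64, 4096)
@btime sum($a)       # goes through Base's pairwise mapreduce machinery
@btime loop_sum($a)  # simple loop; identical result for integers
```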

@nlw0
Contributor Author

nlw0 commented Apr 17, 2022

I've run an experiment trying to measure the speed and accuracy of sum versus a simple for-loop adding up values from a random vector. I might have made some mistake, so I'd be glad to hear any comments about the code: https://gist.github.com/nlw0/80eca5f6b873e9b75c760b8635c0a593

The main thing I'm not sure about is how to measure the accuracy. I've chosen the median of absolute relative errors, but perhaps other measures are more appropriate. The distribution of values probably also has an influence, as does the data type (I used Float32), not to mention the machine, etc.

Another caveat is that vectorization might have a positive effect on accuracy that should be taken into account: we are not necessarily doing a strictly linear sum here, but a "for-loop with compiler optimizations on".
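
For concreteness, one way such a measurement could look (a rough sketch in the spirit of the gist, not its exact code; loop_sum is the same illustrative helper as in the earlier sketch):

```julia
using Statistics

# Simple left-to-right loop, subject to whatever the compiler does with it.
function loop_sum(A)
    s = zero(eltype(A))
    for x in A
        s += x
    end
    return s
end

# Median absolute relative error of `summer` over `trials` random Float32
# vectors of length n, using a Float64 sum as the reference.
function median_rel_error(summer, n; trials=1000)
    errs = Float64[]
    for _ in 1:trials
        a = randn(Float32, n)
        ref = sum(Float64.(a))
        push!(errs, abs(Float64(summer(a)) - ref) / abs(ref))
    end
    return median(errs)
end

median_rel_error(sum, 10_000)       # pairwise sum in Base
median_rel_error(loop_sum, 10_000)  # simple loop
```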

Here are the results:
[plot: speed and accuracy of sum vs. the simple loop, as a function of input length]

Tentative conclusions:

  1. The error of the pairwise approach used in sum (i.e. Base.mapreduce_impl) does appear to be bounded, or to grow very slowly, while the simple linear sum shows steadily growing error.
  2. The performance cost does not become negligible until we reach pretty large vector lengths, ~100k values.
  3. The performance cost is as high as 5x (!) for smaller arrays.
  4. Simple summation resulted in better accuracy for smaller arrays (with all the caveats above).

I would suppose this experiment, if there's no major mistake in it, should be an incentive to at least try tuning the block-size parameter. It's bigger than that, though, because this function appears to act as the de facto mapreduce used in Julia, e.g. #43501. As a result, anyone using sum for integers is paying a performance price with no accuracy benefit, and that seems to be the case even for small float arrays. I understand the original idea, but I would like to suggest taking a second look at when this function should be employed.

@N5N3
Member

N5N3 commented Apr 20, 2022

IIUC, pwsum only has better accuracy once the input's length exceeds 2048.
This might be machine dependent, but it does show that the default pairwise_blocksize could be tuned more carefully.

As for the high "performance cost" for small arrays, it should not be caused by the pairwise reduction itself, but by the initialization.

julia> @btime simpsum($a); @btime pwsum($a, $1024);
  18.337 ns (0 allocations: 0 bytes)
  22.267 ns (0 allocations: 0 bytes)

julia> a = randn(Float32,64);

julia> @btime simpsum($a); @btime pwsum($a, $1024);
  5.400 ns (0 allocations: 0 bytes)
  25.978 ns (0 allocations: 0 bytes)

julia> a = randn(Float32,66);

julia> @btime simpsum($a); @btime pwsum($a, $1024);
  4.500 ns (0 allocations: 0 bytes)
  6.800 ns (0 allocations: 0 bytes)

LLVM doesn't handle unvectorized remainders well; this strongly affects performance for inputs with small bit sizes.

As an easy improvement, we can turn off pairwise reduction whenever we know that pairwise and simple reduction give exactly the same result,
e.g. + / * for hardware integers, and & / | / min / max for hardware reals.
(It should be safe, as the change can't affect inference or the accuracy of the result.)
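
A hedged sketch of that idea, with illustrative names rather than Base internals: a trait marks the (op, element type) pairs for which the reduction order provably cannot change the result, and those dispatch to a plain @simd loop:

```julia
# Which (op, element type) pairs are exactly associative, so pairwise and
# simple reduction give bit-identical results. Illustrative only.
const HWInt = Union{Int8,Int16,Int32,Int64,Int128,
                    UInt8,UInt16,UInt32,UInt64,UInt128}

exactly_associative(::typeof(+), ::Type{<:HWInt}) = true
exactly_associative(::typeof(*), ::Type{<:HWInt}) = true
exactly_associative(::Union{typeof(min),typeof(max)}, ::Type{<:Real}) = true
exactly_associative(op, ::Type) = false

# Assumes a non-empty vector; falls back to Base's reduce otherwise.
function maybe_simple_reduce(op, A::AbstractVector)
    exactly_associative(op, eltype(A)) || return reduce(op, A)
    v = A[begin]
    @inbounds @simd for i in firstindex(A)+1:lastindex(A)
        v = op(v, A[i])
    end
    return v
end
```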

@nalimilan
Member

@nlw0 Do you have evidence showing that integer sum could be significantly faster if we switched to a simple loop instead of using pairwise summation? I imagine this kind of change could be accepted (contrary to dropping pairwise summation for floating point).

@nlw0
Contributor Author

nlw0 commented Oct 25, 2023

#45000 (comment)

@nalimilan Do you have the power to e.g. revert this closed issue back to open? It's a pretty simple experiment, but I'm not going to spend more time on this issue unless someone with the power to do something concrete is listening. I've provided evidence and arguments, and it's easy to see from the theory why this pairwise reduction is bound to get in the way of compiler optimizations. As much as I like discussing these things, I don't really see why I should spend more time on it; the evidence and arguments are there, and you could try disproving them.

As for not dropping pairwise summation for floats: in the end it's a design decision. I happen to think it's a bad one, trying to save users from themselves, but it's up to the core designers. By the way, I have shown better precision and better running time for the non-pairwise summation, so even if we believe it's important to save users from themselves, pairwise summation is not doing it for some input lengths. A case of "victim of your own hubris", in my humble opinion. If anybody really cares about floating-point summation accuracy and/or running time, this sticks out like a sore thumb.

@StefanKarpinski
Member

I think there are a number of reasons why this issue hasn't gotten any traction. The title suggests that you want to always do linear traversal in mapreduce (that's the title, after all), which sounds like a guarantee, whereas having the option to use other reduction orders is pretty crucial for both the speed and accuracy of floating-point reduction, among other cases. Even in the integer case, you really don't want to guarantee strict left-to-right reduction, since that precludes SIMD reordering, which is very important for performance. So let me assume that the actual intention of this issue is much milder: to point out that there are some specific cases where it might be better to use a different algorithm than the one we currently use for mapreduce.

Another reason I think you're having trouble getting traction is that there are too many different things being discussed here, and some of them are convincing while some aren't. One fairly simple and very plausible point is that it would be faster, specifically for integers, to use a straightforward SIMD reduction with no pairwise accumulation logic. Since integer + and * are associative, this has no effect on accuracy and is strictly about performance. It's an easy argument to make: change the implementation, show that it's faster. It should be an easy PR to get merged if someone wants to do it.

The floating-point accuracy case is going to be harder to make. What I do know is that simple left-to-right accumulation (which the title and experiments suggest you're advocating for) is the worst possible algorithm in terms of both speed and accuracy, because it has the most skewed tree of operations. The more balanced the tree, the better the error behavior, and that tends to be better for speed as well. Whether the current implementation is optimal is an entirely different matter, but I think much of the rejection of the issue comes from people (imo correctly) disbelieving that left-to-right reduction would be better.
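
For reference, the textbook worst-case rounding-error bounds behind this argument, with u the unit roundoff and ignoring higher-order terms, are roughly:

```latex
% Worst-case bounds for summing x_1, ..., x_n:
% left-to-right (recursive) summation vs. pairwise summation.
\[
  |\hat{s}_{\mathrm{ltr}} - s| \;\le\; (n-1)\,u \sum_{i=1}^{n} |x_i|,
  \qquad
  |\hat{s}_{\mathrm{pw}} - s| \;\le\; \lceil \log_2 n \rceil\,u \sum_{i=1}^{n} |x_i|.
\]
```

That is, the left-to-right bound grows linearly in n while the pairwise bound grows only logarithmically.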

So, in summary:

  • It's definitely possible to make improvements to mapreduce, but a guarantee of strict left-to-right reduction is not going to happen—we need to keep the latitude to reduce in a better order.
  • Neither integers nor floats will be faster with a strict left-to-right reduction, which is what the title suggests—you need to at least take advantage of SIMD, which isn't going to be strictly left-to-right. It seems like what you're actually advocating for is a "simple" left-to-right-modulo-SIMD reduction (I'll call this LTRS), but one really needs to get into the weeds of the issue to understand that.
  • The simplest case to be made kind of gets lost in the noise, which is that integers should use LTRS. I believe this entirely and think it was probably a mistake that they got included in the pairwise reduction logic—pairwise reduction is totally unnecessary overhead for associative numerical types like integers.
  • I also fully believe that left-to-right-modulo-SIMD reduction for floats is somewhat faster, but I'm skeptical of claims that it's more accurate; indeed, if I'm reading your experiments right, they show linear error growth for left-to-right and asymptotically constant error for pairwise reduction, which is what is theoretically expected.
  • I'm very curious about the LTRS error being lower for arrays smaller than 2^11; I can't understand why that would be the case. I wouldn't be surprised by the error being the same, but why would pairwise have worse error for smaller arrays? If we can understand why that's the case, it certainly suggests that the base case size for pairwise summation should be around 2^11 elements.

A suggestion for improving the summation comparison: use the Xsum package—it computes the exact sum of a collection of floats efficiently, and even more precisely than using BigFloats.
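
A hedged sketch of what that comparison could look like, assuming the Xsum.jl package and its exported xsum function:

```julia
using Xsum  # exactly rounded summation (Radford Neal's superaccumulator)

a = randn(Float32, 100_000)
ref = xsum(Float64.(a))                  # exactly rounded reference sum
relerr(s) = abs(Float64(s) - ref) / abs(ref)

relerr(sum(a))        # Base's pairwise sum
relerr(foldl(+, a))   # strict left-to-right fold
```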

@nlw0
Contributor Author

nlw0 commented Nov 15, 2023

@StefanKarpinski Thanks so much for looking into this. Indeed, my point is not to use a strict left-to-right reduction (or fold), but "modulo SIMD", as you said. Another way to put it: the default reduce (or mapreduce) implementation should be what you get from writing the reduction as a naive for-loop, which will probably look like a left-to-right fold but will then be subject to compiler optimizations such as SIMD vectorization.

My initial idea was just that we shouldn't use pairwise reduction for Int or Bool, and maybe we could just write specialized implementations for those. My later thought, though, is that the simple loop really should be the default case, and pairwise reduction for floats should be the exception, with specialized implementations. And I would still argue it's better if it were an entirely separate, specialized function, for the sake of keeping the implementation of the core functions simple; programmers dealing with floating point should just know to pay attention to accuracy when it's relevant.

Regarding the better accuracy, I believe the SIMD vectorization in my case may have distributed the summation over more accumulators than the pairwise code does. Pairwise reduction limits the error growth, but does not achieve the best possible accuracy for the type, and in that range the "implicit pairwise" structure created by vectorization happened to split the work further and so achieved better accuracy.
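
A rough illustration of what that means (not Base or compiler output): vectorization effectively keeps several independent accumulators, which is itself a shallow reduction tree:

```julia
# Four independent accumulators, mimicking what SIMD vectorization does
# with vector lanes; purely illustrative.
function fourlane_sum(A::Vector{Float32})
    s1 = s2 = s3 = s4 = 0.0f0
    n = length(A)
    i = 1
    while i + 3 <= n
        s1 += A[i]; s2 += A[i+1]; s3 += A[i+2]; s4 += A[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)   # combine the "lanes"
    while i <= n                # scalar remainder
        s += A[i]
        i += 1
    end
    return s
end
```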

@PallHaraldsson
Contributor

Neither integers nor floats will be faster with a strict left-to-right reduction, which is what the title suggests

Since the (original) question has been answered, shouldn't this issue be closed, after fixing the typo "linar" in the title? Also, since it would be less accurate to do so for floats, this will never happen.

Alternatively change the title to ".. linear (modulo SIMD) ..":

sum over Int64 values is 50% to 150% slower than a for-loop for arrays of up to 4k items. I understand the desire to offer an accurate sum for large float arrays, but I find this surprising.

Julia needs to be concerned with accuracy, and with large arrays; small arrays are only slower by a (small) constant factor. I'm not sure the same algorithm can be optimal for both. Isn't this algorithm also used by default for small fixed-size arrays, or should those (from a package) implement another sum? A runtime check could decide whether the array is large or small; I'm not sure how costly that would be (e.g. too much overhead for small arrays?). I suppose sum over small slices is also relevant here, so the optimization potentially needs to live here, not only in a package.

@nlw0 nlw0 changed the title Why not always linar traversal in mapreduce? Why not always linar (modulo SIMD) traversal in mapreduce? Nov 15, 2023
@nlw0
Contributor Author

nlw0 commented Nov 15, 2023

Since the (original) question has been answered, shouldn't this issue be closed, after fixing the typo "linar" in the title? Also, since it would be less accurate to do so for floats, this will never happen.

It's perhaps important to emphasize that my main concern is that mapreduce is a very basic, fundamental function, especially from a functional-programming point of view, and I'm just arguing for making it simpler. Whether sum uses pairwise summation is a different problem. I would personally prefer that pairwise not be implicit anywhere, but I can live with that.

What's a good target for a PR? Can we refactor mapreduce as long as pairwise summation is still retained for sum over floats? I'm not sure how easy it will be to track down all the implications. It would be nice to have a conversation about strategy first, otherwise the PR might quickly run into unforeseen objections.

Alternatively change the title to ".. linear (modulo SIMD) ..":

Done.

Julia needs to be concerned with accuracy, and with large arrays; small arrays are only slower by a (small) constant factor. I'm not sure the same algorithm can be optimal for both. Isn't this algorithm also used by default for small fixed-size arrays, or should those (from a package) implement another sum? A runtime check could decide whether the array is large or small; I'm not sure how costly that would be (e.g. too much overhead for small arrays?). I suppose sum over small slices is also relevant here, so the optimization potentially needs to live here, not only in a package.

I would suggest that the default behavior should, in general, be to use a very simple algorithm and reap the benefits of compiler optimizations. Runtime checks for summing a vector are a great topic; it touches the whole fascinating area of compiling matrix operations, for instance, although that sounds like something that belongs at a higher level of abstraction, in my opinion.

@StefanKarpinski StefanKarpinski changed the title Why not always linar (modulo SIMD) traversal in mapreduce? mapreduce: use simpler reduction order in some cases Nov 15, 2023
@StefanKarpinski
Member

Regarding the better accuracy, I believe the SIMD vectorization in my case may have distributed the summation over more accumulators than the pairwise code does. Pairwise reduction limits the error growth, but does not achieve the best possible accuracy for the type, and in that range the "implicit pairwise" structure created by vectorization happened to split the work further and so achieved better accuracy.

That's interesting and we should get to the bottom of it. I haven't dug into the code in a long time, but ideally the pairwise summation should have a base case about the size of the L1 cache that just does LTRS summation, and then do pairwise on top of that, which is why I was confused about the difference. Doing pairwise all the way down is worse for performance and doesn't really help accuracy either.
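
A hedged sketch of that structure, with illustrative names and an untuned block size, not a proposed implementation: an @simd left-to-right base case over sizable blocks, with pairwise combination only above that:

```julia
# LTRS base case: left-to-right accumulation that the compiler may reorder
# for SIMD within the block.
function ltrs_block(A::Vector{Float32}, lo::Int, hi::Int)
    s = 0.0f0
    @inbounds @simd for i in lo:hi
        s += A[i]
    end
    return s
end

# Pairwise combination only above the block size (4096 is illustrative).
function pairwise_over_blocks(A::Vector{Float32}, lo::Int=1, hi::Int=length(A);
                              blksize::Int=4096)
    hi - lo < blksize && return ltrs_block(A, lo, hi)
    mid = lo + (hi - lo) >> 1
    return pairwise_over_blocks(A, lo, mid; blksize) +
           pairwise_over_blocks(A, mid + 1, hi; blksize)
end
```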
