
Reducing over bool array fails to optimize somehow #43501

Open
jakobnissen opened this issue Dec 21, 2021 · 9 comments
Labels: fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster)

Comments

@jakobnissen
Contributor

jakobnissen commented Dec 21, 2021

Minimal example

julia> using BenchmarkTools

julia> function foo(x)
           y = false
           @inbounds for i in eachindex(x)
               y |= x[i]
           end
           y
       end
foo (generic function with 1 method)

julia> data = fill(false, 2^12);

julia> @btime reduce(|, data)
  300.850 ns (0 allocations: 0 bytes)
false

julia> @btime reduce(|, data, init=false)
  166.408 ns (0 allocations: 0 bytes)
false

julia> @btime foo(data)
  26.813 ns (0 allocations: 0 bytes)
false

Shouldn't reduce without the init keyword first find the initial value, then call the second method? In that case, it should not be 5x slower.
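
For illustration, the expectation could be sketched like this (a rough hypothetical helper, reduce_noinit, not Base's actual implementation): take the first element as the initial value, then reuse the init= code path.

# Hypothetical sketch of the expected behavior, not Base's code.
function reduce_noinit(op, A::AbstractVector)
    isempty(A) && throw(ArgumentError("reducing over an empty collection is not allowed"))
    init = @inbounds A[firstindex(A)]
    return reduce(op, view(A, firstindex(A)+1:lastindex(A)); init=init)
end

reduce_noinit(|, data) == reduce(|, data)  # true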

Edit: Also, for some reason, this does not vectorize.
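
One way to check (just a diagnostic, not part of the report above) is to look at the emitted LLVM IR and search for vector types such as <16 x i8>, which indicate SIMD:

using InteractiveUtils

# Vector types like <16 x i8> in the IR mean the loop was vectorized.
@code_llvm debuginfo=:none foo(data)

# @code_llvm only shows the outermost method, so for the reduce call one has to
# descend into the inner methods, e.g. with Cthulhu.jl's @descend.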

@stev47
Contributor

stev47 commented Jan 13, 2022

The keyword variant calls foldl_impl and the non-keyword variant apparently calls mapreduce_impl; maybe @tkf can comment?

@stev47 added the fold and performance labels on Jan 13, 2022
@nlw0
Contributor

nlw0 commented Apr 16, 2022

A similar experiment with + and Int64 gives a comparable result:

julia> using BenchmarkTools

julia> function foo(x)
           y = 0
           @inbounds for i in eachindex(x)
               y += x[i]
           end
           y
       end
foo (generic function with 1 method)

julia> dataint = fill(0, 2^12);

julia> @btime reduce(+, dataint)
  445.783 ns (0 allocations: 0 bytes)
0

julia> @btime reduce(+, dataint, init=0)
  443.990 ns (0 allocations: 0 bytes)
0

julia> @btime foo(dataint)
  233.726 ns (0 allocations: 0 bytes)
0

@nlw0
Contributor

nlw0 commented Apr 16, 2022

Even sum doesn't beat the for-loop for integers, although it does for floats (I'm even worried about the float sum performance with for loops!)

julia> @btime mapreduce(identity, +, dataint)
  527.974 ns (0 allocations: 0 bytes)
0

julia> @btime sum(dataint)
  354.445 ns (0 allocations: 0 bytes)
0

julia> @btime fooint(dataint)
  205.389 ns (0 allocations: 0 bytes)
0

I'm not familiar with how these functions are implemented, but the impression I got is that they eventually call a Base._foldl_impl method that is generic, based on iterate, and perhaps that doesn't get optimized. Is there a specialized version for arrays that should have been called instead?
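
For reference, a generic iterate-based left fold has roughly this shape (an illustrative sketch, not Base's actual _foldl_impl):

# Illustrative only; not the actual Base code.
function my_foldl(op, init, itr)
    acc = init
    y = iterate(itr)
    while y !== nothing
        val, state = y
        acc = op(acc, val)
        y = iterate(itr, state)
    end
    return acc
end

my_foldl(+, 0, dataint) == sum(dataint)  # true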

@nlw0
Contributor

nlw0 commented Apr 16, 2022

Not sure whether I made any mistakes here; like I said, I'm not too familiar with this code.

Tracing back what function is being called exactly, this is what I found using @edit, and the timing is consistent:

@btime foobool($databool) # 41ns
@btime reduce(|, $databool) # 520ns
@edit reduce(|,databool)
@btime mapreduce(identity, |, $databool)
@edit mapreduce(identity, |, databool)
@btime Base._mapreduce_dim(identity, |, Base._InitialValue(), $databool, :)
@edit Base._mapreduce_dim(identity, |, Base._InitialValue(), databool, :)
@btime Base._mapreduce(identity, |, IndexStyle($databool), $databool)# 520ns
@edit Base._mapreduce(identity, |, IndexStyle(databool), databool)

That would mean we are using an implementation of mapreduce that performs a binary-tree-like traversal of the data instead of a linear one. I wasn't yet sure this was the case when I noticed this function, so I opened another issue, #45000.

Here's a measurement of the time difference between reduce and the for-loop for |. There's a plot for + and Int64 on that issue.

[plot: run time difference between reduce(|, ...) and the for loop]

My guess is that it's overhead from the algorithm, and it gets exacerbated in the boolean case. I suppose Bool and Int do not benefit from the pairwise summation scheme. It would probably be nice to make sure this algorithm is only used for floating-point element types, while everything else uses a simple linear traversal. Otherwise, there should be versions of mapreduce_impl and _mapreduce that get used for element types that do not benefit from pairwise summation, so we can specialize for them.
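
For concreteness, a blocked pairwise reduction has roughly this shape (an illustrative sketch assuming an associative op, not Base's actual mapreduce_impl):

# Illustrative sketch, not Base's implementation.
function pairwise_reduce(op, A, lo, hi, blksize)
    if hi - lo < blksize
        acc = @inbounds A[lo]
        @inbounds for i in lo+1:hi      # small block: plain linear loop
            acc = op(acc, A[i])
        end
        return acc
    else
        mid = (lo + hi) >>> 1           # split in half, reduce both halves, combine
        return op(pairwise_reduce(op, A, lo, mid, blksize),
                  pairwise_reduce(op, A, mid + 1, hi, blksize))
    end
end

pairwise_reduce(|, data, 1, length(data), 16) == reduce(|, data)  # true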

Alternatively, we could increase the block size based on the operation as well as the element type, although _mapreduce has a fixed threshold of 16 for linear traversal, so that alone doesn't solve the boolean case. That assumes it's really just a matter of replacing this method and probably calling mapfoldl, as @simeonschaub mentions in #45000. That only really solves it for Int, though, still leaving a puzzle in the Bool case.

julia> @btime mapfoldl(identity, +, $dataint);
  375.175 μs (0 allocations: 0 bytes)
julia> @btime fooint($dataint);
  367.436 μs (0 allocations: 0 bytes)
julia> @btime mapfoldl(identity, |, $databool);
  105.097 ns (0 allocations: 0 bytes)
julia> @btime foobool($databool);
  41.287 ns (0 allocations: 0 bytes)

@gbaraldi
Member

Folds all have some weird performance issues. See #43310

@N5N3
Member

N5N3 commented Apr 17, 2022

The pairwise reduction could be turned off by

Base.pairwise_blocksize(f, ::typeof(|)) = typemax(Int) # enlarge the threshold to avoid split.

then we have

julia> @btime reduce(|, $data)
  88.438 ns (0 allocations: 0 bytes)  # 268.111 ns before.

Still slower than foo, but much better.

The remaining difference comes from the initialization: foo uses false, while mapreduce_impl uses data[1] | data[2].
The following is an MWE:

function foo(x)
    y = false
    @inbounds for i in 1:length(x)
        y |= x[i]
    end
    y
end
function foo2(x)
    @inbounds y = x[1]|x[2]
    @inbounds for i in 3:length(x)
        y |= x[i]
    end
    y
end
@btime foo2($data)  #   83.212 ns (0 allocations: 0 bytes)
@btime foo($data)  # 37.059 ns (0 allocations: 0 bytes)

@nlw0
Contributor

nlw0 commented Apr 20, 2022

@N5N3 this is great. I fear the problem may be larger than this one operation, though. Would you mind commenting on #45000? Perhaps I should not have created a second issue. But the point is, if we start looking around, we might want to raise this block length for many more operations. I don't think there are many desirable applications for the pairwise mapreduce_impl other than accurate summation with floating point, compared to all the general uses of reduce and fold, and mapreduce_impl seems to be used in most cases. Furthermore, there may actually be better alternatives for accurate summation, something that is perhaps only really viable on more modern processors. Anyway, this suggests we might want a larger refactoring than just tuning this parameter for a few more cases such as | with Bool.
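
For context, the accuracy benefit of the pairwise scheme is easy to see with Float32 (an added illustration, not a benchmark from this thread):

# Summing 2^25 ones in Float32: a plain left-to-right loop saturates once the
# accumulator reaches 2^24, while the pairwise sum stays exact.
function naive_sum(xs)
    acc = zero(eltype(xs))
    for v in xs
        acc += v
    end
    return acc
end

xs = fill(1.0f0, 2^25)
naive_sum(xs)  # 1.6777216f7 (stuck at 2^24)
sum(xs)        # 3.3554432f7 (exact)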

@jakobnissen changed the title from "Reducing over bool array fails to optimize somehow" to "Mapreduce wrongly defaults to blocked pairwise implementation" on May 4, 2022
@jakobnissen
Contributor Author

Okay, it's clear the issue is that mapreduce defaults to the pairwise blocked implementation, which makes sense for floats, but not for other data types.

@jakobnissen changed the title from "Mapreduce wrongly defaults to blocked pairwise implementation" back to "Reducing over bool array fails to optimize somehow" on May 4, 2022
@jakobnissen
Contributor Author

Whoops, looks like I was wrong.
