
Reducing over bool array fails to optimize somehow #43501

Open
jakobnissen opened this issue Dec 21, 2021 · 9 comments
Labels: fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster)

Comments

@jakobnissen
Contributor

jakobnissen commented Dec 21, 2021

Minimal example

julia> using BenchmarkTools

julia> function foo(x)
           y = false
           @inbounds for i in eachindex(x)
               y |= x[i]
           end
           y
       end
foo (generic function with 1 method)

julia> data = fill(false, 2^12);

julia> @btime reduce(|, data)
  300.850 ns (0 allocations: 0 bytes)
false

julia> @btime reduce(|, data, init=false)
  166.408 ns (0 allocations: 0 bytes)
false

julia> @btime foo(data)
  26.813 ns (0 allocations: 0 bytes)
false

Shouldn't reduce without the init keyword first find the initial value, then call the second method? In that case, it should not be 5x slower.
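
For illustration, the expectation could be sketched like this (a rough hypothetical helper, reduce_noinit, not Base's actual implementation): take the first element as the initial value, then reuse the init= code path.

# Hypothetical sketch of the expected behavior, not Base's code.
function reduce_noinit(op, A::AbstractVector)
    isempty(A) && throw(ArgumentError("reducing over an empty collection is not allowed"))
    init = @inbounds A[firstindex(A)]
    return reduce(op, view(A, firstindex(A)+1:lastindex(A)); init=init)
end

reduce_noinit(|, data) == reduce(|, data)  # true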

Edit: Also, for some reason, this does not vectorize.
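
One way to check (just a diagnostic, not part of the report above) is to look at the emitted LLVM IR and search for vector types such as <16 x i8>, which indicate SIMD:

using InteractiveUtils

# Vector types like <16 x i8> in the IR mean the loop was vectorized.
@code_llvm debuginfo=:none foo(data)

# @code_llvm only shows the outermost method, so for the reduce call one has to
# descend into the inner methods, e.g. with Cthulhu.jl's @descend.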

@stev47
Contributor

stev47 commented Jan 13, 2022

The keyword variant calls foldl_impl and the non-keyword variant apparently calls mapreduce_impl; maybe @tkf can comment?

@stev47 added the fold and performance labels on Jan 13, 2022
@nlw0
Contributor

nlw0 commented Apr 16, 2022

A similar experiment with + and Int64 gives a comparable result:

julia> using BenchmarkTools

julia> function foo(x)
           y = 0
           @inbounds for i in eachindex(x)
               y += x[i]
           end
           y
       end
foo (generic function with 1 method)

julia> dataint = fill(0, 2^12);

julia> @btime reduce(+, dataint)
  445.783 ns (0 allocations: 0 bytes)
0

julia> @btime reduce(+, dataint, init=0)
  443.990 ns (0 allocations: 0 bytes)
0

julia> @btime foo(dataint)
  233.726 ns (0 allocations: 0 bytes)
0

@nlw0
Contributor

nlw0 commented Apr 16, 2022

Even sum doesn't beat the for-loop for integers, although it does for floats (I'm even worried about the float sum performance with for loops!)

julia> @btime mapreduce(identity, +, dataint)
  527.974 ns (0 allocations: 0 bytes)
0

julia> @btime sum(dataint)
  354.445 ns (0 allocations: 0 bytes)
0

julia> @btime fooint(dataint)
  205.389 ns (0 allocations: 0 bytes)
0

I'm not familiar with how these functions are implemented, but the impression I got is that they eventually call a Base._foldl_impl method that is generic, based on iterate, and perhaps that doesn't get optimized. Is there a specialized version for arrays that should have been called instead?
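
For reference, a generic iterate-based left fold has roughly this shape (an illustrative sketch, not Base's actual _foldl_impl):

# Illustrative only; not the actual Base code.
function my_foldl(op, init, itr)
    acc = init
    y = iterate(itr)
    while y !== nothing
        val, state = y
        acc = op(acc, val)
        y = iterate(itr, state)
    end
    return acc
end

my_foldl(+, 0, dataint) == sum(dataint)  # true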

@nlw0
Contributor

nlw0 commented Apr 16, 2022

Not sure whether I made any mistakes here; like I said, I'm not too familiar with this code.

Tracing back what function is being called exactly, this is what I found using @edit, and the timing is consistent:

@btime foobool($databool) # 41ns
@btime reduce(|, $databool) # 520ns
@edit reduce(|,databool)
@btime mapreduce(identity, |, $databool)
@edit mapreduce(identity, |, databool)
@btime Base._mapreduce_dim(identity, |, Base._InitialValue(), $databool, :)
@edit Base._mapreduce_dim(identity, |, Base._InitialValue(), databool, :)
@btime Base._mapreduce(identity, |, IndexStyle($databool), $databool)# 520ns
@edit Base._mapreduce(identity, |, IndexStyle(databool), databool)

That would mean we are using an implementation of mapreduce that performs a binary-tree-like traversal of the data instead of a linear one. I wasn't yet sure this was the case when I noticed this function, so I opened another issue, #45000.

Here's a measurement of the time difference between reduce and the for-loop for |. There's a plot for + and Int64 on that issue.

[plot: run time difference between reduce(|, ...) and the for loop]

My guess is that it's overhead from the algorithm, and it gets exacerbated in the boolean case. I suppose Bool and Int do not benefit from the pairwise summation scheme. It would probably be nice to make sure this algorithm is only used for floating-point element types, while everything else uses a simple linear traversal. Otherwise, there should be versions of mapreduce_impl and _mapreduce that get used for element types that do not benefit from pairwise summation, so we can specialize for them.
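
For concreteness, a blocked pairwise reduction has roughly this shape (an illustrative sketch assuming an associative op, not Base's actual mapreduce_impl):

# Illustrative sketch, not Base's implementation.
function pairwise_reduce(op, A, lo, hi, blksize)
    if hi - lo < blksize
        acc = @inbounds A[lo]
        @inbounds for i in lo+1:hi      # small block: plain linear loop
            acc = op(acc, A[i])
        end
        return acc
    else
        mid = (lo + hi) >>> 1           # split in half, reduce both halves, combine
        return op(pairwise_reduce(op, A, lo, mid, blksize),
                  pairwise_reduce(op, A, mid + 1, hi, blksize))
    end
end

pairwise_reduce(|, data, 1, length(data), 16) == reduce(|, data)  # true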

Alternatively, we could increase the block size based on the operation as well as the element type, although _mapreduce has a fixed threshold of 16 for linear traversal, so that alone doesn't solve the boolean case. That assumes it's really just a matter of replacing this method and probably calling mapfoldl, as @simeonschaub mentions in #45000. That only really solves it for Int, though, still leaving a puzzle in the Bool case.

julia> @btime mapfoldl(identity, +, $dataint);
  375.175 μs (0 allocations: 0 bytes)
julia> @btime fooint($dataint);
  367.436 μs (0 allocations: 0 bytes)
julia> @btime mapfoldl(identity, |, $databool);
  105.097 ns (0 allocations: 0 bytes)
julia> @btime foobool($databool);
  41.287 ns (0 allocations: 0 bytes)

@gbaraldi
Member

Folds all have some weird performance issues. See #43310

@N5N3
Member

N5N3 commented Apr 17, 2022

The pairwise reduction could be turned off by

Base.pairwise_blocksize(f, ::typeof(|)) = typemax(Int) # enlarge the threshold to avoid split.

then we have

julia> @btime reduce(|, $data)
  88.438 ns (0 allocations: 0 bytes)  # 268.111 ns before.

Still slower than foo, but much better.

The remaining difference comes from the initialization: foo uses false, while mapreduce_impl uses data[1] | data[2].
The following is an MWE:

function foo(x)
    y = false
    @inbounds for i in 1:length(x)
        y |= x[i]
    end
    y
end
function foo2(x)
    @inbounds y = x[1]|x[2]
    @inbounds for i in 3:length(x)
        y |= x[i]
    end
    y
end
@btime foo2($data)  #   83.212 ns (0 allocations: 0 bytes)
@btime foo($data)  # 37.059 ns (0 allocations: 0 bytes)

@nlw0
Contributor

nlw0 commented Apr 20, 2022

@N5N3 this is great. I fear the problem may be larger than this one operation, though. Would you mind commenting on #45000? Perhaps I should not have created a second issue. But the point is, if we start looking around, we might want to raise this block length for many more operations. I don't think there are many desirable applications for the pairwise mapreduce_impl other than accurate summation with floating point, compared to all the general uses of reduce and fold, and mapreduce_impl seems to be used in most cases. Furthermore, there may actually be better alternatives for accurate summation, something that is perhaps only really viable on more modern processors. Anyway, this suggests we might want a larger refactoring than just tuning this parameter for a few more cases such as | with Bool.
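
For context, the accuracy benefit of the pairwise scheme is easy to see with Float32 (an added illustration, not a benchmark from this thread):

# Summing 2^25 ones in Float32: a plain left-to-right loop saturates once the
# accumulator reaches 2^24, while the pairwise sum stays exact.
function naive_sum(xs)
    acc = zero(eltype(xs))
    for v in xs
        acc += v
    end
    return acc
end

xs = fill(1.0f0, 2^25)
naive_sum(xs)  # 1.6777216f7 (stuck at 2^24)
sum(xs)        # 3.3554432f7 (exact)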

@jakobnissen changed the title from "Reducing over bool array fails to optimize somehow" to "Mapreduce wrongly defaults to blocked pairwise implementation" on May 4, 2022
@jakobnissen
Contributor Author

Okay, it's clear the issue is that mapreduce defaults to the pairwise blocked implementation, which makes sense for floats, but not for other data types.

@jakobnissen changed the title from "Mapreduce wrongly defaults to blocked pairwise implementation" back to "Reducing over bool array fails to optimize somehow" on May 4, 2022
@jakobnissen
Contributor Author

Whoops, looks like I was wrong.
