Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster maximum, minimum (#28849) #30320

Merged
merged 6 commits into from
Dec 18, 2018
Merged

faster maximum, minimum (#28849) #30320

merged 6 commits into from
Dec 18, 2018

Conversation

jw3126
Copy link
Contributor

@jw3126 jw3126 commented Dec 8, 2018

Borrowing an idea from @antoine-levitt #26003 (comment).
There is one slight difference to the current maximum/minimum behavior. Namely -0.0 and 0.0 are not always ordered correctly (the same thing happens with numpy.):

Old

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.818 ms (0.00% GC)
  median time:      6.048 ms (0.00% GC)
  mean time:        6.054 ms (0.00% GC)
  maximum time:     6.973 ms (0.00% GC)
  --------------
  samples:          825
  evals/sample:     1

Numpy

In [1]: import numpy as np

In [2]: arr = np.random.randn(10**6);

In [3]: %timeit np.max(arr)
429 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

New

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.215 ms (0.00% GC)
  median time:      1.264 ms (0.00% GC)
  mean time:        1.265 ms (0.00% GC)
  maximum time:     1.420 ms (0.00% GC)
  --------------
  samples:          3930
  evals/sample:     1

New

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     909.563 μs (0.00% GC)
  median time:      937.093 μs (0.00% GC)
  mean time:        947.286 μs (0.00% GC)
  maximum time:     1.398 ms (0.00% GC)
  --------------
  samples:          5234
  evals/sample:     1

New

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     566.788 μs (0.00% GC)
  median time:      582.718 μs (0.00% GC)
  mean time:        590.280 μs (0.00% GC)
  maximum time:     880.178 μs (0.00% GC)
  --------------
  samples:          8369
  evals/sample:     1

base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
@jw3126
Copy link
Contributor Author

jw3126 commented Dec 9, 2018

Can somebody rerun appveyor?

@chethega
Copy link
Contributor

chethega commented Dec 9, 2018

Looking at @code_native, it seems that vectorization fails?

julia> _fmax(x,y) = ifelse(x>y, x, y);
julia> function _fmaximum(arr)
              v=arr[1]
              @simd for i=1:length(arr)
              @inbounds v = _fmax(v,arr[i])
              end
              v
              end
julia> function _fmaximum2(arr)
              v1 = arr[1]
              v2 = arr[1]
              v3 = arr[1]
              v4=arr[1]
              isnan_ = isnan(v1) | isnan(v2)
              @simd for i=1:4:length(arr)
              @inbounds v1 = _fmax(v1,arr[i])
              @inbounds v2 = _fmax(v1,arr[i+1])
              @inbounds v3 = _fmax(v1,arr[i+2])
              @inbounds v4 = _fmax(v1,arr[i+3])
              isnan_ |= isnan(v1) | isnan(v2) | isnan(v3) | isnan(v4)
              end
              !isnan_ && return convert(typeof(v1),NaN)
              max(v1,v2,v3,v4)
              end

julia> function _fmaximum3(arr)
              v1 = arr[1]
              v2 = arr[1]
              v3 = arr[1]
              v4=arr[1]
              #isnan_ = isnan(v1) | isnan(v2)
              @simd for i=1:4:length(arr)
              @inbounds v1 = _fmax(v1,arr[i])
              @inbounds v2 = _fmax(v1,arr[i+1])
              @inbounds v3 = _fmax(v1,arr[i+2])
              @inbounds v4 = _fmax(v1,arr[i+3])
              #isnan_ |= isnan(v1) | isnan(v2) | isnan(v3) | isnan(v4)
              end
              #!isnan_ && return convert(typeof(v1),NaN)
              max(v1,v2,v3,v4)
              end


julia> using BenchmarkTools
julia> arr = randn(10^6);
julia> @btime maximum($arr);
  7.640 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum($arr);
  1.341 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum2($arr);
  1.021 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum3($arr);
  537.057 μs (0 allocations: 0 bytes)

julia> arr = randn(Float32,10^6);

julia> @btime maximum($arr);
  7.233 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum($arr);
  1.323 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum2($arr);
  1.003 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum3($arr);
  340.281 μs (0 allocations: 0 bytes)

Can we get properly vectorized (SIMD) code here? Does LLVM understand that min/max are associative?

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 10, 2018

Oh nice @chethega. The timing of _fmaximum3 looks like that of numpy. So I guess it uses SIMD? Should I update the PR with such a code? Or is this a compiler glitch, thats easy to fix?

@KristofferC KristofferC mentioned this pull request Dec 10, 2018
52 tasks
@chethega
Copy link
Contributor

I don't know. Your PR gives a nice 5x boost, but there is another big factor on the table.

I guess some llvm expert needs to chime in on associativity of min/max algebra; did we write this in a bad way, are we missing an optimization pass or is this a problem of llvm? Or is this just stuck upstream and will solve itself soon?

If the latter, then I'd say merge and keep an issue open to revisit this later, once the fundamentals in llvm are ready. Nobody got grey hair waiting for the slow maximum yet, so people can enjoy the 5x speedup until they get the next 4x; salami tactics yay.

Second point: I don't understand why your implementation is correct with NaN.

julia> fm(x,y)=ifelse(x > y, x, y)
julia> A=[1.0, NaN, 2.0];
julia> fm(fm(A[1],A[2]), A[3])
2.0

Does your implementation handle that correctly? Regardless, could you add that test-case, just for my piece of mind? I think this is an example that any future PR can plausibly get wrong, which makes it a good test (even better test: [1.0, 2.0, NaN, 4.0, 5.0]).

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 10, 2018

Yes the implentation should be NaN correcect, there are also tests. See here. The trick is to ensure, that all NaN are either the second argument or checked explicitly.

julia> fm(x,y)=ifelse(x > y, x, y)
fm (generic function with 1 method)

julia> A=[1.0, NaN, 2.0];

julia> fm(fm(A[1],A[2]), A[3])
2.0

julia> fm(A[3], fm(A[1],A[2]))
NaN

@chethega
Copy link
Contributor

My problem is that I don't see how you ensure that. If you fold a NaN into a number, you get a NaN; good. But if you fold a number into a NaN, then you get the number, and since you only check every chunk_len, one of the NaNs could escape.

julia> @eval Base _fast(::typeof(max), x, y) = ifelse(x > y, x, y)
julia> @eval Base _fast(::typeof(min), x, y) = ifelse(x < y, x, y)

julia> @eval Base function mapreduce_impl(f::ft, op::Union{typeof(max), typeof(min)},A::AbstractArray, first::Int, last::Int) where ft
        a1 = @inbounds A[first]
        v = mapreduce_first(f, op, a1)
        chunk_len = 256
        stop = first
        while last - stop > 0
        v == v || return v
        start = stop + 1
        stop = min(stop + chunk_len, last)
        for i in start:stop
        @inbounds ai = A[i]
        v = _fast(op, v, f(ai))
        end
        end
        v
        end

julia> A=rand(5); A[3]=NaN; maximum(A)
NaN
julia> A=rand(1000); A[400]=NaN; maximum(A)
0.9991780990981842

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 10, 2018

You are right @chethega the current state of the PR is incorrect.

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 10, 2018

Should be fixed (without changing speed of PR).

@chethega
Copy link
Contributor

That looks correct, but gives almost a 2x slowdown compared to the original PR :(

Somewhat faster variant for floats on my machine is

julia> function _maximum(a::AbstractArray{T}) where T
       v0=typemin(T)
       i1 = false
       rs = mapreduce(_lift, _red, a; init=(v0,i1))
       rs[2] && return T(NaN)
       rs[1]
       end
julia> fmax(x,y)=ifelse(x > y, x, y)
julia> _lift(x)=(x, isnan(x));
julia> _red(a,b)=( (v1,i1) = a; (v2,i2)=b; (fmax(v1,v2),i1|i2) )
julia> @btime _maximum($arr)
  1.621 ms (0 allocations: 0 bytes)

I think this is because (1) the above fmax is a hardware instruction, and (2) we are latency constrained. That is, apart from the maybe upstream issue of using SIMD / vector units, we have all these nice execution ports idling around, because the code is written such that we have one giant dependency chain. The above variant breaks at least the nan-check out of that dependency chain.

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 10, 2018

That is weird, on my machine, it is the same speed as original PR.

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 10, 2018

One question is, whether it is okay to confuse the order of 0.0 and -0.0. E.g is it acceptable that things like

maximum([0.0, -0.0]) === -0.0

happen? In numpy this can happen. AFAICT without this PR this does not happen, but is not covered by unit tests.
If we forbid this, we can implement it in a post processing step that only runs if maximum would return -0.0 otherwise. This way there is only performance cost in the corner cases, where this actually matters.

@haampie
Copy link
Contributor

haampie commented Dec 11, 2018

As a user I would expect that maximum(v) is defined as (a linear-time version of) sort(v)[end]. The default order of arrays of floats is the total order induced by isless, which is -Inf < -1.0 < -0.0 < 0.0 < 1.0 < Inf < NaN. So isless(0.0, -0.0) == false.

(In an ideal case I would like to be able to pass an Base.Order.Ordering instance to maximum such that I can actually tell what I mean by maximum. See also https://discourse.julialang.org/t/taking-sorting-ordering-seriously/14975)

@chethega
Copy link
Contributor

Nice, no more objections from me. We might want a @fastmath as well, for skipping nan-handling and signed zeros. I think passing an Ordering is a cool idea but should be a separate feature PR.

Can we get a nanosoldier run? Just in case of surprises?

@KristofferC
Copy link
Member

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 12, 2018

Any ideas why CI fails?

@fredrikekre fredrikekre reopened this Dec 12, 2018
@fredrikekre
Copy link
Member

Any ideas why CI fails?

#30356

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 12, 2018

Can somebody rerun appveyor?

@jw3126
Copy link
Contributor Author

jw3126 commented Dec 18, 2018

Any objections left?

@ViralBShah ViralBShah merged commit 7b2db8e into JuliaLang:master Dec 18, 2018
start = first
stop = start + chunk_len - 4
while stop <= last
isnan(v1) && return v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This broke maximum on non-numeric values. This is visible only with long arrays:

julia> maximum(rand(Char, 500))
ERROR: MethodError: no method matching isnan(::Char)
Closest candidates are:
  isnan(::BigFloat) at mpfr.jl:879
  isnan(::Missing) at missing.jl:83
  isnan(::Float16) at float.jl:530
  ...
Stacktrace:
 [1] mapreduce_impl(::typeof(identity), ::typeof(max), ::Array{Char,1}, ::Int64, ::Int64) at ./reduce.jl:482
 [2] _mapreduce at ./reduce.jl:320 [inlined]
 [3] _mapreduce_dim at ./reducedim.jl:308 [inlined]
 [4] #mapreduce#549 at ./reducedim.jl:304 [inlined]
 [5] mapreduce at ./reducedim.jl:304 [inlined]
 [6] _maximum at ./reducedim.jl:653 [inlined]
 [7] _maximum at ./reducedim.jl:652 [inlined]
 [8] #maximum#555 at ./reducedim.jl:648 [inlined]
 [9] maximum(::Array{Char,1}) at ./reducedim.jl:648
 [10] top-level scope at none:0

@KristofferC
Copy link
Member

Right now we are only backporting bugfixes to 1.1.0. This can go in 1.1.1.

@ViralBShah
Copy link
Member

ViralBShah commented Dec 18, 2018

Reverting due to the reported breakage. We should also add the non-numeric case to the tests.

ViralBShah added a commit that referenced this pull request Dec 18, 2018
ViralBShah pushed a commit that referenced this pull request Dec 20, 2018
Close #30436
Address the non-numeric values case that is visible only with long arrays reported in #30320
@StefanKarpinski StefanKarpinski added backport 1.1 triage This should be discussed on a triage call and removed backport 1.1 labels Jan 31, 2019
@Keno Keno removed the triage This should be discussed on a triage call label Feb 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants