faster maximum, minimum (#28849) #30320

jw3126 · 2018-12-08T21:50:08Z

Borrowing an idea from @antoine-levitt #26003 (comment).
~~There is one slight difference to the current maximum/minimum behavior. Namely -0.0 and 0.0 are not always ordered correctly (the same thing happens with numpy.):~~

Old

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.818 ms (0.00% GC)
  median time:      6.048 ms (0.00% GC)
  mean time:        6.054 ms (0.00% GC)
  maximum time:     6.973 ms (0.00% GC)
  --------------
  samples:          825
  evals/sample:     1

Numpy

In [1]: import numpy as np

In [2]: arr = np.random.randn(10**6);

In [3]: %timeit np.max(arr)
429 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

New

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.215 ms (0.00% GC)
  median time:      1.264 ms (0.00% GC)
  mean time:        1.265 ms (0.00% GC)
  maximum time:     1.420 ms (0.00% GC)
  --------------
  samples:          3930
  evals/sample:     1

New

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     909.563 μs (0.00% GC)
  median time:      937.093 μs (0.00% GC)
  mean time:        947.286 μs (0.00% GC)
  maximum time:     1.398 ms (0.00% GC)
  --------------
  samples:          5234
  evals/sample:     1

New

julia> using BenchmarkTools; arr = randn(10^6); @benchmark maximum($arr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     566.788 μs (0.00% GC)
  median time:      582.718 μs (0.00% GC)
  mean time:        590.280 μs (0.00% GC)
  maximum time:     880.178 μs (0.00% GC)
  --------------
  samples:          8369
  evals/sample:     1

base/reduce.jl

jw3126 · 2018-12-09T14:25:28Z

Can somebody rerun appveyor?

chethega · 2018-12-09T22:40:04Z

Looking at @code_native, it seems that vectorization fails?

julia> _fmax(x,y) = ifelse(x>y, x, y);
julia> function _fmaximum(arr)
              v=arr[1]
              @simd for i=1:length(arr)
              @inbounds v = _fmax(v,arr[i])
              end
              v
              end
julia> function _fmaximum2(arr)
              v1 = arr[1]
              v2 = arr[1]
              v3 = arr[1]
              v4=arr[1]
              isnan_ = isnan(v1) | isnan(v2)
              @simd for i=1:4:length(arr)
              @inbounds v1 = _fmax(v1,arr[i])
              @inbounds v2 = _fmax(v1,arr[i+1])
              @inbounds v3 = _fmax(v1,arr[i+2])
              @inbounds v4 = _fmax(v1,arr[i+3])
              isnan_ |= isnan(v1) | isnan(v2) | isnan(v3) | isnan(v4)
              end
              !isnan_ && return convert(typeof(v1),NaN)
              max(v1,v2,v3,v4)
              end

julia> function _fmaximum3(arr)
              v1 = arr[1]
              v2 = arr[1]
              v3 = arr[1]
              v4=arr[1]
              #isnan_ = isnan(v1) | isnan(v2)
              @simd for i=1:4:length(arr)
              @inbounds v1 = _fmax(v1,arr[i])
              @inbounds v2 = _fmax(v1,arr[i+1])
              @inbounds v3 = _fmax(v1,arr[i+2])
              @inbounds v4 = _fmax(v1,arr[i+3])
              #isnan_ |= isnan(v1) | isnan(v2) | isnan(v3) | isnan(v4)
              end
              #!isnan_ && return convert(typeof(v1),NaN)
              max(v1,v2,v3,v4)
              end


julia> using BenchmarkTools
julia> arr = randn(10^6);
julia> @btime maximum($arr);
  7.640 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum($arr);
  1.341 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum2($arr);
  1.021 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum3($arr);
  537.057 μs (0 allocations: 0 bytes)

julia> arr = randn(Float32,10^6);

julia> @btime maximum($arr);
  7.233 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum($arr);
  1.323 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum2($arr);
  1.003 ms (0 allocations: 0 bytes)

julia> @btime _fmaximum3($arr);
  340.281 μs (0 allocations: 0 bytes)

Can we get properly vectorized (SIMD) code here? Does LLVM understand that min/max are associative?

jw3126 · 2018-12-10T07:49:15Z

Oh nice @chethega. The timing of _fmaximum3 looks like that of numpy. So I guess it uses SIMD? Should I update the PR with such a code? Or is this a compiler glitch, thats easy to fix?

chethega · 2018-12-10T16:17:05Z

I don't know. Your PR gives a nice 5x boost, but there is another big factor on the table.

I guess some llvm expert needs to chime in on associativity of min/max algebra; did we write this in a bad way, are we missing an optimization pass or is this a problem of llvm? Or is this just stuck upstream and will solve itself soon?

If the latter, then I'd say merge and keep an issue open to revisit this later, once the fundamentals in llvm are ready. Nobody got grey hair waiting for the slow maximum yet, so people can enjoy the 5x speedup until they get the next 4x; salami tactics yay.

Second point: I don't understand why your implementation is correct with NaN.

julia> fm(x,y)=ifelse(x > y, x, y)
julia> A=[1.0, NaN, 2.0];
julia> fm(fm(A[1],A[2]), A[3])
2.0

Does your implementation handle that correctly? Regardless, could you add that test-case, just for my piece of mind? I think this is an example that any future PR can plausibly get wrong, which makes it a good test (even better test: [1.0, 2.0, NaN, 4.0, 5.0]).

jw3126 · 2018-12-10T16:36:09Z

Yes the implentation should be NaN correcect, there are also tests. See here. The trick is to ensure, that all NaN are either the second argument or checked explicitly.

julia> fm(x,y)=ifelse(x > y, x, y)
fm (generic function with 1 method)

julia> A=[1.0, NaN, 2.0];

julia> fm(fm(A[1],A[2]), A[3])
2.0

julia> fm(A[3], fm(A[1],A[2]))
NaN

chethega · 2018-12-10T17:18:21Z

My problem is that I don't see how you ensure that. If you fold a NaN into a number, you get a NaN; good. But if you fold a number into a NaN, then you get the number, and since you only check every chunk_len, one of the NaNs could escape.

julia> @eval Base _fast(::typeof(max), x, y) = ifelse(x > y, x, y)
julia> @eval Base _fast(::typeof(min), x, y) = ifelse(x < y, x, y)

julia> @eval Base function mapreduce_impl(f::ft, op::Union{typeof(max), typeof(min)},A::AbstractArray, first::Int, last::Int) where ft
        a1 = @inbounds A[first]
        v = mapreduce_first(f, op, a1)
        chunk_len = 256
        stop = first
        while last - stop > 0
        v == v || return v
        start = stop + 1
        stop = min(stop + chunk_len, last)
        for i in start:stop
        @inbounds ai = A[i]
        v = _fast(op, v, f(ai))
        end
        end
        v
        end

julia> A=rand(5); A[3]=NaN; maximum(A)
NaN
julia> A=rand(1000); A[400]=NaN; maximum(A)
0.9991780990981842

jw3126 · 2018-12-10T20:10:55Z

You are right @chethega the current state of the PR is incorrect.

jw3126 · 2018-12-10T20:27:12Z

Should be fixed (without changing speed of PR).

chethega · 2018-12-10T21:12:18Z

That looks correct, but gives almost a 2x slowdown compared to the original PR :(

Somewhat faster variant for floats on my machine is

julia> function _maximum(a::AbstractArray{T}) where T
       v0=typemin(T)
       i1 = false
       rs = mapreduce(_lift, _red, a; init=(v0,i1))
       rs[2] && return T(NaN)
       rs[1]
       end
julia> fmax(x,y)=ifelse(x > y, x, y)
julia> _lift(x)=(x, isnan(x));
julia> _red(a,b)=( (v1,i1) = a; (v2,i2)=b; (fmax(v1,v2),i1|i2) )
julia> @btime _maximum($arr)
  1.621 ms (0 allocations: 0 bytes)

I think this is because (1) the above fmax is a hardware instruction, and (2) we are latency constrained. That is, apart from the maybe upstream issue of using SIMD / vector units, we have all these nice execution ports idling around, because the code is written such that we have one giant dependency chain. The above variant breaks at least the nan-check out of that dependency chain.

jw3126 · 2018-12-10T21:19:55Z

That is weird, on my machine, it is the same speed as original PR.

jw3126 · 2018-12-10T22:28:59Z

One question is, whether it is okay to confuse the order of 0.0 and -0.0. E.g is it acceptable that things like

maximum([0.0, -0.0]) === -0.0

happen? In numpy this can happen. AFAICT without this PR this does not happen, but is not covered by unit tests.
If we forbid this, we can implement it in a post processing step that only runs if maximum would return -0.0 otherwise. This way there is only performance cost in the corner cases, where this actually matters.

haampie · 2018-12-11T19:25:12Z

As a user I would expect that maximum(v) is defined as (a linear-time version of) sort(v)[end]. The default order of arrays of floats is the total order induced by isless, which is -Inf < -1.0 < -0.0 < 0.0 < 1.0 < Inf < NaN. So isless(0.0, -0.0) == false.

(In an ideal case I would like to be able to pass an Base.Order.Ordering instance to maximum such that I can actually tell what I mean by maximum. See also https://discourse.julialang.org/t/taking-sorting-ordering-seriously/14975)

chethega · 2018-12-11T22:15:12Z

Nice, no more objections from me. We might want a @fastmath as well, for skipping nan-handling and signed zeros. I think passing an Ordering is a cool idea but should be a separate feature PR.

Can we get a nanosoldier run? Just in case of surprises?

KristofferC · 2018-12-11T22:25:40Z

@nanosoldier runbenchmarks(ALL, vs = ":master")

nanosoldier · 2018-12-12T05:32:24Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

jw3126 · 2018-12-12T08:01:08Z

Any ideas why CI fails?

fredrikekre · 2018-12-12T08:11:40Z

Any ideas why CI fails?

#30356

jw3126 · 2018-12-12T22:10:42Z

Can somebody rerun appveyor?

jw3126 · 2018-12-18T06:08:30Z

Any objections left?

nalimilan · 2018-12-18T17:24:56Z

base/reduce.jl

+    start = first
+    stop  = start + chunk_len - 4
+    while stop <= last
+        isnan(v1) && return v1


This broke maximum on non-numeric values. This is visible only with long arrays:

julia> maximum(rand(Char, 500)) ERROR: MethodError: no method matching isnan(::Char) Closest candidates are: isnan(::BigFloat) at mpfr.jl:879 isnan(::Missing) at missing.jl:83 isnan(::Float16) at float.jl:530 ... Stacktrace: [1] mapreduce_impl(::typeof(identity), ::typeof(max), ::Array{Char,1}, ::Int64, ::Int64) at ./reduce.jl:482 [2] _mapreduce at ./reduce.jl:320 [inlined] [3] _mapreduce_dim at ./reducedim.jl:308 [inlined] [4] #mapreduce#549 at ./reducedim.jl:304 [inlined] [5] mapreduce at ./reducedim.jl:304 [inlined] [6] _maximum at ./reducedim.jl:653 [inlined] [7] _maximum at ./reducedim.jl:652 [inlined] [8] #maximum#555 at ./reducedim.jl:648 [inlined] [9] maximum(::Array{Char,1}) at ./reducedim.jl:648 [10] top-level scope at none:0

KristofferC · 2018-12-18T17:34:34Z

Right now we are only backporting bugfixes to 1.1.0. This can go in 1.1.1.

ViralBShah · 2018-12-18T18:18:14Z

Reverting due to the reported breakage. We should also add the non-numeric case to the tests.

This reverts commit 7b2db8e.

Close #30436 Address the non-numeric values case that is visible only with long arrays reported in #30320

faster maximum, minimum (JuliaLang#28849)

2b49100

jw3126 commented Dec 8, 2018

View reviewed changes

base/reduce.jl Outdated Show resolved Hide resolved

KristofferC reviewed Dec 8, 2018

View reviewed changes

base/reduce.jl Outdated Show resolved Hide resolved

fix

ae2f40e

StefanKarpinski approved these changes Dec 9, 2018

View reviewed changes

KristofferC added performance Must go faster backport pending 1.0 labels Dec 9, 2018

KristofferC mentioned this pull request Dec 10, 2018

Backports for 1.1.0 #30309

Merged

52 tasks

fix

c1f5569

add max_maximum and min_minimum

1ca74c5

jw3126 force-pushed the stats branch from 8fd16ef to 1ca74c5 Compare December 10, 2018 22:30

c42f mentioned this pull request Dec 11, 2018

sum(a) is now 30% slower than NumPy #30290

Closed

fix

d84337f

fredrikekre closed this Dec 12, 2018

fredrikekre reopened this Dec 12, 2018

fix

afbd564

ViralBShah merged commit 7b2db8e into JuliaLang:master Dec 18, 2018

nalimilan reviewed Dec 18, 2018

View reviewed changes

KristofferC removed the backport pending 1.1 label Dec 18, 2018

ViralBShah added a commit that referenced this pull request Dec 18, 2018

Revert "Fix #28849: faster maximum, minimum (#30320)"

128b4ce

This reverts commit 7b2db8e.

ViralBShah mentioned this pull request Dec 18, 2018

Revert "faster maximum, minimum (#28849)" #30436

Closed

jw3126 mentioned this pull request Dec 18, 2018

fix maximum on generic orders #30441

Merged

ViralBShah pushed a commit that referenced this pull request Dec 20, 2018

fix maximum on generic orders (#30441)

744ead4

Close #30436 Address the non-numeric values case that is visible only with long arrays reported in #30320

ViralBShah added backport pending 1.1 and removed backport pending 1.1 labels Dec 20, 2018

tkoolen mentioned this pull request Dec 20, 2018

BoundsError on master but works fine on 1.0.3 (and previous) #30462

Closed

ViralBShah added Backport pending 1.1.1 and removed backport pending 1.0 labels Dec 20, 2018

mbauman mentioned this pull request Dec 21, 2018

Fix #30462, out of bounds access in maximum/minimum for certain-sized arrays #30484

Merged

StefanKarpinski added backport 1.1 triage This should be discussed on a triage call and removed backport 1.1 labels Jan 31, 2019

ararslan mentioned this pull request Mar 18, 2019

The default mapreduce_impl doesn't work for non-Bool == #31390

Closed

mbauman mentioned this pull request Mar 22, 2019

Extrema is slower than maximum + minimum #31442

Open

nlw0 mentioned this pull request Apr 18, 2019

likely vectorization discrepancy between julia and clang triple-nested-loop gemms #29445

Open

nlw0 mentioned this pull request Apr 28, 2019

Implements the @nonans decorator to enable some LLVM vectorizations #31862

Open

Keno removed the triage This should be discussed on a triage call label Feb 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

faster maximum, minimum (#28849) #30320

faster maximum, minimum (#28849) #30320

jw3126 commented Dec 8, 2018 •

edited

Loading

jw3126 commented Dec 9, 2018

chethega commented Dec 9, 2018

jw3126 commented Dec 10, 2018

chethega commented Dec 10, 2018

jw3126 commented Dec 10, 2018 •

edited

Loading

chethega commented Dec 10, 2018

jw3126 commented Dec 10, 2018

jw3126 commented Dec 10, 2018 •

edited

Loading

chethega commented Dec 10, 2018

jw3126 commented Dec 10, 2018

jw3126 commented Dec 10, 2018

haampie commented Dec 11, 2018 •

edited

Loading

chethega commented Dec 11, 2018

KristofferC commented Dec 11, 2018

nanosoldier commented Dec 12, 2018

jw3126 commented Dec 12, 2018

fredrikekre commented Dec 12, 2018

jw3126 commented Dec 12, 2018

jw3126 commented Dec 18, 2018

nalimilan Dec 18, 2018

KristofferC commented Dec 18, 2018

ViralBShah commented Dec 18, 2018 •

edited

Loading

faster maximum, minimum (#28849) #30320

faster maximum, minimum (#28849) #30320

Conversation

jw3126 commented Dec 8, 2018 • edited Loading

Old

Numpy

New

New

New

jw3126 commented Dec 9, 2018

chethega commented Dec 9, 2018

jw3126 commented Dec 10, 2018

chethega commented Dec 10, 2018

jw3126 commented Dec 10, 2018 • edited Loading

chethega commented Dec 10, 2018

jw3126 commented Dec 10, 2018

jw3126 commented Dec 10, 2018 • edited Loading

chethega commented Dec 10, 2018

jw3126 commented Dec 10, 2018

jw3126 commented Dec 10, 2018

haampie commented Dec 11, 2018 • edited Loading

chethega commented Dec 11, 2018

KristofferC commented Dec 11, 2018

nanosoldier commented Dec 12, 2018

jw3126 commented Dec 12, 2018

fredrikekre commented Dec 12, 2018

jw3126 commented Dec 12, 2018

jw3126 commented Dec 18, 2018

nalimilan Dec 18, 2018

Choose a reason for hiding this comment

KristofferC commented Dec 18, 2018

ViralBShah commented Dec 18, 2018 • edited Loading

jw3126 commented Dec 8, 2018 •

edited

Loading

jw3126 commented Dec 10, 2018 •

edited

Loading

jw3126 commented Dec 10, 2018 •

edited

Loading

haampie commented Dec 11, 2018 •

edited

Loading

ViralBShah commented Dec 18, 2018 •

edited

Loading