Make minmax faster for Float32/64 #41709
Conversation
The speedup for …

Since …

If the change in allocations for … It's possible there's some user code out there relying on it being …
I'm not familiar with …

```julia
julia> a = BigFloat(1//3, 300); b = BigFloat(1//5, 300);

julia> min(a, b) == b
false
```

It seems to be a better example for switching to a non-allocating version?
I think we'd better pick this PR up. @oscardssmith @tkf
For reference: the behavior of …
I consider …
Hmm… can the promotion of the precision be added to …
If we do that, I guess all other math calls should be modified:

```julia
julia> b = BigFloat(1.0, 3000) + BigFloat(1.0, 200)
2.0

julia> b.prec
256
```

Maybe we can avoid allocation in …
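As an aside, Python's `decimal` module follows a roughly analogous model (an analogy only, not MPFR itself): arithmetic results are rounded to the *context* precision, not to the precision of the widest operand, just as the `BigFloat` sum above comes back at the default 256 bits. A small illustrative sketch:

```python
from decimal import Decimal, localcontext

# Analogy to the BigFloat session above: results take the context's
# ("default") precision, regardless of how precise the operands are.
with localcontext() as ctx:
    ctx.prec = 5                     # context precision, like DEFAULT_PRECISION
    a = Decimal("1.0000000000")      # operands carry more digits than that
    b = Decimal("1.0000000001")
    s = a + b                        # rounded to 5 significant digits
    print(s)                         # prints 2.0000
```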
Actually, I have no idea what users would expect about the output precision of …

I personally seldom use …
From triage: (1) We don't need to copy a BigFloat; it's OK to return the actual element we find. (2) It doesn't matter which NaN we return when there are multiple.
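The triage decision that any NaN is acceptable is what enables a branch-light formulation. Here is a hypothetical Python sketch of that idea (the PR itself is Julia, and `fast_min` is my name, not the PR's): one subtraction drives both the ordering test and the NaN path.

```python
import math

def fast_min(x: float, y: float) -> float:
    """Sketch of a branch-light min: x - y feeds both the sign test and
    the NaN path. Which NaN comes back is unspecified, per triage."""
    diff = x - y
    # signbit(diff) selects the smaller operand; this also gets
    # min(-0.0, 0.0) == -0.0 right, since -0.0 - 0.0 is -0.0.
    argmin = x if math.copysign(1.0, diff) < 0.0 else y
    anynan = (x != x) or (y != y)      # NaN is the only value with x != x
    return diff if anynan else argmin  # x - y is NaN whenever either input is
```

Note the final select uses `diff` rather than a hard-coded NaN, so whichever NaN the subtraction produces is returned, which is exactly the freedom triage granted.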
For …
I extend the current …
Update with the latest benchmark:

```julia
for T in (Float32, Float64, Float16, BigFloat)
    a = T.(randn(128)); b = T.(randn(128)); c = similar(a);
    d = minmax.(a, b);
    t1 = @benchmark $c .= min.($a, $b);
    t2 = @benchmark $c .= max.($a, $b);
    t3 = @benchmark $d .= minmax.($a, $b);
    print(T, "| min: ", round(median(t1).time, digits = 2), "ns")
    print(" max: ", round(median(t2).time, digits = 2), "ns")
    println(" minmax: ", round(median(t3).time, digits = 2), "ns")
end
```

This PR:

```
Float32 | min: 21.34ns   max: 21.26ns   minmax: 40.73ns
Float64 | min: 34.07ns   max: 34.14ns   minmax: 69.43ns
Float16 | min: 173.6ns   max: 174.18ns  minmax: 223.15ns
BigFloat| min: 1610.0ns  max: 1540.0ns  minmax: 1630.0ns
```

Master:

```
Float32 | min: 28.82ns   max: 28.92ns   minmax: 231.8ns
Float64 | min: 46.87ns   max: 46.77ns   minmax: 243.28ns
Float16 | min: 317.72ns  max: 318.14ns  minmax: 262.72ns
BigFloat| min: 5216.67ns max: 5200.0ns  minmax: 2588.89ns
```
`function minmax(x::T, y::T) where {T<:Union{Float32,Float64}}`
Is this detailed definition necessary? It seems like a more generic

```julia
minmax(x::T, y::T) where {T<:Union{Float32,Float64}} = min(x, y), max(x, y)
```

would perform identically thanks to inlining.
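For what it's worth, a hand-written `minmax` lets both results share a single subtraction, which the tuple-of-calls form only recovers if the compiler fuses the two calls. A hypothetical Python sketch of that shared-work shape (`minmax_shared` is my name, not the PR's):

```python
import math

def minmax_shared(x: float, y: float):
    """One subtraction feeds both outputs; if either input is NaN,
    both slots get the (NaN) difference."""
    diff = x - y
    if (x != x) or (y != y):              # any NaN poisons both outputs
        return (diff, diff)
    if math.copysign(1.0, diff) < 0.0:    # signbit(x - y): x is the smaller
        return (x, y)
    return (y, x)
```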
My local benchmark does show some performance difference:

```julia
julia> a = randn(1024); b = randn(1024); z = min.(a,b); zz = minmax.(a,b);

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]

julia> @benchmark $zz .= minmax.($a, $b)
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min … max):  440.404 ns … 1.189 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     442.424 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   457.627 ns ± 45.207 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> f(x, y) = min(x, y), max(x, y)
f (generic function with 1 method)

julia> @benchmark $zz .= f.($a, $b)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  497.423 ns … 3.276 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     498.969 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   515.506 ns ± 55.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```

Their LLVM IR differs only in instruction order. I'm not sure why that matters, though.
How big is the regression in speed on M1 for this? If it's not major, I'd be in favor of merging.
`_isless(x::Float16, y::Float16) = signbit(widen(x) - widen(y))`
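The `_isless` line above works because the difference of two `Float16` values, computed in a wider format, always has the correct sign: the smallest nonzero `Float16` difference is far above the wider format's underflow threshold, so the difference never rounds to zero unless the values are equal. A rough Python illustration using `struct`'s binary16 format as the `Float16` stand-in and Python's binary64 float as the wider type (in binary64 the difference is even exact; names here are hypothetical, and NaN inputs are out of scope):

```python
import math
import struct

def to_f16(x: float) -> float:
    # round-trip through IEEE binary16 to get the nearest Float16 value
    return struct.unpack("e", struct.pack("e", x))[0]

def f16_isless(x: float, y: float) -> bool:
    # signbit(widen(x) - widen(y)): the sign bit of the wide difference
    # decides the comparison; -0.0 - 0.0 == -0.0, so -0.0 sorts below 0.0
    return math.copysign(1.0, to_f16(x) - to_f16(y)) < 0.0
```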
`function min(x::T, y::T) where {T<:Union{Float32,Float64}}`
Should the new `min`, `max`, `minmax` signatures be expanded to include `Float16`? Could either add it to the list or use the defined `IEEEFloat` union.
Using `T<:IEEEFloat` everywhere that now uses `T<:Union{Float32,Float64}` is helpful. Some systems have on-chip support for `Float16`, so this is a win there. The systems that emulate `Float16` support `min`, `max`, and `minmax`, so the processing is at worst unchanged there.
I have no M1 machine at hand.
In that case, I wouldn't be against merging this.
Each of the functions considers …
This will erroneously report true for many input combinations, such as …
OK, there is a bit-op fix for my error, but never mind.
The …

The …
I'm planning to merge this at the weekend if CI passes and there are no other objections.
Thank you for the work!
* Accelerate `IEEEFloat`'s `min`/`max`/`minmax`/`Base._extrema_rf`
* Omit unneeded `BigFloat` allocation during `min`/`max`
This PR tries to fix the allocations in `minmax` for `BigFloat`. The allocation behavior of `BigFloat`'s `min`/`max` seems disputable; this PR just makes the output of `minmax` consistent with that of `min`/`max`. I just wonder which style is better?

This PR makes `min`/`max` also return the first of two NaNs (no matter their signs), like LLVM and Julia's `BigFloat`, and makes `min`/`max` a little faster for `IEEEFloat`. At the same time, `minmax` for `Float32/64` can now be properly vectorized with broadcast.

Some benchmarks: