Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make minmax faster for Float32/64 #41709

Merged
merged 7 commits into from
Jul 9, 2022
Merged

Make minmax faster for Float32/64 #41709

merged 7 commits into from
Jul 9, 2022

Conversation

N5N3
Copy link
Member

@N5N3 N5N3 commented Jul 26, 2021

  1. This PR tries to fix minmax for BigFloat, as: The allocations in BigFloat's min/max seems disputable. This PR just makes the outputs of minmax consistent with that of min/max.
julia> a = BigFloat(1//3, 300); b = BigFloat(3//5, 300);
julia> min(a, b) === a
false    -->  false
julia> max(a, b) === b
false    -->  false
julia> minmax(a, b) === (a, b)
true     -->  false
julia> min(a, b) == a
false    -->  false
julia> max(a, b) == b
false    -->  false
julia> minmax(a, b) == (a, b)
true     -->  false
  1. At present, if the two inputs are NaNs with different signs:
  • Base.min always returns the one with positive sign
  • Base.max always returns the one with negetive sign
  • Base.minmax(x, y) always returns (x, x)
    I just wonder which style is better? This PR make min max also return the first of two NaNs (no matter their signs) like LLVM and Julia's BigFloat.
  1. This PR also makes min/max a little faster for IEEEFloat.
    At the same time, minmax for Float32/64 could be proper vectorlized with broadcast now.
    Some benchmarks:
julia> a = randn(100); b = randn(100); c = similar(a); d = minmax.(a,b); @btime $c .= min.($a,$b); @btime $c .= max.($a,$b); @btime $d .= minmax.($a,$b); 
  43.448 ns (0 allocations: 0 bytes)   --->   33.494 ns (0 allocations: 0 bytes)
  44.040 ns (0 allocations: 0 bytes)   --->   37.803 ns (0 allocations: 0 bytes)
  233.696 ns (0 allocations: 0 bytes)  --->   52.584 ns (0 allocations: 0 bytes)

julia> a = randn(Float32, 100); b = randn(Float32, 100); c = similar(a); d = minmax.(a,b); @btime $c .= min.($a,$b); @btime $c .= max.($a,$b); @btime $d .= minmax.($a,$b);
  29.819 ns (0 allocations: 0 bytes)   --->    24.172 ns (0 allocations: 0 bytes)
  29.789 ns (0 allocations: 0 bytes)   --->    23.447 ns (0 allocations: 0 bytes)
  223.353 ns (0 allocations: 0 bytes)  --->    40.664 ns (0 allocations: 0 bytes)

julia> a = randn(Float16, 100); b = randn(Float16, 100); c = similar(a); d = minmax.(a,b); @btime $c .= min.($a,$b); @btime $c .= max.($a,$b); @btime $d .= minmax.($a,$b);
  239.862 ns (0 allocations: 0 bytes)  --->   135.865 ns (0 allocations: 0 bytes)
  243.970 ns (0 allocations: 0 bytes)  --->   136.424 ns (0 allocations: 0 bytes)
  206.833 ns (0 allocations: 0 bytes)  --->   200.500ns (0 allocations: 0 bytes)

@Seelengrab
Copy link
Contributor

Seelengrab commented Jul 26, 2021

The speedup for Float16/32/64 is nice, but I'm not sure why allocating more for BigFloat is more correct? I'd have expected a non-allocating version to be more in line with what min and max are supposed to do according to their docstrings, namely returning the min/max of its arguments (not a copy thereof).

@N5N3
Copy link
Member Author

N5N3 commented Jul 27, 2021

Since BigFloat is introduced in Base.MPFR, maybe such change should be done in a seperated PR?
While Base.Math only offers the generic version of AbstractFloat's min/max/minmax, and make them fast for IEEEFloat.

@N5N3 N5N3 changed the title Fix minmax for BigFloat, and make it faster for Float32/64 Make minmax faster for Float32/64 Jul 27, 2021
@Seelengrab
Copy link
Contributor

Seelengrab commented Jul 27, 2021

If the change in allocations for BigFloat wasn't intended and happened incidentally, I'd be in favor of just doing it in this PR since it touches the same (user facing) API, touching a different file seems fine with me. I think it'd just be a method specialization for BigFloat with a non-allocating version, no? Does MPFR expose some min/max API we can use directly without having to implement it ourselves?

It's possible there's some usercode out there relying on it being !== right now, but I guess we'll find out with the PkgEval runs for the next version.

@N5N3
Copy link
Member Author

N5N3 commented Jul 27, 2021

I'm not familiar with libmpfr, but Julia now use the 4 args api (2 inputs + output + rounding setting) to calculate the min/max of BigFloat, and the following code return false, as the default prec is 256.

julia> a = BigFloat(1//3,300); b = BigFloat(1//5,300);
julia> min(a, b) == b
false

It seems to be a better example for switching to a non-allocating version?

@N5N3 N5N3 marked this pull request as draft August 17, 2021 11:54
@N5N3 N5N3 marked this pull request as ready for review August 18, 2021 01:17
@N5N3
Copy link
Member Author

N5N3 commented Jan 10, 2022

I think we'd better pick this PR up. @oscardssmith @tkf
List for triage:

  1. Do we really need allocation in min/max for BigFloat.
  2. Should we treat copysign(NaN, -1) > NaN in min/max for AbstractFloat. (Or just always return the first NaN in inputs like llvm.minimum and Base.minmax)

For reference: the behavior of 2 is undefined in lastest IEEE754.

If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of
the input NaNs will provide the payload.

@oscardssmith oscardssmith added performance Must go faster triage This should be discussed on a triage call labels Jan 10, 2022
@tkf
Copy link
Member

tkf commented Jan 18, 2022

I consider min/max to be defined with respect to total orderings 1 isless and Base.isgreater since this is a crucial property for parallelism. The result happens to coincide with IEEE 754 for floats and the "poisoning" semantics of missing. Since min/max/minimum/maximum/extrema/etc. provide the "fist match" semantics, I think it makes sense to reuse the input, even for BigFloat NaNs.

Footnotes

  1. Or maybe more precisely "partial orderings that are total on the many of the subsets of values people tend to care"

@N5N3
Copy link
Member Author

N5N3 commented Jan 19, 2022

BigFloat is a little bit complex because of it's "variable" precison.
Should we do promotion in min(x::BigFloat, y::BigFloat) if they have different precision?

@tkf
Copy link
Member

tkf commented Jan 20, 2022

Hmm.... can the promotion of the precision be added to promote on BigFloats? Though I guess that's too breaking...

@N5N3
Copy link
Member Author

N5N3 commented Jan 20, 2022

If we do that, I guess all other math calls should be modified:

julia> b = BigFloat(1.0,3000) + BigFloat(1.0,200)
2.0
julia> b.prec
256

Maybe we can avoid allocation in min/max only when all inputs has the default precision?

@tkf
Copy link
Member

tkf commented Jan 20, 2022

Actually, I have no idea what users would expect about the output precision of BigFloat operations. I'm no expert so I'll shut up :)

@N5N3
Copy link
Member Author

N5N3 commented Jan 20, 2022

I personally seldom use BigFloat. So let's wait the triage result _(:зゝ∠)_.

@JeffBezanson
Copy link
Member

From triage: (1) We don't need to copy a BigFloat, ok to return the actual element we find. (2) It doesn't matter which NaN we return when there are multiple.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Jan 20, 2022
@oscardssmith
Copy link
Member

For BigFloat there's no reason to promote (it would just be slower and not more useful). The last thing to check is that this does correct things with -0.0. Is this code path sufficiently tested?

@N5N3
Copy link
Member Author

N5N3 commented Jan 21, 2022

I extend the current min/max/minmax test for Float64 to all float types in Base. (5dd01a2)
Local test shows no problem.

_extrema_rf related optimization is also added in this PR, since it's min/max related. (I think there's no need to open a new PR)

@N5N3
Copy link
Member Author

N5N3 commented Jan 21, 2022

Update the lastest Benchmark:

for T in (Float32, Float64, Float16, BigFloat)
    a = T.(randn(128)); b = T.(randn(128)); c = similar(a); 
    d = minmax.(a,b); 
    t1 = @benchmark $c .= min.($a,$b); 
    t2 = @benchmark $c .= max.($a,$b); 
    t3 = @benchmark $d .= minmax.($a,$b);
    print(T, "|  min: ", round(median(t1).time, digits = 2), "ns")
    print("  max: ", round(median(t2).time, digits = 2), "ns")
    println("  minmax: ", round(median(t3).time, digits = 2), "ns")
end

This PR:

Float32|  min: 21.34ns  max: 21.26ns  minmax: 40.73ns
Float64|  min: 34.07ns  max: 34.14ns  minmax: 69.43ns
Float16|  min: 173.6ns  max: 174.18ns  minmax: 223.15ns
BigFloat|  min: 1610.0ns  max: 1540.0ns  minmax: 1630.0ns

Master:

Float32|  min: 28.82ns  max: 28.92ns  minmax: 231.8ns
Float64|  min: 46.87ns  max: 46.77ns  minmax: 243.28ns
Float16|  min: 317.72ns  max: 318.14ns  minmax: 262.72ns
BigFloat|  min: 5216.67ns  max: 5200.0ns  minmax: 2588.89ns


function minmax(x::T, y::T) where {T<:Union{Float32,Float64}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this detailed definition necessary? It seems like a more generic
minmax(x::T, y::T) where {T<:Union{Float32,Float64}} = min(x, y), max(x, y)
would perform identically thanks to inlining.

Copy link
Member Author

@N5N3 N5N3 Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My local bench does show some performance difference.

julia> a = randn(1024); b = randn(1024); z = min.(a,b); zz = minmax.(a,b);

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]

julia> @benchmark $zz .= minmax.($a, $b)
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min  max):  440.404 ns   1.189 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     442.424 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   457.627 ns ± 45.207 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    
  █▂▃▁▅▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  440 ns          Histogram: frequency by time          702 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> f(x, y) = min(x, y), max(x, y)
f (generic function with 1 method)

julia> @benchmark $zz .= f.($a, $b)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min  max):  497.423 ns   3.276 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     498.969 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   515.506 ns ± 55.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▂ ▆▅▄▂▁                                                    ▁
  ██████████▇▆▆▅▄▃▆▆▆▅▄▅▄▄▅▃▅▅▃▃▃▂▄▃▃▃▄▃▄▄▄▄▄▄▄▄▄▄▄▅▄▅▄▅▄▅▄▄▃▄ █
  497 ns        Histogram: log(frequency) by time       759 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Their LLVM IR only have order difference. I'm not sure why this matters though.

@oscardssmith
Copy link
Member

How big is the regression in speed on M1 for this? If it's not major, I'd be in favor of merging.


_isless(x::Float16, y::Float16) = signbit(widen(x) - widen(y))

function min(x::T, y::T) where {T<:Union{Float32,Float64}}
Copy link
Contributor

@mikmoore mikmoore Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the new min, max, minmax signatures be expanded to include Float16? Could either add it to the list or use the defined IEEEFloat union.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using T<:IEEEFloat everywhere that now uses
T<:Union{Float32,Float64} is helpful. Some systems have onchip support for Float16s, so this is a win there. The systems that emulate Float16s support min, max, minmax, so the processing is at worst unchanged there.

@N5N3
Copy link
Member Author

N5N3 commented Jun 22, 2022

How big is the regression in speed on M1 for this? If it's not major, I'd be in favor of merging.

I have no M1 machine at hand.
@mcabbott did some bench in #45532 (comment).
The impact on foldl(min, ...) seems significant (3x slower) but I think #45581 would take care of this.
(It would be good if we can check the generated IR)

@oscardssmith
Copy link
Member

In that case, I wouldn't be against merging this.

@JeffreySarnoff
Copy link
Contributor

Each of the functions min(x,y), max(x,y), minmax(x,y) use this:
anynan = isnan(x) | isnan(y)

consider

for (T,U) in ((:Float64, :UInt64). (:Float32, :UInt32), (:Float16, :UInt16))
  @eval anynan(x::$T, y::$T) =
    isnan( reinterpret($T, 
      ( reinterpret($U, x) | reinterpret($U, y)) 
    )
end

@mikmoore
Copy link
Contributor

mikmoore commented Jun 22, 2022

Each of the functions min(x,y), max(x,y), minmax(x,y) use this: anynan = isnan(x) | isnan(y)

consider

for (T,U) in ((:Float64, :UInt64). (:Float32, :UInt32), (:Float16, :UInt16))
  @eval anynan(x::$T, y::$T) =
    isnan( reinterpret($T, 
      ( reinterpret($U, x) | reinterpret($U, y)) 
    )
end

This will erroneously report true for many input combinations such as anynan(Inf,1.5). But that aside, isnan(x)|isnan(y) can actually be evaluated in a single instruction on x86 (a check that x and y compare as unordered), so it's faster than this and at least as good as virtually any other option. I can't speak for other architectures.

@JeffreySarnoff
Copy link
Contributor

JeffreySarnoff commented Jun 22, 2022

Ok, there is a bitop fix for my error, but nevermind.
I see @code_native on an AMD chip uses the single instruction.

@mikmoore
Copy link
Contributor

Does Julia or the LLVM use that single instruction version?

The code_native for this min on Float64:

        vsubsd  %xmm1, %xmm0, %xmm2
        vmovq   %xmm2, %rax
        vmovapd %xmm1, %xmm3
        testq   %rax, %rax
        jns     .LBB0_2
# %bb.1:                                # %top
        vmovapd %xmm0, %xmm3
.LBB0_2:                                # %top
        vcmpordsd       %xmm1, %xmm0, %xmm0
        vblendvpd       %xmm0, %xmm3, %xmm2, %xmm0
        retq

The vcmpordsd and vblendvpd correspond to the anynan determination and ifelse selection, respectively.

@N5N3
Copy link
Member Author

N5N3 commented Jul 6, 2022

I'm planning to merge this at the weekend if CI passed and there's no other objection.

@JeffreySarnoff
Copy link
Contributor

thank you for the work

@N5N3 N5N3 merged commit dc87e60 into JuliaLang:master Jul 9, 2022
@N5N3 N5N3 deleted the float_minmax branch July 9, 2022 12:49
ffucci pushed a commit to ffucci/julia that referenced this pull request Aug 11, 2022
* Accelerate `IEEEFloat`'s `min`/`max`/`minmax`/`Base._extrema_rf`
* Omit unneed `BigFloat` allocation during `min`/`max`
pcjentsch pushed a commit to pcjentsch/julia that referenced this pull request Aug 18, 2022
* Accelerate `IEEEFloat`'s `min`/`max`/`minmax`/`Base._extrema_rf`
* Omit unneed `BigFloat` allocation during `min`/`max`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants