Make minmax faster for Float32/64 #41709
Conversation
The speedup for …

Since …

If the change in allocations for … It's possible there's some user code out there relying on it being …
I'm not familiar with …

```julia
julia> a = BigFloat(1//3, 300); b = BigFloat(1//5, 300);

julia> min(a, b) == b
false
```

It seems to be a better example for switching to a non-allocating version?
I think we'd better pick this PR up. @oscardssmith @tkf
For reference: the behavior of …
I consider …
Hmm… can the promotion of the precision be added to …
If we do that, I guess all other math calls should be modified:

```julia
julia> b = BigFloat(1.0, 3000) + BigFloat(1.0, 200)
2.0

julia> b.prec
256
```

Maybe we can avoid allocation in …
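As an aside, Python's `decimal` module follows a roughly analogous model (an analogy only, not MPFR itself): arithmetic results are rounded to the *context* precision, not to the precision of the widest operand, just as the `BigFloat` sum above comes back at the default 256 bits. A small illustrative sketch:

```python
from decimal import Decimal, localcontext

# Analogy to the BigFloat session above: results take the context's
# ("default") precision, regardless of how precise the operands are.
with localcontext() as ctx:
    ctx.prec = 5                     # context precision, like DEFAULT_PRECISION
    a = Decimal("1.0000000000")      # operands carry more digits than that
    b = Decimal("1.0000000001")
    s = a + b                        # rounded to 5 significant digits
    print(s)                         # prints 2.0000
```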
Actually, I have no idea what users would expect about the output precision of …

I personally seldom use …
From triage: (1) We don't need to copy a BigFloat; it's OK to return the actual element we find. (2) It doesn't matter which NaN we return when there are multiple.
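The triage decision that any NaN is acceptable is what enables a branch-light formulation. Here is a hypothetical Python sketch of that idea (the PR itself is Julia, and `fast_min` is my name, not the PR's): one subtraction drives both the ordering test and the NaN path.

```python
import math

def fast_min(x: float, y: float) -> float:
    """Sketch of a branch-light min: x - y feeds both the sign test and
    the NaN path. Which NaN comes back is unspecified, per triage."""
    diff = x - y
    # signbit(diff) selects the smaller operand; this also gets
    # min(-0.0, 0.0) == -0.0 right, since -0.0 - 0.0 is -0.0.
    argmin = x if math.copysign(1.0, diff) < 0.0 else y
    anynan = (x != x) or (y != y)      # NaN is the only value with x != x
    return diff if anynan else argmin  # x - y is NaN whenever either input is
```

Note the final select uses `diff` rather than a hard-coded NaN, so whichever NaN the subtraction produces is returned, which is exactly the freedom triage granted.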
For …
I extend the current …
Update with the latest benchmark:

```julia
for T in (Float32, Float64, Float16, BigFloat)
    a = T.(randn(128)); b = T.(randn(128)); c = similar(a);
    d = minmax.(a, b);
    t1 = @benchmark $c .= min.($a, $b);
    t2 = @benchmark $c .= max.($a, $b);
    t3 = @benchmark $d .= minmax.($a, $b);
    print(T, "| min: ", round(median(t1).time, digits = 2), "ns")
    print(" max: ", round(median(t2).time, digits = 2), "ns")
    println(" minmax: ", round(median(t3).time, digits = 2), "ns")
end
```

This PR:

```
Float32 | min: 21.34ns   max: 21.26ns   minmax: 40.73ns
Float64 | min: 34.07ns   max: 34.14ns   minmax: 69.43ns
Float16 | min: 173.6ns   max: 174.18ns  minmax: 223.15ns
BigFloat| min: 1610.0ns  max: 1540.0ns  minmax: 1630.0ns
```

Master:

```
Float32 | min: 28.82ns   max: 28.92ns   minmax: 231.8ns
Float64 | min: 46.87ns   max: 46.77ns   minmax: 243.28ns
Float16 | min: 317.72ns  max: 318.14ns  minmax: 262.72ns
BigFloat| min: 5216.67ns max: 5200.0ns  minmax: 2588.89ns
```
`function minmax(x::T, y::T) where {T<:Union{Float32,Float64}}`
Is this detailed definition necessary? It seems like a more generic

```julia
minmax(x::T, y::T) where {T<:Union{Float32,Float64}} = min(x, y), max(x, y)
```

would perform identically thanks to inlining.
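For what it's worth, a hand-written `minmax` lets both results share a single subtraction, which the tuple-of-calls form only recovers if the compiler fuses the two calls. A hypothetical Python sketch of that shared-work shape (`minmax_shared` is my name, not the PR's):

```python
import math

def minmax_shared(x: float, y: float):
    """One subtraction feeds both outputs; if either input is NaN,
    both slots get the (NaN) difference."""
    diff = x - y
    if (x != x) or (y != y):              # any NaN poisons both outputs
        return (diff, diff)
    if math.copysign(1.0, diff) < 0.0:    # signbit(x - y): x is the smaller
        return (x, y)
    return (y, x)
```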
My local benchmark does show some performance difference:

```julia
julia> a = randn(1024); b = randn(1024); z = min.(a,b); zz = minmax.(a,b);

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]

julia> @benchmark $zz .= minmax.($a, $b)
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min … max):  440.404 ns … 1.189 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     442.424 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   457.627 ns ± 45.207 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> f(x, y) = min(x, y), max(x, y)
f (generic function with 1 method)

julia> @benchmark $zz .= f.($a, $b)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  497.423 ns … 3.276 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     498.969 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   515.506 ns ± 55.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```

Their LLVM IR differs only in instruction order. I'm not sure why that matters, though.
How big is the regression in speed on M1 for this? If it's not major, I'd be in favor of merging.
`_isless(x::Float16, y::Float16) = signbit(widen(x) - widen(y))`
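The `_isless` line above works because the difference of two `Float16` values, computed in a wider format, always has the correct sign: the smallest nonzero `Float16` difference is far above the wider format's underflow threshold, so the difference never rounds to zero unless the values are equal. A rough Python illustration using `struct`'s binary16 format as the `Float16` stand-in and Python's binary64 float as the wider type (in binary64 the difference is even exact; names here are hypothetical, and NaN inputs are out of scope):

```python
import math
import struct

def to_f16(x: float) -> float:
    # round-trip through IEEE binary16 to get the nearest Float16 value
    return struct.unpack("e", struct.pack("e", x))[0]

def f16_isless(x: float, y: float) -> bool:
    # signbit(widen(x) - widen(y)): the sign bit of the wide difference
    # decides the comparison; -0.0 - 0.0 == -0.0, so -0.0 sorts below 0.0
    return math.copysign(1.0, to_f16(x) - to_f16(y)) < 0.0
```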
`function min(x::T, y::T) where {T<:Union{Float32,Float64}}`
Should the new `min`, `max`, `minmax` signatures be expanded to include `Float16`? Could either add it to the list or use the defined `IEEEFloat` union.
Using `T<:IEEEFloat` everywhere that now uses `T<:Union{Float32,Float64}` is helpful. Some systems have on-chip support for `Float16`, so this is a win there. The systems that emulate `Float16` support `min`, `max`, and `minmax`, so the processing is at worst unchanged there.
I have no M1 machine at hand.
In that case, I wouldn't be against merging this.
Each of the functions considers …
This will erroneously report true for many input combinations, such as …
OK, there is a bit-op fix for my error, but never mind.
The …

The …
I'm planning to merge this at the weekend if CI passes and there are no other objections.
Thank you for the work!
* Accelerate `IEEEFloat`'s `min`/`max`/`minmax`/`Base._extrema_rf`
* Omit unneeded `BigFloat` allocation during `min`/`max`
This PR tries to fix the allocations in `minmax` for `BigFloat`. The allocation behavior of `BigFloat`'s `min`/`max` seems disputable; this PR just makes the output of `minmax` consistent with that of `min`/`max`. I just wonder which style is better?

This PR makes `min`/`max` also return the first of two NaNs (no matter their signs), like LLVM and Julia's `BigFloat`, and makes `min`/`max` a little faster for `IEEEFloat`. At the same time, `minmax` for `Float32/64` can now be properly vectorized with broadcast.

Some benchmarks: