optimize @evalpoly for SIMD? #18410
Comments
Will the new algorithm still be optimal when used in a vectorized loop? |
This may be relevant for the new libm.jl work. Cc @musm |
@yuyichao, probably? Even if the compiler inlines a special-function call, `@simd` already seems to vectorize `@evalpoly` in a loop:

```julia
julia> function foo(a)
           s = zero(eltype(a))
           @simd for i in eachindex(a)
               s += @inbounds @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
           end
           return s
       end
foo (generic function with 1 method)

julia> @code_native foo(rand(10))
```

This emits SIMD instructions, if I'm reading the asm output correctly. |
I'm not worrying about inlining special functions that use `@evalpoly`. The code you posted has two problems.
P.S. I strongly recommend checking |
Ah, thanks. Even so, it might be nice to have a SIMD variant of `@evalpoly`. |
Or cases not meant to / not possible to vectorize in a loop in general, I agree. It would be nice if the new algorithm doesn't regress in the vectorized-loop case, but I guess the answer is that it might? |
Estrin's method ... I got bored in class, but the following is already more efficient than `@evalpoly`. Only works for odd polynomials for now. I don't think simd is actually being applied, however; could anyone confirm?

```julia
# only for odd polynomials
# c in descending order
using BenchmarkTools
x = 2.0
c = collect(Float64, 8:-1:1)

function estrin(x, c)
    n = length(c)
    m = n ÷ 2
    ap = c
    a = similar(c)
    X = x
    while n > 1
        @simd for i = 1:m
            @inbounds a[m+1-i] = muladd(ap[n+1-(2i-1)], X, ap[n+1-(2i-1)-1])
        end
        ap[:] = a[:]
        X = X*X
        n = m
        m = m ÷ 2
    end
    return ap[1]
end

f(x, c) = (@evalpoly x c[1] c[2] c[3] c[4] c[5] c[6] c[7] c[8])

@benchmark estrin(x, c)
@benchmark f(x, c)
```
There are a lot of other opportunities for parallelism in simple polynomial evaluation, though maybe simpler is better; e.g. all the X*X for the various powers can also be computed in parallel. |
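For reference, here is what the scheme looks like fully unrolled at degree 3 (an illustration, not code from the thread): the two inner muladds are independent and can execute in parallel, while x*x is computed alongside them.

```julia
# Estrin for p(x) = c0 + c1*x + c2*x^2 + c3*x^3
estrin3(x, c0, c1, c2, c3) = muladd(muladd(c3, x, c2), x * x, muladd(c1, x, c0))
```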
What are you trying to achieve with globals? |
My bad, fixed, thanks! |
There's also unnecessary copy/allocation. |
Or even better, IIUC, you can just swap the two buffers. |
I don't think you can get vectorization with non-unit stride (there's a trick to kind of get it working though). I just realized this is the evalpoly issue (I thought this was one of the special/libm function ones...); in this case any allocation is already a huge loss. |
Would `@inbounds ap[1:m] = a[1:m]` swap the buffers? Or am I misunderstanding what you are saying? |
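(What is presumably meant by swapping the buffers is exchanging the array bindings rather than copying elements, e.g.:)

```julia
# Rebind the names: `ap` now refers to the buffer formerly called `a`,
# and vice versa; no elements are copied and nothing is allocated.
ap, a = a, ap
```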
@musm, rather than being "more efficient", in a quick benchmark I find your `estrin` to be slower. This issue is about improving the inlined `@evalpoly`. |
apologies |
No worries @musm, I just wanted to nip the digression in the bud. |
On the topic of improving `@evalpoly`: I have a decent implementation of Estrin's algorithm that's a regular function (for odd polynomials, still). I have two versions: one without simd, which is 10x slower than @horner, and one that takes advantage of simd, which is 40x slower, based on benchmarks. EDIT: to evaluate c8*x^7 + c7*x^6 + ... + c1:
```julia
@simd begin
    a1 = c2*x + c1
    a2 = c4*x + c3
    a3 = c6*x + c5
    a4 = c8*x + c7
end
x = x*x
@simd begin
    a1 = a2*x + a1
    a2 = a4*x + a3
end
x = x*x
@simd begin
    a1 = a2*x + a1
end
return a1
```
|
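Since `@simd` only annotates `for` loops, the blocks above read as pseudocode marking groups of mutually independent operations. A runnable scalar version with the same dependency structure might look like this (a sketch, assuming the c8*x^7 + ... + c1 convention above):

```julia
# Three rounds of muladds; the operations within each round are
# independent, exposing instruction-level parallelism.
function estrin8(x, c1, c2, c3, c4, c5, c6, c7, c8)
    a1 = muladd(c2, x, c1); a2 = muladd(c4, x, c3)
    a3 = muladd(c6, x, c5); a4 = muladd(c8, x, c7)
    x2 = x * x
    b1 = muladd(a2, x2, a1); b2 = muladd(a4, x2, a3)
    return muladd(b2, x2 * x2, b1)
end
```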
The SIMD package provides the respective operations:

```julia
using SIMD

function evalpoly{T}(c1, c2, c3, c4, c5, c6, c7, c8, x::T)
    c2468 = Vec{4,T}((c2, c4, c6, c8))
    c1357 = Vec{4,T}((c1, c3, c5, c7))
    a1234 = c2468 * x + c1357
    x = x^2
    a24 = Vec{2,T}((a1234[2], a1234[4]))
    a13 = Vec{2,T}((a1234[1], a1234[3]))
    a12 = a24 * x + a13
    x = x^2
    a2 = Vec{1,T}((a12[2],))
    a1 = Vec{1,T}((a12[1],))
    r = a2 * x + a1
    r[1]
end

evalpoly(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.5)
@code_native evalpoly(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.5)
```

If you look at the generated code, you'll see immediately why this turns out to be very slow:
There are fewer actual arithmetic operations, but most of the instructions are shuffles and insert/extract operations that just move data between differently-sized vectors; the first six operations alone are setup before any arithmetic happens. |
Thanks @eschnett. I created a macro that expands out the operations in Estrin's algorithm to explore possible benefits of vectorization for polynomial evaluation using your SIMD package. See: |
@musm Nice! The expensive part is currently switching between different vector sizes. The following might lead to a faster algorithm:
However, if this |
@eschnett I'll try to take those points into consideration in an improved implementation. Another very simple option for parallelism in `@horner`:

```julia
using BenchmarkTools
using Base.Math.@horner

c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10 = (1:11...)
x = 2.0

function f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
    x2 = x*x
    # ideally we should be able to tell the compiler to run these two in
    # parallel; is threads the right approach here, i.e. run these two on
    # separate threads?
    cc1 = @horner x2 c1 c3 c5 c7 c9  # the odds
    cc2 = @horner x2 c2 c4 c6 c8 c10 # the evens
    c0 + x*cc1 + x2*cc2
end

function f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
    @horner x c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
end

g1(x) = f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
g2(x) = f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)

xx = 5.0*randn(1_000_000)
@show @benchmark g1.(xx)
@show @benchmark g2.(xx)
```

On my computer with -O3 this gives:
This is about 3x faster. Is the compiler automatically able to determine that the two horner evaluations can be run concurrently in this example? |
Julia will not yet use multiple threads for concurrent evaluation, but perhaps this is benefiting from better codegen. |
It's very unlikely cheap operations like this will benefit from threading at all. |
@musm The time difference seems to be partially due to much higher GC. Can you either call |
Also, I wonder if having all the variables in global scope affects this. Perhaps wrap everything in a function? |
I take that back, I didn't realize that all the |
If you see that you are allocating 300 MB in a benchmark that should just evaluate a polynomial, then you should look more closely at how you are benchmarking. The absolute timings you are getting are also unreasonable, given some napkin math on CPU frequency and the number of ops required. |
Because the coefficients are known at parse time in the macro, it should be able to decide on the optimal vector size at parse time and/or compile-time. There should be no runtime switching. (There can be checks like |
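To illustrate the parse-time decision (a hypothetical sketch, not the real `@evalpoly`): because `cs` is a tuple of expressions while the macro runs, `length(cs)` is known before any code executes, and the expansion itself can branch on it, e.g. plain Horner for short polynomials and an even/odd split for longer ones:

```julia
# Hypothetical macro: choose the expansion strategy at macro-expansion time.
macro poly_sketch(x, cs...)
    if length(cs) <= 4
        return :(Base.Math.@horner($(esc(x)), $(map(esc, cs)...)))
    end
    odds  = map(esc, cs[1:2:end])   # coefficients of even powers of x
    evens = map(esc, cs[2:2:end])   # coefficients of odd powers of x
    quote
        let t = $(esc(x)), t2 = t * t
            # even/odd split: p(t) = peven(t^2) + t*podd(t^2)
            muladd(t, Base.Math.@horner(t2, $(evens...)),
                      Base.Math.@horner(t2, $(odds...)))
        end
    end
end
```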
(@musm, this kind of low-level performance hacking is probably not a good introductory issue, because it requires a lot of understanding of Julia internals.) |
An even/odd splitting like this is likely to increase performance. Modern CPUs can execute instructions out-of-order, and given the number of execution ports they have, it is likely that both streams can be executed simultaneously. Alternatively, you can execute the two streams simultaneously by using 2-element SIMD vectors. That's very similar to what I suggested earlier, except with a vector size of 2 instead of 4. As others said, multi-threading will not help here. Threading introduces an overhead of about 1 us (roughly), whereas instructions are executed in fractions of a nanosecond. Thus you'll need tens of thousands of instructions to recoup the threading overhead. |
Sorry about the lack of rigorous benchmarking, I should've known better, but it was late :P To test this I implemented a macro that does an even/odd split:

```julia
using Benchmarks, PyPlot
using Base.Math.@horner, ParPoly

x = 2.0
nn = 4:25
t_horner_split = []
t_horner = []
t_horner_split_m = []
t_horner_m = []

for n in nn
    println("running bench for order $(n-1)")
    c = randn(Float64, n)
    @gensym f1
    @gensym f2
    @eval begin
        $f1(x) = @horner_split x $(c...)
        $f2(x) = @horner x $(c...)
        t_hs = @benchmark $f1(x)
        t_h = @benchmark $f2(x)
    end
    push!(t_horner_split_m, Benchmarks.SummaryStatistics(t_hs).elapsed_time_center)
    push!(t_horner_m, Benchmarks.SummaryStatistics(t_h).elapsed_time_center)
end

plot(nn-1, t_horner_split_m, label="horner_split")
plot(nn-1, t_horner_m, label="horner")
legend(loc="upper left")
xlabel("polynomial order")
ylabel("time (ns)")
```
If I understand correctly, do you mean something like the following?

```julia
x1 = 2.0
x2 = x1*x1
x1x2 = Vec{2,Float64}((x1, x2))
xx   = Vec{2,Float64}((x2, x2))
cc1  = Vec{2,Float64}((c1, c2))
cc2  = Vec{2,Float64}((c3, c4))
cc3  = Vec{2,Float64}((c5, c6))
cc4  = Vec{2,Float64}((c7, c8))
cc5  = Vec{2,Float64}((c9, c10))
C = cc1 + xx*(cc2 + xx*(cc3 + xx*(cc4 + xx*cc5)))
val = c0 + sum(x1x2*C)
```

This would need to allocate a lot of new variables (not sure whether such a large number of allocations would outweigh the possible increase in execution speed). I would think that a more parallelized version would benefit more strongly from such an approach, a la Estrin's scheme or similar. |
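One note on the allocation worry: if I understand SIMD.jl correctly, `Vec{N,T}` is an immutable isbits type, so these intermediates live in registers or on the stack; they are not heap allocations.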
Again, don't use broadcast to benchmark; use the scalar form directly. Also, those |
I also believe that you should be seeing the function overwrite warning and you should not ignore it. You are most likely benchmarking only a single function which is why the scaling in your result doesn't make sense. |
@yuyichao Does that look better? I do see the warnings; the methods get overwritten, so what's the problem then? (Should probably be a semilogy plot...) |
The time for the low-order polynomials is way too long. You shouldn't get tens of ns for that unless you are using a really slow CPU. Try:

```julia
for n in nn
    c = randn(Float64, n)
    @gensym f1
    @gensym f2
    @eval begin
        $f1(x) = @horner_split x $(c...)
        $f2(x) = @horner x $(c...)
        t_hs = @benchmark $f1($x)
        t_h = @benchmark $f2($x)
    end
    push!(t_horner_split, median(t_hs).time)
    push!(t_horner, median(t_h).time)
end
```
|
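The `$x` matters as much as `$f1`: inside `@eval`, it splices the current value of the global `x` into the expression as a constant, so the benchmark no longer measures the lookup of a non-const global. A minimal illustration (hypothetical name):

```julia
x = 2.0
@eval bench() = sin($x)   # expands to: bench() = sin(2.0), no global access
```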
@musm, good investigation – benchmarking this kind of stuff is tricky and this is how you learn :) |
It does seem that you might hit JuliaCI/BenchmarkTools.jl#7, which is why I still use |
Anecdotally, using Benchmarks.jl. EDIT: code: https://github.com/musm/ParPoly.jl/blob/master/bench/bench.jl |
I've been doing some more benchmarking and I see very good results with the simple even/odd split. In any case, using the horner_split macro in the libm.jl package I see very nice perf gains. Cf. the SIMD version:

```julia
julia> @code_llvm g(x)

; Function Attrs: uwtable
define double @julia_g_66617(double) #0 {
top:
  %1 = fmul double %0, %0
  %2 = insertelement <2 x double> undef, double %0, i32 0
  %3 = insertelement <2 x double> %2, double %1, i32 1
  %4 = insertelement <2 x double> undef, double %1, i32 0
  %5 = insertelement <2 x double> %4, double %1, i32 1
  %res.i = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> <double 1.400000e+01, double 1.400000e+01>, <2 x double> <double 1.200000e+01, double 1.300000e+01>)
  %res.i.1 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i, <2 x double> <double 1.000000e+01, double 1.100000e+01>)
  %res.i.2 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.1, <2 x double> <double 8.000000e+00, double 9.000000e+00>)
  %res.i.3 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.2, <2 x double> <double 6.000000e+00, double 7.000000e+00>)
  %res.i.4 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.3, <2 x double> <double 4.000000e+00, double 5.000000e+00>)
  %res.i.5 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.4, <2 x double> <double 2.000000e+00, double 3.000000e+00>)
  %res.i.6 = fmul <2 x double> %3, %res.i.5
  %vec_1_1.i = shufflevector <2 x double> %res.i.6, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %res.i.6, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i.7 = extractelement <1 x double> %vec_1.i, i32 0
  %6 = fadd double %res.i.7, 1.000000e+00
  ret double %6
}
```
Serial version of the split macro:

```julia
julia> @code_llvm f(x)

; Function Attrs: uwtable
define double @julia_f_66618(double) #0 {
top:
  %1 = fmul double %0, %0
  %2 = call double @llvm.fmuladd.f64(double %1, double 1.400000e+01, double 1.200000e+01)
  %3 = call double @llvm.fmuladd.f64(double %1, double %2, double 1.000000e+01)
  %4 = call double @llvm.fmuladd.f64(double %1, double %3, double 8.000000e+00)
  %5 = call double @llvm.fmuladd.f64(double %1, double %4, double 6.000000e+00)
  %6 = call double @llvm.fmuladd.f64(double %1, double %5, double 4.000000e+00)
  %7 = call double @llvm.fmuladd.f64(double %1, double %6, double 2.000000e+00)
  %8 = call double @llvm.fmuladd.f64(double %1, double 1.400000e+01, double 1.300000e+01)
  %9 = call double @llvm.fmuladd.f64(double %1, double %8, double 1.100000e+01)
  %10 = call double @llvm.fmuladd.f64(double %1, double %9, double 9.000000e+00)
  %11 = call double @llvm.fmuladd.f64(double %1, double %10, double 7.000000e+00)
  %12 = call double @llvm.fmuladd.f64(double %1, double %11, double 5.000000e+00)
  %13 = call double @llvm.fmuladd.f64(double %1, double %12, double 3.000000e+00)
  %14 = fmul double %7, %0
  %15 = fadd double %14, 1.000000e+00
  %16 = fmul double %1, %13
  %17 = fadd double %15, %16
  ret double %17
}
```
|
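Note that even the serial version contains two independent `fmuladd` chains (`%2`..`%7` and `%8`..`%13`) that an out-of-order core can overlap, which is exactly the even/odd instruction-level parallelism described earlier, with no vector shuffles required.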
Seems like we should update the |
Well, technically it would no longer be a Horner scheme (there is a chance that accuracy guarantees are predicated on it being evaluated in this manner) |
I think Viral meant update |
For help with this sort of low-level optimisation, you might want to check out https://github.com/carnaval/IACA.jl |
Edit: sensible benchmark results posted above. Looks like the overhead of setting up SIMD doesn't pay off until the polynomial is of order 18, and I'm guessing very few people are evaluating polynomials of order greater than 20... |
Did you try setting up more than 2 independent instruction streams? E.g. using 4 streams might be even faster, given that multiplication and addition have several cycles of latency. |
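For concreteness, a 4-stream scalar variant might look like the following sketch (degree 7, an illustration rather than code from the thread):

```julia
# Four-way split: p(x) = s0(x^4) + x*s1(x^4) + x^2*s2(x^4) + x^3*s3(x^4),
# giving four independent Horner chains in x^4.
function horner4(x, c0, c1, c2, c3, c4, c5, c6, c7)
    x2 = x * x
    x4 = x2 * x2
    s0 = muladd(c4, x4, c0)
    s1 = muladd(c5, x4, c1)
    s2 = muladd(c6, x4, c2)
    s3 = muladd(c7, x4, c3)
    return muladd(x2, muladd(s3, x, s2), muladd(s1, x, s0))
end
```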
Reviving this – I've started a (WIP) package for SIMD polynomial evaluation here: https://github.com/augustt198/SIMDPoly.jl. Using 4-wide vectors with real inputs, it seems SIMD becomes advantageous around degree 12, if my benchmarking is correct. SIMD polynomial evaluation with complex inputs is around 2x faster than `@evalpoly`. SIMD.jl is a dependency, so I don't think there's much that could be transferred to Base. |
@augustt198 nice work! In your benchmark code (https://github.com/augustt198/SIMDPoly.jl/blob/master/test/benchmark2.jl), realistically it should also time how long it takes to pack the coefficients, if it's to compare fairly. The timings including packing highlight what @eschnett mentioned back in #18410 (comment): they become hopelessly slow for the SIMD version 😢 |
Sure I can add a disclaimer about that. I'm mostly interested in cases where the coefficients are constants: Taylor series, rational approximations, etc. |
True, I guess in practice that's not the main bottleneck. So perhaps just ignore my comment then 😄 |
Right now, `@evalpoly(x, coefs...)` uses Horner's rule for real `x` and a more complicated algorithm for complex `x`, and is performance-critical for the evaluation of special functions. However, it seems likely to be faster to try to exploit SIMD instructions for these polynomial evaluations, e.g. with Estrin's algorithm or some other technique.

(The nice thing about code generation / macros is that we can investigate fancy methods for this sort of thing, protected by an `isa(x, Float64)` check that will get optimized out at compile-time.)