Comparison with BasicInterpolators.jl #5
This is a bit surprising to me, since our inner loops (for the Clenshaw recurrence) are virtually identical in computational effort, especially in the 1d case. Compare:

```julia
using BasicInterpolators, FastChebInterp, BenchmarkTools

n = 200
p = ChebyshevInterpolator(sin, 0.0, 1.0, n)
q = chebfit(sin.(chebpoints(n-1, 0, 1)), 0, 1, tol=0)

@btime $p(0.1); @btime $q(0.1);
```

Any ideas?
The following are my inner loop (`cheb1`) and yours (`cheb2`), benchmarked in isolation:

```julia
# my inner loop: Clenshaw recurrence, iterating over the coefficients in reverse
function cheb1(c, xd, i1=1)
    n = length(c)
    c₁ = c[i1]
    if n ≤ 2
        n == 1 && return c₁ + one(xd) * zero(c₁)
        return c₁ + xd*c[i1+1]
    end
    @inbounds bₖ = c[i1+(n-2)] + 2xd*c[i1+(n-1)]
    @inbounds bₖ₊₁ = oftype(bₖ, c[i1+(n-1)])
    for j = n-3:-1:1
        @inbounds bⱼ = c[i1+j] + 2xd*bₖ - bₖ₊₁
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return c₁ + xd*bₖ - bₖ₊₁
end

# your inner loop: forward Chebyshev recurrence accumulated into a dot product
function cheb2(a, ξ)
    N = length(a)
    # first two elements of cheby recursion
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    # first two terms of dot product
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    # cheby recursion and rest of terms in dot product, all at once
    for k = 3:N
        # next value in recursion
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        # next term in dot product
        @inbounds y += Tₖ*a[k]
        # swaps
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

c = rand(200)
@btime cheb1($c, 0.3);
@btime cheb2($c, 0.3);
```
I wonder if it's just due to cache effects from iterating over the coefficient array in reverse order (my code) vs. forward order (your code). If so, I could fix it just by reversing the coefficient order. (My code uses the Clenshaw recurrence, which is analogous to Horner's method: it evaluates the polynomial from right to left. Your code evaluates from left to right, computing the Chebyshev polynomial values by the three-term recurrence, which is analogous to iteratively computing 1, x, x², x³, … from left to right and multiplying by the coefficients as you go. As far as I know, both methods are backwards stable? They seem to give similar accuracy in practice.)
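To illustrate the analogy with plain polynomials, here is a hypothetical sketch (not code from either package): `horner` evaluates right to left, while `powers_eval` evaluates left to right by building up the powers of x and accumulating the dot product as it goes.

```julia
# Hypothetical illustration of the two evaluation orders for p(x) = Σ c[k+1]*x^k.

# Right-to-left (Horner): one multiply and one add per coefficient.
function horner(c, x)
    y = c[end]
    for k = length(c)-1:-1:1
        y = y*x + c[k]
    end
    return y
end

# Left-to-right: build up 1, x, x², … and accumulate the dot product,
# which costs an extra multiply per term (one for xᵏ, one for c[k]*xᵏ).
function powers_eval(c, x)
    xk = one(x)
    y = c[1] * xk
    for k = 2:length(c)
        xk *= x
        y += c[k] * xk
    end
    return y
end
```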
No, simply reversing the order of iteration (by reversing the coefficient order) in my code speeds things up by 5–10%, but not enough to catch up:

```julia
# Clenshaw recurrence as in cheb1, but iterating forward over
# coefficients stored in reverse order
function cheb1r(c, xd)
    n = length(c)
    cₙ = c[n]
    if n ≤ 2
        n == 1 && return cₙ + one(xd) * zero(cₙ)
        return cₙ + xd*c[1]
    end
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return cₙ + xd*bₖ - bₖ₊₁
end

@btime cheb1r($(reverse(c)), 0.3);
```
In fact, I just noticed that your inner loop requires more floating-point operations (one more multiply per iteration) than mine! (This is why people use Horner's method and the Clenshaw recurrence: evaluating from right to left requires fewer multiplications.)
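Spelling out the count for the two loops above (my reading of them; I assume the multiplication by 2 is either hoisted out of the loop by the compiler or is a cheap doubling):

```julia
# Per-iteration floating-point work in the two inner loops above
# (assuming the factor-of-two scaling of x is hoisted or folded):
#
#   cheb1 / cheb1r:   bⱼ = c[j] + 2xd*bₖ - bₖ₊₁     # 1 multiply, 2 adds
#
#   cheb2:            Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂          # 1 multiply, 1 add
#                     y += Tₖ*a[k]                   # 1 multiply, 1 add
#
# so the forward recurrence costs one extra multiply per coefficient.
```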
I see, I see... I'm getting roughly the same timing results though: about 540 ns for `cheb1r` and 423 ns for `cheb2`.
I think I found the issue: the compiler is apparently not finding a multiply-add opportunity in my code, but maybe found it in yours. This can be fixed by explicitly calling `muladd`.
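For concreteness, a sketch of what such a change could look like in the `cheb1` loop, assuming `muladd` is indeed the call in question (hypothetical, not necessarily the actual commit):

```julia
# Sketch: cheb1 rewritten with explicit fused multiply-adds via muladd,
# which exposes the multiply-add pattern to the compiler directly.
function cheb1_muladd(c, xd, i1=1)
    n = length(c)
    c₁ = c[i1]
    if n ≤ 2
        n == 1 && return c₁ + one(xd) * zero(c₁)
        @inbounds return muladd(xd, c[i1+1], c₁)
    end
    @inbounds bₖ = muladd(2xd, c[i1+(n-1)], c[i1+(n-2)])
    @inbounds bₖ₊₁ = oftype(bₖ, c[i1+(n-1)])
    for j = n-3:-1:1
        # same value as c[i1+j] + 2xd*bₖ - bₖ₊₁, expressed as a fused multiply-add
        @inbounds bⱼ = muladd(2xd, bₖ, c[i1+j] - bₖ₊₁)
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return muladd(xd, bₖ, c₁ - bₖ₊₁)
end
```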
Weird. I was just doing some profiling and found that if you remove the
Oh wow, using
Yes, with the latest commit (3340207) I'm now getting comparable speed for 1d interpolation, though it's still a bit slower for 2d at low orders, maybe due to the overhead of recursion?
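(For context on where recursion enters, here is a hedged sketch of the general idea of evaluating an N-dimensional tensor-product Chebyshev series by recursing over dimensions, reusing `cheb1` from above; this is only an illustration of the approach, not FastChebInterp's actual implementation, and the `mapslices`/recursion overhead it incurs is exactly the kind of cost that matters at low orders.)

```julia
# Sketch of dimension-recursive evaluation of a tensor-product Chebyshev series:
# apply the 1d Clenshaw kernel (cheb1 above) along the last axis, then recurse
# on the remaining N-1 dimensions. Illustration only, not the package's code.
function chebeval_nd(c::AbstractArray{<:Number,N}, x::NTuple{N,Real}) where {N}
    N == 1 && return cheb1(vec(c), x[1])
    # collapse the last dimension with 1d Clenshaw at coordinate x[N] ...
    reduced = mapslices(v -> cheb1(v, x[N]), c; dims=N)
    # ... then recurse on the lower-dimensional coefficient array
    return chebeval_nd(dropdims(reduced; dims=N), x[1:N-1])
end

# usage: chebeval_nd(rand(10, 10), (0.3, 0.7))
```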
I'm not sure exactly what's going on in the 2D case. Here the implementations do actually diverge a fair amount. My evaluation function isn't generalized to any number of dimensions, and I tried to optimize it for the specific 2D case. When I was first fiddling with it, I realized that most of the work can be reduced to a matrix multiplication, which was noticeably faster than other approaches.
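A hedged sketch of the matrix-multiplication idea (an illustration of the general technique, not BasicInterpolators' actual code): for a 2D tensor-product series with coefficient matrix `A`, evaluation at `(x, y)` amounts to the bilinear form `Tx' * A * Ty`, where `Tx` and `Ty` hold the Chebyshev polynomial values at `x` and `y`.

```julia
using LinearAlgebra

# Chebyshev polynomial values T₀(ξ), …, T_{n-1}(ξ) via the three-term recurrence.
function chebvals(n, ξ)
    T = Vector{typeof(float(ξ))}(undef, n)
    T[1] = one(ξ)
    n ≥ 2 && (T[2] = ξ)
    for k = 3:n
        T[k] = 2ξ*T[k-1] - T[k-2]
    end
    return T
end

# 2D evaluation as a vector-matrix-vector product: Σᵢⱼ A[i,j] Tᵢ₋₁(x) Tⱼ₋₁(y).
cheb2d(A, x, y) = dot(chebvals(size(A, 1), x), A * chebvals(size(A, 2), y))
```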
Hello hello, I'm opening the issue requested here.
I've been working on 1D and 2D Chebyshev interpolators in BasicInterpolators.jl. My code does not generalize to N-dimensions like this project. It appears to be faster, though. Here are the benchmarking code and resulting plots: