
Comparison with BasicInterpolators.jl #5

Closed · markmbaum opened this issue Sep 10, 2021 · 11 comments

@markmbaum commented Sep 10, 2021

Hello hello, I'm opening the issue requested here.

I've been working on 1D and 2D Chebyshev interpolators in BasicInterpolators.jl. My code does not generalize to N dimensions like this project's does, but it appears to be faster. Here are the benchmarking code and resulting plots:

using BasicInterpolators, FastChebInterp, BenchmarkTools, Statistics, Plots

##

#numbers of points to test
N = 4:2:32
#timing results
t = zeros(length(N),2) #timing for BasicInterpolators
u = zeros(length(N),2) #timing for FastChebInterp
#test coordinates
xx = rand(1000)
pts = [rand(2) for i ∈ 1:1000]

#one dimensional cases
for (i,n) ∈ enumerate(N)

    #BasicInterpolators 1D
    p = ChebyshevInterpolator(sin, 0.0, 1.0, n)
    b = @benchmark map(x -> $p(x), $xx)
    t[i,1] = mean(b.times)

    #FastChebInterp 1D
    x = chebpoints(n-1, 0, 1)
    q = chebfit(sin.(x), 0.0, 1.0, tol=0.0)
    b = @benchmark map(x -> $q(x), $xx)
    u[i,1] = mean(b.times)

    #BasicInterpolators 2D
    P = BichebyshevInterpolator((x,y)->sin(x)*cos(y), 0, 1, n, 0, 1, n)
    b = @benchmark map(x -> $P(x,x), $xx)
    t[i,2] = mean(b.times)

    #FastChebInterp 2D
    x = chebpoints(n-1, 0, 1)
    X = x' .* ones(n)
    Y = ones(n)' .* x
    Q = chebfit(sin.(X).*cos.(Y), [0.0,0.0], [1.0,1.0])
    b = @benchmark map(pt -> $Q(pt), $pts)
    u[i,2] = mean(b.times)
end

##

p₁ = plot(N, t[:,1]*1e-6,
    label="BasicInterpolators",
    xlabel="Number of Points",
    ylabel="Time per 1000 interpolations (ms)",
    title="One Dimension")
plot!(p₁, N, u[:,1]*1e-6, label="FastChebInterp")
#savefig(p₁, "one_dimension")

p₂ = plot(N, t[:,2]*1e-6,
    label="BasicInterpolators",
    xlabel="Number of Points",
    ylabel="Time per 1000 interpolations (ms)",
    title="Two Dimensions")
plot!(p₂, N, u[:,2]*1e-6, label="FastChebInterp")
#savefig(p₂, "two_dimensions")

[attached plots: one_dimension, two_dimensions]

@stevengj (Member) commented Sep 10, 2021

This is a bit surprising to me, since our inner loops (for the Clenshaw recurrence) are virtually identical in computational effort, especially in the 1d case. Compare
https://github.com/markmbaum/BasicInterpolators.jl/blob/48d8e0442761744c5447a45e358093244d7636a0/src/chebyshev.jl#L149-L157

with
https://github.com/stevengj/FastChebInterp.jl/blob/fab7e9d4f478f1f37fa9f28c96af72ca9cc40d66/src/eval.jl#L27-L29

Going to a larger n=200, just to eliminate any overheads from argument checks etcetera, I'm still getting a 30% slowdown:

using BasicInterpolators, FastChebInterp, BenchmarkTools
n = 200
p = ChebyshevInterpolator(sin, 0.0, 1.0, n)
q = chebfit(sin.(chebpoints(n-1, 0, 1)), 0, 1, tol=0)
@btime $p(0.1); @btime $q(0.1);

gives

  508.497 ns (0 allocations: 0 bytes)
  714.659 ns (0 allocations: 0 bytes)

Any ideas?

@stevengj (Member) commented Sep 10, 2021

The following are my inner loop (cheb1) and your inner loop (cheb2). Both give identical results (up to floating-point error), but cheb1 seems to be about 30% slower for n=200:

function cheb1(c, xd, i1=1)
    n = length(c)
    c₁ = c[i1]
    if n ≤ 2
        n == 1 && return c₁ + one(xd) * zero(c₁)
        return c₁ + xd*c[i1+1]
    end
    end
    @inbounds bₖ = c[i1+(n-2)] + 2xd*c[i1+(n-1)]
    @inbounds bₖ₊₁ = oftype(bₖ, c[i1+(n-1)])
    for j = n-3:-1:1
        @inbounds bⱼ = c[i1+j] + 2xd*bₖ - bₖ₊₁
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return c₁ + xd*bₖ - bₖ₊₁
end

function cheb2(a, ξ)
    N = length(a)
    #first two elements of cheby recursion
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    #first two terms of dot product
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    #cheby recursion and rest of terms in dot product, all at once
    for k = 3:N
        #next value in recursion
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        #next term in dot product
        @inbounds y += Tₖ*a[k]
        #swaps
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

c = rand(200)
@btime cheb1($c, 0.3);
@btime cheb2($c, 0.3);

gives

  657.621 ns (0 allocations: 0 bytes)
  490.366 ns (0 allocations: 0 bytes)

@stevengj (Member) commented Sep 10, 2021

I wonder if it's just due to the cache effects of iterating over the coefficient array in reverse order (my code) vs forward order (your code). If so, I could fix it just by calling reverse! on my coefficients after they are constructed and then changing my loops accordingly, or alternatively by using a slightly different recurrence more similar to yours.

(My code uses the Clenshaw recurrence, which is analogous to Horner's method — evaluating the polynomial from right to left. Your code evaluates from left to right, computing the Chebyshev polynomial values by their recurrence, which is analogous to evaluating 1,x,x²,x³,… iteratively from left to right and multiplying by the coefficients as you go. As far as I know, both methods are backward stable; they seem to give similar accuracy in practice.)
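For an ordinary polynomial c₁ + c₂x + c₃x² + … the two evaluation orders look like this (a minimal sketch with made-up names, just to illustrate the analogy, not code from either package):

# Right-to-left (Horner): one multiply and one add per coefficient.
function horner(c, x)
    y = c[end]
    for j in length(c)-1:-1:1
        y = y*x + c[j]   # fold the next coefficient into the running value
    end
    return y
end

# Left-to-right: build the powers 1, x, x², … as you go, which costs an
# extra multiply per term (one for the power, one for the dot product).
function lefttoright(c, x)
    xᵏ = one(x)
    y = c[1]
    for k in 2:length(c)
        xᵏ *= x          # next power of x
        y += c[k]*xᵏ     # next term of the dot product
    end
    return y
end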

@stevengj (Member) commented Sep 10, 2021

No, simply reversing the order of iteration (by reversing the coefficient order) in my code speeds things up by 5–10%, but not enough to catch up:

function cheb1r(c, xd)
    n = length(c)
    cₙ = c[n]
    if n ≤ 2
        n == 1 && return cₙ + one(xd) * zero(cₙ)
        return cₙ + xd*c[1]
    end
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return cₙ + xd*bₖ - bₖ₊₁
end

@btime cheb1r($(reverse(c)), 0.3);

gives

608.500 ns (0 allocations: 0 bytes)

for the c = rand(200) as above.

@stevengj (Member) commented

In fact, I just noticed that your inner loop requires more floating-point operations (one more multiply) than mine!

(This is why people use Horner's method and Clenshaw recurrences — evaluating from right to left requires fewer multiplications.)
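To make the count explicit (with the constant x2 = 2x hoisted out of the loop): the Clenshaw update bⱼ = c[j] + x2*bₖ - bₖ₊₁ is one multiply and two add/subtracts per coefficient, while the forward loop needs two multiplies and two add/subtracts, one pair for the recurrence Tₖ = x2*Tₖ₋₁ - Tₖ₋₂ and one for the dot-product term y += Tₖ*a[k].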

@markmbaum (Author) commented

I see, I see... I'm getting roughly the same timing results though. About 540 ns for cheb1r and 423 ns for cheb2.

@stevengj (Member) commented Sep 10, 2021

I think I found the issue: the compiler is apparently not finding a multiply-add opportunity in my code, but maybe found it in yours. This can be fixed by explicitly calling muladd (https://discourse.julialang.org/t/two-nearly-identical-loops-why-is-one-faster/67998).
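A sketch of what that change might look like, applied to the cheb1r loop above (cheb1r_muladd is a made-up name; muladd(a, b, c) computes a*b + c and lets the compiler emit a fused multiply-add where the hardware supports it):

function cheb1r_muladd(c, xd)
    n = length(c)
    cₙ = c[n]
    if n ≤ 2
        n == 1 && return cₙ + one(xd) * zero(cₙ)
        return muladd(xd, c[1], cₙ)
    end
    x2 = 2xd   # hoist the constant factor out of the loop
    @inbounds bₖ = muladd(x2, c[1], c[2])
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = muladd(x2, bₖ, c[j] - bₖ₊₁)  # fma-able Clenshaw step
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return muladd(xd, bₖ, cₙ) - bₖ₊₁
end

Benchmarking it the same way as above (@btime cheb1r_muladd($(reverse(c)), 0.3)) should show whether the missed multiply-add was the only thing holding cheb1r back.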

@markmbaum (Author) commented Sep 10, 2021

Weird. I was just doing some profiling and found that if you remove the c[j] + part of your loop, the function becomes about 100 ns faster than cheb2. Obviously, you can't do that in practice, but the only difference in profiles was the extra block for + operations in cheb1r. Seems like muladd will fix that.

@markmbaum (Author) commented Sep 10, 2021

Oh wow, using muladd and the Clenshaw recurrence is actually a lot faster. I'm seeing that cheb1r is now about 4x faster than cheb2. I never would have noticed that.

@stevengj (Member) commented Sep 10, 2021

Yes, with the latest commit (3340207) I'm now getting comparable speed for 1d interpolation:
[benchmark plot: one dimension]

though it's still a bit slower for 2d with low orders, maybe due to the overhead of recursion?
[benchmark plot: two dimensions]
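For reference, the shape of that per-dimension recursion in 2d, using the cheb2 routine from above as the 1-d kernel (an illustrative sketch, not FastChebInterp's actual code): collapse the coefficient matrix along one dimension with a 1-d evaluation per column, then evaluate the resulting 1-d series in the other variable.

# Illustrative 2-d evaluation by nested 1-d recurrences; assumes at least
# two coefficients in each dimension. The temporary vector g (plus the
# call overhead per dimension) is the kind of cost a low-order benchmark
# would feel most.
function cheb2d(C::AbstractMatrix, x, y)
    g = [cheb2(view(C, :, j), x) for j in 1:size(C, 2)]  # collapse the x dimension
    return cheb2(g, y)                                   # then evaluate in y
end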

@markmbaum (Author) commented

I'm not sure exactly what's going on in the 2D case. Here the implementations do actually diverge a fair amount. My evaluation function isn't generalized for any number of dimensions and I tried to optimize it for the specific case. When I was first fiddling with it, I realized that most of the work can be reduced to a matrix multiplication, which was noticeably faster than other approaches.
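A sketch of that matrix-multiplication idea (hypothetical code, not the actual BichebyshevInterpolator internals): fill vectors Tx and Ty with the Chebyshev polynomial values at the evaluation point, and the interpolant is the bilinear form Tx' * C * Ty, so most of the work is a matrix-vector product that BLAS handles well.

# Fill T with T₀(ξ), T₁(ξ), … via the three-term recurrence
# (assumes length(T) ≥ 2).
function chebT!(T::AbstractVector, ξ)
    T[1] = one(ξ)
    T[2] = ξ
    for k in 3:length(T)
        T[k] = 2ξ*T[k-1] - T[k-2]
    end
    return T
end

# Evaluate the bivariate series with coefficient matrix C at (x, y):
# one matrix-vector product plus one dot product. Passing preallocated
# Tx and Ty (the defaults here allocate fresh ones) avoids per-call allocation.
function cheb2d_mat(C::AbstractMatrix, x, y,
                    Tx=zeros(size(C,1)), Ty=zeros(size(C,2)))
    chebT!(Tx, x)
    chebT!(Ty, y)
    return Tx' * (C * Ty)
end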
