
Comparison with BasicInterpolators.jl #5

Closed · markmbaum opened this issue Sep 10, 2021 · 11 comments

@markmbaum commented Sep 10, 2021

Hello hello, I'm opening the issue requested here.

I've been working on 1D and 2D Chebyshev interpolators in BasicInterpolators.jl. My code does not generalize to N dimensions like this project's does, but it appears to be faster. Here are the benchmarking code and resulting plots:

using BasicInterpolators, FastChebInterp, BenchmarkTools, Statistics, Plots

##

#numbers of points to test
N = 4:2:32
#timing results
t = zeros(length(N),2) #timing for BasicInterpolators
u = zeros(length(N),2) #timing for FastChebInterp
#test coordinates
xx = rand(1000)
pts = [rand(2) for i ∈ 1:1000]

#one dimensional cases
for (i,n) ∈ enumerate(N)

    #BasicInterpolators 1D
    p = ChebyshevInterpolator(sin, 0.0, 1.0, n)
    b = @benchmark map(x -> $p(x), $xx)
    t[i,1] = mean(b.times)

    #FastChebInterp 1D
    x = chebpoints(n-1, 0, 1)
    q = chebfit(sin.(x), 0.0, 1.0, tol=0.0)
    b = @benchmark map(x -> $q(x), $xx)
    u[i,1] = mean(b.times)

    #BasicInterpolators 2D
    P = BichebyshevInterpolator((x,y)->sin(x)*cos(y), 0, 1, n, 0, 1, n)
    b = @benchmark map(x -> $P(x,x), $xx)
    t[i,2] = mean(b.times)

    #FastChebInterp 2D
    x = chebpoints(n-1, 0, 1)
    X = x' .* ones(n)
    Y = ones(n)' .* x
    Q = chebfit(sin.(X).*cos.(Y), [0.0,0.0], [1.0,1.0])
    b = @benchmark map(pt -> $Q(pt), $pts)
    u[i,2] = mean(b.times)
end

##

p₁ = plot(N, t[:,1]*1e-6,
    label="BasicInterpolators",
    xlabel="Number of Points",
    ylabel="Time per 1000 interpolations (ms)",
    title="One Dimension")
plot!(p₁, N, u[:,1]*1e-6, label="FastChebInterp")
#savefig(p₁, "one_dimension")

p₂ = plot(N, t[:,2]*1e-6,
    label="BasicInterpolators",
    xlabel="Number of Points",
    ylabel="Time per 1000 interpolations (ms)",
    title="Two Dimensions")
plot!(p₂, N, u[:,2]*1e-6, label="FastChebInterp")
#savefig(p₂, "two_dimensions")

[attached plots: one_dimension, two_dimensions]

@stevengj (Member) commented Sep 10, 2021

This is a bit surprising to me, since our inner loops (for the Clenshaw recurrence) are virtually identical in computational effort, especially in the 1d case. Compare
https://github.com/markmbaum/BasicInterpolators.jl/blob/48d8e0442761744c5447a45e358093244d7636a0/src/chebyshev.jl#L149-L157

with
https://github.com/stevengj/FastChebInterp.jl/blob/fab7e9d4f478f1f37fa9f28c96af72ca9cc40d66/src/eval.jl#L27-L29

Going to a larger n=200, just to eliminate any overheads from argument checks etcetera, I'm still getting a 30% slowdown:

using BasicInterpolators, FastChebInterp, BenchmarkTools
n = 200
p = ChebyshevInterpolator(sin, 0.0, 1.0, n)
q = chebfit(sin.(chebpoints(n-1, 0, 1)), 0, 1, tol=0)
@btime $p(0.1); @btime $q(0.1);

gives

  508.497 ns (0 allocations: 0 bytes)
  714.659 ns (0 allocations: 0 bytes)

Any ideas?

@stevengj (Member) commented Sep 10, 2021

The following are my inner loop (cheb1) and your inner loop (cheb2). Both give identical results (up to floating-point error), but cheb1 seems to be about 30% slower for n=200:

function cheb1(c, xd, i1=1)
    n = length(c)
    c₁ = c[i1]
    if n ≤ 2
        n == 1 && return c₁ + one(xd) * zero(c₁)
        return c₁ + xd*c[i1+1]
    end
    end
    @inbounds bₖ = c[i1+(n-2)] + 2xd*c[i1+(n-1)]
    @inbounds bₖ₊₁ = oftype(bₖ, c[i1+(n-1)])
    for j = n-3:-1:1
        @inbounds bⱼ = c[i1+j] + 2xd*bₖ - bₖ₊₁
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return c₁ + xd*bₖ - bₖ₊₁
end

function cheb2(a, ξ)
    N = length(a)
    #first two elements of cheby recursion
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    #first two terms of dot product
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    #cheby recursion and rest of terms in dot product, all at once
    for k = 3:N
        #next value in recursion
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        #next term in dot product
        @inbounds y += Tₖ*a[k]
        #swaps
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

c = rand(200)
@btime cheb1($c, 0.3);
@btime cheb2($c, 0.3);

gives

  657.621 ns (0 allocations: 0 bytes)
  490.366 ns (0 allocations: 0 bytes)

@stevengj (Member) commented Sep 10, 2021

I wonder if it's just due to the cache effects of iterating over the coefficient array in reverse order (my code) vs forward order (your code). If so, I could fix it just by calling reverse! on my coefficients after they are constructed and then changing my loops accordingly, or alternatively by using a slightly different recurrence more similar to yours.

(My code uses the Clenshaw recurrence, which is analogous to Horner's method — evaluating the polynomial from right to left. Your code evaluates from left to right, computing the Chebyshev polynomial values by their recurrence, which is analogous to evaluating 1,x,x²,x³,… iteratively from left to right and multiplying by the coefficients as you go. As far as I know, both methods are backward stable; they seem to give similar accuracy in practice.)
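For an ordinary polynomial c₁ + c₂x + c₃x² + … the two evaluation orders look like this (a minimal sketch with made-up names, just to illustrate the analogy, not code from either package):

# Right-to-left (Horner): one multiply and one add per coefficient.
function horner(c, x)
    y = c[end]
    for j in length(c)-1:-1:1
        y = y*x + c[j]   # fold the next coefficient into the running value
    end
    return y
end

# Left-to-right: build the powers 1, x, x², … as you go, which costs an
# extra multiply per term (one for the power, one for the dot product).
function lefttoright(c, x)
    xᵏ = one(x)
    y = c[1]
    for k in 2:length(c)
        xᵏ *= x          # next power of x
        y += c[k]*xᵏ     # next term of the dot product
    end
    return y
end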

@stevengj (Member) commented Sep 10, 2021

No, simply reversing the order of iteration (by reversing the coefficient order) in my code speeds things up by 5–10%, but not enough to catch up:

function cheb1r(c, xd)
    n = length(c)
    cₙ = c[n]
    if n ≤ 2
        n == 1 && return cₙ + one(xd) * zero(cₙ)
        return cₙ + xd*c[1]
    end
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return cₙ + xd*bₖ - bₖ₊₁
end

@btime cheb1r($(reverse(c)), 0.3);

gives

608.500 ns (0 allocations: 0 bytes)

for the c = rand(200) as above.

@stevengj (Member) commented

In fact, I just noticed that your inner loop requires more floating-point operations (one more multiply) than mine!

(This is why people use Horner's method and Clenshaw recurrences — evaluating from right to left requires fewer multiplications.)
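To make the count explicit (with the constant x2 = 2x hoisted out of the loop): the Clenshaw update bⱼ = c[j] + x2*bₖ - bₖ₊₁ is one multiply and two add/subtracts per coefficient, while the forward loop needs two multiplies and two add/subtracts, one pair for the recurrence Tₖ = x2*Tₖ₋₁ - Tₖ₋₂ and one for the dot-product term y += Tₖ*a[k].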

@markmbaum (Author) commented

I see, I see... I'm getting roughly the same timing results though. About 540 ns for cheb1r and 423 ns for cheb2.

@stevengj (Member) commented Sep 10, 2021

I think I found the issue: the compiler is apparently not finding a multiply-add opportunity in my code, but maybe found it in yours. This can be fixed by explicitly calling muladd (https://discourse.julialang.org/t/two-nearly-identical-loops-why-is-one-faster/67998).
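A sketch of what that change might look like, applied to the cheb1r loop above (cheb1r_muladd is a made-up name; muladd(a, b, c) computes a*b + c and lets the compiler emit a fused multiply-add where the hardware supports it):

function cheb1r_muladd(c, xd)
    n = length(c)
    cₙ = c[n]
    if n ≤ 2
        n == 1 && return cₙ + one(xd) * zero(cₙ)
        return muladd(xd, c[1], cₙ)
    end
    x2 = 2xd   # hoist the constant factor out of the loop
    @inbounds bₖ = muladd(x2, c[1], c[2])
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = muladd(x2, bₖ, c[j] - bₖ₊₁)  # fma-able Clenshaw step
        bₖ, bₖ₊₁ = bⱼ, bₖ
    end
    return muladd(xd, bₖ, cₙ) - bₖ₊₁
end

Benchmarking it the same way as above (@btime cheb1r_muladd($(reverse(c)), 0.3)) should show whether the missed multiply-add was the only thing holding cheb1r back.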

@markmbaum (Author) commented Sep 10, 2021

Weird. I was just doing some profiling and found that if you remove the c[j] + part of your loop, the function becomes about 100 ns faster than cheb2. Obviously, you can't do that in practice, but the only difference in profiles was the extra block for + operations in cheb1r. Seems like muladd will fix that.

@markmbaum (Author) commented Sep 10, 2021

Oh wow, using muladd and the Clenshaw recurrence is actually a lot faster. I'm seeing that cheb1r is now about 4x faster than cheb2. I never would have noticed that.

@stevengj (Member) commented Sep 10, 2021

Yes, with the latest commit (3340207) I'm now getting comparable speed for 1d interpolation:
[benchmark plot: one dimension]

though it's still a bit slower for 2d with low orders, maybe due to the overhead of recursion?
[benchmark plot: two dimensions]
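For reference, the shape of that per-dimension recursion in 2d, using the cheb2 routine from above as the 1-d kernel (an illustrative sketch, not FastChebInterp's actual code): collapse the coefficient matrix along one dimension with a 1-d evaluation per column, then evaluate the resulting 1-d series in the other variable.

# Illustrative 2-d evaluation by nested 1-d recurrences; assumes at least
# two coefficients in each dimension. The temporary vector g (plus the
# call overhead per dimension) is the kind of cost a low-order benchmark
# would feel most.
function cheb2d(C::AbstractMatrix, x, y)
    g = [cheb2(view(C, :, j), x) for j in 1:size(C, 2)]  # collapse the x dimension
    return cheb2(g, y)                                   # then evaluate in y
end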

@markmbaum (Author) commented

I'm not sure exactly what's going on in the 2D case. Here the implementations do actually diverge a fair amount. My evaluation function isn't generalized for any number of dimensions and I tried to optimize it for the specific case. When I was first fiddling with it, I realized that most of the work can be reduced to a matrix multiplication, which was noticeably faster than other approaches.
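A sketch of that matrix-multiplication idea (hypothetical code, not the actual BichebyshevInterpolator internals): fill vectors Tx and Ty with the Chebyshev polynomial values at the evaluation point, and the interpolant is the bilinear form Tx' * C * Ty, so most of the work is a matrix-vector product that BLAS handles well.

# Fill T with T₀(ξ), T₁(ξ), … via the three-term recurrence
# (assumes length(T) ≥ 2).
function chebT!(T::AbstractVector, ξ)
    T[1] = one(ξ)
    T[2] = ξ
    for k in 3:length(T)
        T[k] = 2ξ*T[k-1] - T[k-2]
    end
    return T
end

# Evaluate the bivariate series with coefficient matrix C at (x, y):
# one matrix-vector product plus one dot product. Passing preallocated
# Tx and Ty (the defaults here allocate fresh ones) avoids per-call allocation.
function cheb2d_mat(C::AbstractMatrix, x, y,
                    Tx=zeros(size(C,1)), Ty=zeros(size(C,2)))
    chebT!(Tx, x)
    chebT!(Ty, y)
    return Tx' * (C * Ty)
end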
