
The natural way of iterating backwards (with stepsize negative one) has bad performance #26770

Closed · KristofferC opened this issue Apr 10, 2018 · 13 comments
Labels: performance (Must go faster)

@KristofferC (Member) commented Apr 10, 2018

# Backwards
function f(x)
    s = 0.0
    @inbounds for j in size(x, 2):-1:1
        for i in size(x, 1):-1:1
            s += x[i, j]
        end
    end
    s
end

# Forwards
function f2(x)
    s = 0.0
    @inbounds for j in 1:size(x, 2)
        for i in 1:size(x, 1)
            s += x[i, j]
        end
    end
    s
end

# Backwards but using Unitranges
function f3(x)
    s = 0.0
    @inbounds for jj in 1:size(x, 2)
        j = size(x, 2) - jj + 1
        for ii in 1:size(x, 1)
            i = size(x, 1) - ii + 1
            s += x[i, j]
        end
    end
    s
end
julia> x = rand(4, 4);

julia> @btime f($x)
  46.343 ns (0 allocations: 0 bytes)
6.9228344175154675

julia> @btime f2($x)
  10.674 ns (0 allocations: 0 bytes)
6.922834417515467

julia> @btime f3($x)
  11.236 ns (0 allocations: 0 bytes)
6.9228344175154675

It would be nice if the natural way of iterating backwards were as performant as doing it the ugly way.

@KristofferC KristofferC added the performance Must go faster label Apr 10, 2018
@vtjnash (Member) commented Apr 10, 2018

I think we'll be able to fix this after #25261

@KristofferC (Member, Author)

Looking forward to it!

@thchr (Contributor) commented Apr 11, 2018

Not sure if relevant, but the performance penalty in the example above seems to arise only for small matrices x (tested with @belapsed):

[plot: compare-iteration-times — iteration time vs. matrix size]

Similar behavior is seen for a single loop (e.g. for vectors instead of matrices).
(... tested on v0.6.0, unfortunately)

@KristofferC (Member, Author)

Yes, the overhead comes from creating the StepRange.
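For context, a minimal sketch of the rounding a StepRange constructor has to do (this mirrors the semantics, not Base's actual code; `last_reached` is a hypothetical helper): the stored stop must be the last element actually reached, which requires a remainder computation at construction time.

```julia
# Hypothetical helper mirroring what a StepRange constructor must
# compute: round `stop` down/up to the last element actually reached.
# The `rem` call is the per-construction cost in question.
function last_reached(start::Int, step::Int, stop::Int)
    if step > 0 && stop >= start
        stop - rem(stop - start, step)
    elseif step < 0 && stop <= start
        stop + rem(start - stop, -step)
    else
        start - step  # empty range: a stop just "before" start
    end
end

last_reached(1, 2, 8)   # 7, matching last(1:2:8)
last_reached(8, -3, 1)  # 2, matching last(8:-3:1)
```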

@KristofferC (Member, Author)

So this can be fixed now? Ref #26770 (comment)

@KristofferC (Member, Author) commented Oct 20, 2018

Started writing an issue but realized it was a dup of this one. Here it is:

The following two functions should do the same thing:

function copy_second(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in 1:2:size(y, 1)
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

function copy_second_while(x, y)
    @inbounds for j in 1:size(y, 2)
        i = 1; k = 1
        while i <= size(y, 1)
            x[k, j] = y[i, j]
            i += 2
            k += 1
        end
    end
end

Yet:

y = rand(8, 10^5)
x = rand(4, 10^5)

using BenchmarkTools

@btime copy_second(x, y)
  # 1.157 ms (0 allocations: 0 bytes)
@btime copy_second_while(x, y)
  # 356.281 μs (0 allocations: 0 bytes)

The first function has calls to

  %25 = call i64 @julia_steprange_last_198121743(i64 1, i64 2, i64 %15)

which seems intentionally outlined so that the constructor itself can be inlined. It would be nice to solve this.

@tkf (Member) commented Dec 4, 2018

Can we make the creation of StepRange less computationally intensive, and in particular avoid the remainder calculation? For example, you can easily wrap the while loop in a StepRange-like interface:

struct PositiveStepRange{T,S}
    start::T
    step::S
    stop::T
end

Base.length(r::PositiveStepRange) = length(StepRange(r.start, r.step, r.stop))

@inline function Base.iterate(r::PositiveStepRange)
    r.start <= r.stop || return nothing
    return (r.start, r.start)
end

@inline function Base.iterate(r::PositiveStepRange, i)
    next = i + r.step
    next <= r.stop || return nothing
    return (next, next)
end

using Test
@testset for i in 1:10
    @test collect(PositiveStepRange(1, i, 5)) == 1:i:5
end

function copy_second(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in 1:2:size(y, 1)
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

function copy_second_while(x, y)
    @inbounds for j in 1:size(y, 2)
        i = 1; k = 1
        while i <= size(y, 1)
            x[k, j] = y[i, j]
            i += 2
            k += 1
        end
    end
end

function copy_second_pos(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in PositiveStepRange(1, 2, size(y, 1))
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

using BenchmarkTools

y = rand(8, 10^5)
x = rand(4, 10^5)

@btime copy_second(x, y)
# 755.079 μs (0 allocations: 0 bytes)
@btime copy_second_while(x, y)
# 362.013 μs (0 allocations: 0 bytes)
@btime copy_second_pos(x, y)
# 358.420 μs (0 allocations: 0 bytes)

Of course, this does not support a negative step. I wonder if it can be made to work for arbitrary steps without sacrificing long-loop performance.

@KristofferC (Member, Author)

Is the crux that we want length to be O(1)?

@timholy (Member) commented Dec 4, 2018

Something like

julia> struct FastStepRange{T,S} <: OrdinalRange{T,S}
           start::T
           step::S
           stop::T
       end

julia> @inline function Base.iterate(r::FastStepRange)
           (r.stop - r.start)*r.step >= 0 || return nothing
           return (r.start, r.start)
       end

julia> @inline function Base.iterate(r::FastStepRange, i)
           next = i + r.step
           (r.stop - next)*r.step >= 0 || return nothing
           return (next, next)
       end

might work. Math is typically faster than branching, though of course branch prediction can change that.

The hard part is the set of "corner cases": what do you do with Unsigned? What about integer overflow? You might want to evaluate things like (r.stop > next) ⊻ (r.step >= 0), which may be more robust.
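To make the overflow concern concrete, here is a small check using Int8 so the wraparound is easy to trigger (`in_range_mul` is a hypothetical name for the subtract-and-multiply membership test from the iterate sketch above):

```julia
# The subtract-and-multiply test: "next has not passed stop" iff
# (stop - next) has the same sign as step.
in_range_mul(next, step, stop) = (stop - next) * step >= 0

# With Int8, stop - next wraps: 100 - (-100) = 200 ≡ -56 (mod 256),
# so the product goes negative and the test wrongly reports "done"
# even though -100 is the first element of -100:1:100.
in_range_mul(Int8(-100), Int8(1), Int8(100))  # false, but should be true
```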

@tkf (Member) commented Dec 5, 2018

@KristofferC I'm not sure I understand your comment. length can be as efficient as the current implementation, but we don't need to call it in iterate.

@timholy How about (next - r.stop) * sign(r.step) > 0? I guess it works unless next - r.stop is typemin?

struct FastStepRange{T,S}  # <: OrdinalRange{T,S}
    start::T
    step::S
    stop::T
end

Base.length(r::FastStepRange) = length(StepRange(r.start, r.step, r.stop))

@inline function Base.iterate(r::FastStepRange, i = nothing)
    next = i === nothing ? r.start : i + r.step
    (next - r.stop) * sign(r.step) > 0 && return nothing
    return (next, next)
end

using Test
@testset for i in 1:10
    @test collect(FastStepRange(1, i, 5)) == 1:i:5
    @test collect(FastStepRange(1, -i, 5)) == 1:-i:5
    @test collect(FastStepRange(-1, -i, -5)) == -1:-i:-5
end

function copy_second(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in 1:2:size(y, 1)
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

function copy_second_while(x, y)
    @inbounds for j in 1:size(y, 2)
        i = 1; k = 1
        while i <= size(y, 1)
            x[k, j] = y[i, j]
        i += 2
        k += 1
        end
    end
end

function copy_second_fast(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in FastStepRange(1, 2, size(y, 1))
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

@inline function sum_iter(itr)
    acc = 0
    for i in itr
        acc += i
    end
    return acc
end

@inline sum_fast(n = 10) = sum_iter(FastStepRange(1, 1, n))

using BenchmarkTools

y = rand(8, 10^5)
x = rand(4, 10^5)

@btime copy_second(x, y)
  # 746.105 μs (0 allocations: 0 bytes)
@btime copy_second_while(x, y)
  # 361.389 μs (0 allocations: 0 bytes)
@btime copy_second_fast(x, y)
  # 392.118 μs (0 allocations: 0 bytes)

@btime sum_iter($(1:1:2^15))
  # 8.230 μs (0 allocations: 0 bytes)
@btime sum_iter($(FastStepRange(1, 1, 2^15)))
  # 8.234 μs (0 allocations: 0 bytes)
@btime sum_fast($(2^15))
  # 1.964 ns (0 allocations: 0 bytes)
@btime sum_iter($(1:2^15))
  # 2.213 ns (0 allocations: 0 bytes)

Julia's compiler is really great: it looks like it can eliminate the ... * sign(r.step) part when the step is a literal, making the loop as fast as over a UnitRange (see sum_fast($(2^15)) vs sum_iter($(1:2^15))).

@tkf (Member) commented Dec 5, 2018

Actually, my earlier claim that "length can be as efficient as current implementation" was wrong: length (and last etc.) will be slower, since they have to calculate the remainder (though still in O(1)). But if you only use the range for a loop, you don't pay that cost.
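For comparison, that O(1) length is just one integer division; a sketch assuming a nonempty range with positive step (`range_length` is a hypothetical helper; Base additionally handles signs, emptiness, and overflow):

```julia
# O(1) length of start:step:stop, assuming step > 0 and stop >= start.
# The integer division here is the remainder-style cost in question.
range_length(start, step, stop) = div(stop - start, step) + 1

range_length(1, 2, 8)  # 4, matching length(1:2:8)
```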

@KristofferC (Member, Author)

Yeah, sorry, I used the wrong term with O(1).

@laborg (Contributor) commented Oct 18, 2023

@KristofferC I think this can be closed:

julia> @btime f($x)
  28.446 ns (0 allocations: 0 bytes)
6.53950829267313

julia> @btime f2($x)
  34.791 ns (0 allocations: 0 bytes)
6.539508292673132

julia> @btime f3($x)
  35.128 ns (0 allocations: 0 bytes)
6.53950829267313
julia> @btime copy_second(x, y)  # was 1.157 ms
  525.732 μs (0 allocations: 0 bytes)

julia> @btime copy_second_while(x, y)  # was 356.281 μs
  539.890 μs (0 allocations: 0 bytes)

@vtjnash vtjnash closed this as completed Oct 18, 2023