
The natural way of iterating backwards (with stepsize negative one) has bad performance #26770

Closed · KristofferC opened this issue Apr 10, 2018 · 13 comments
Labels: performance (Must go faster)

@KristofferC (Member) commented Apr 10, 2018

# Backwards
function f(x)
    s = 0.0
    @inbounds for j in size(x, 2):-1:1
        for i in size(x, 1):-1:1
            s += x[i, j]
        end
    end
    s
end

# Forwards
function f2(x)
    s = 0.0
    @inbounds for j in 1:size(x, 2)
        for i in 1:size(x, 1)
            s += x[i, j]
        end
    end
    s
end

# Backwards but using Unitranges
function f3(x)
    s = 0.0
    @inbounds for jj in 1:size(x, 2)
        j = size(x, 2) - jj + 1
        for ii in 1:size(x, 1)
            i = size(x, 1) - ii + 1
            s += x[i, j]
        end
    end
    s
end
julia> x = rand(4, 4);

julia> @btime f($x)
  46.343 ns (0 allocations: 0 bytes)
6.9228344175154675

julia> @btime f2($x)
  10.674 ns (0 allocations: 0 bytes)
6.922834417515467

julia> @btime f3($x)
  11.236 ns (0 allocations: 0 bytes)
6.9228344175154675

It would be nice if the natural way of iterating backwards were as performant as doing it the ugly way.

@KristofferC KristofferC added the performance Must go faster label Apr 10, 2018
@vtjnash (Member) commented Apr 10, 2018

I think we'll be able to fix this after #25261

@KristofferC (Member, Author)

Looking forward to it!

@thchr (Contributor) commented Apr 11, 2018

Not sure if relevant, but the performance penalty in the example above seems to arise only for small matrices x (tested with @belapsed):

[plot: compare-iteration-times — iteration time vs. matrix size]

Similar behavior is seen for a single loop (e.g. for vectors instead of matrices).
(... tested on v0.6.0, unfortunately)

@KristofferC (Member, Author)

Yes, the overhead comes from creating the StepRange.
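For context, a minimal sketch of the rounding a StepRange constructor has to do (this mirrors the semantics, not Base's actual code; `last_reached` is a hypothetical helper): the stored stop must be the last element actually reached, which requires a remainder computation at construction time.

```julia
# Hypothetical helper mirroring what a StepRange constructor must
# compute: round `stop` down/up to the last element actually reached.
# The `rem` call is the per-construction cost in question.
function last_reached(start::Int, step::Int, stop::Int)
    if step > 0 && stop >= start
        stop - rem(stop - start, step)
    elseif step < 0 && stop <= start
        stop + rem(start - stop, -step)
    else
        start - step  # empty range: a stop just "before" start
    end
end

last_reached(1, 2, 8)   # 7, matching last(1:2:8)
last_reached(8, -3, 1)  # 2, matching last(8:-3:1)
```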

@KristofferC (Member, Author)

So this can be fixed now? Ref #26770 (comment)

@KristofferC (Member, Author) commented Oct 20, 2018

Started writing an issue but realized it was a dup of this one. Here it is:

The following two functions should do the same thing:

function copy_second(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in 1:2:size(y, 1)
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

function copy_second_while(x, y)
    @inbounds for j in 1:size(y, 2)
        i = 1; k = 1
        while i <= size(y, 1)
            x[k, j] = y[i, j]
            i += 2
            k += 1
        end
    end
end

Yet:

y = rand(8, 10^5)
x = rand(4, 10^5)

using BenchmarkTools

@btime copy_second(x, y)
  # 1.157 ms (0 allocations: 0 bytes)
@btime copy_second_while(x, y)
  # 356.281 μs (0 allocations: 0 bytes)

The first function has calls to

  %25 = call i64 @julia_steprange_last_198121743(i64 1, i64 2, i64 %15)

which seems intentionally outlined so that the constructor itself can be inlined. It would be nice to solve this.

@tkf (Member) commented Dec 4, 2018

Can we make the creation of StepRange less computationally intensive, and in particular avoid the remainder calculation? For example, you can easily wrap the while loop in a StepRange-like interface:

struct PositiveStepRange{T,S}
    start::T
    step::S
    stop::T
end

Base.length(r::PositiveStepRange) = length(StepRange(r.start, r.step, r.stop))

@inline function Base.iterate(r::PositiveStepRange)
    r.start <= r.stop || return nothing
    return (r.start, r.start)
end

@inline function Base.iterate(r::PositiveStepRange, i)
    next = i + r.step
    next <= r.stop || return nothing
    return (next, next)
end

using Test
@testset for i in 1:10
    @test collect(PositiveStepRange(1, i, 5)) == 1:i:5
end

function copy_second(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in 1:2:size(y, 1)
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

function copy_second_while(x, y)
    @inbounds for j in 1:size(y, 2)
        i = 1; k = 1
        while i <= size(y, 1)
            x[k, j] = y[i, j]
            i += 2
            k += 1
        end
    end
end

function copy_second_pos(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in PositiveStepRange(1, 2, size(y, 1))
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

using BenchmarkTools

y = rand(8, 10^5)
x = rand(4, 10^5)

@btime copy_second(x, y)
# 755.079 μs (0 allocations: 0 bytes)
@btime copy_second_while(x, y)
# 362.013 μs (0 allocations: 0 bytes)
@btime copy_second_pos(x, y)
# 358.420 μs (0 allocations: 0 bytes)

Of course, this does not support a negative step. I wonder if it can be made to work for arbitrary steps without sacrificing long-loop performance.

@KristofferC (Member, Author)

Is the crux that we want length to be O(1)?

@timholy (Member) commented Dec 4, 2018

Something like

julia> struct FastStepRange{T,S} <: OrdinalRange{T,S}
           start::T
           step::S
           stop::T
       end

julia> @inline function Base.iterate(r::FastStepRange)
           (r.stop - r.start)*r.step >= 0 || return nothing
           return (r.start, r.start)
       end

julia> @inline function Base.iterate(r::FastStepRange, i)
           next = i + r.step
           (r.stop - next)*r.step >= 0 || return nothing
           return (next, next)
       end

might work. Math is typically faster than branching, though of course branch prediction can change that.

The hard part is the set of "corner cases": what do you do with Unsigned? What about integer overflow? You might want to evaluate things like (r.stop > next) ⊻ (r.step >= 0), which may be more robust.
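To make the overflow concern concrete, here is a small check using Int8 so the wraparound is easy to trigger (`in_range_mul` is a hypothetical name for the subtract-and-multiply membership test from the iterate sketch above):

```julia
# The subtract-and-multiply test: "next has not passed stop" iff
# (stop - next) has the same sign as step.
in_range_mul(next, step, stop) = (stop - next) * step >= 0

# With Int8, stop - next wraps: 100 - (-100) = 200 ≡ -56 (mod 256),
# so the product goes negative and the test wrongly reports "done"
# even though -100 is the first element of -100:1:100.
in_range_mul(Int8(-100), Int8(1), Int8(100))  # false, but should be true
```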

@tkf (Member) commented Dec 5, 2018

@KristofferC I'm not sure I understand your comment. length can be as efficient as the current implementation, but we don't need to call it in iterate.

@timholy How about (next - r.stop) * sign(r.step) > 0? I guess it works unless next - r.stop is typemin?

struct FastStepRange{T,S}  # <: OrdinalRange{T,S}
    start::T
    step::S
    stop::T
end

Base.length(r::FastStepRange) = length(StepRange(r.start, r.step, r.stop))

@inline function Base.iterate(r::FastStepRange, i = nothing)
    next = i === nothing ? r.start : i + r.step
    (next - r.stop) * sign(r.step) > 0 && return nothing
    return (next, next)
end

using Test
@testset for i in 1:10
    @test collect(FastStepRange(1, i, 5)) == 1:i:5
    @test collect(FastStepRange(1, -i, 5)) == 1:-i:5
    @test collect(FastStepRange(-1, -i, -5)) == -1:-i:-5
end

function copy_second(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in 1:2:size(y, 1)
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

function copy_second_while(x, y)
    @inbounds for j in 1:size(y, 2)
        i = 1; k = 1
        while i <= size(y, 1)
            x[k, j] = y[i, j]
        i += 2
        k += 1
        end
    end
end

function copy_second_fast(x, y)
    @inbounds for j in 1:size(y, 2)
        k = 1
        for i in FastStepRange(1, 2, size(y, 1))
            x[k, j] = y[i, j]
            k += 1
        end
    end
end

@inline function sum_iter(itr)
    acc = 0
    for i in itr
        acc += i
    end
    return acc
end

@inline sum_fast(n = 10) = sum_iter(FastStepRange(1, 1, n))

using BenchmarkTools

y = rand(8, 10^5)
x = rand(4, 10^5)

@btime copy_second(x, y)
  # 746.105 μs (0 allocations: 0 bytes)
@btime copy_second_while(x, y)
  # 361.389 μs (0 allocations: 0 bytes)
@btime copy_second_fast(x, y)
  # 392.118 μs (0 allocations: 0 bytes)

@btime sum_iter($(1:1:2^15))
  # 8.230 μs (0 allocations: 0 bytes)
@btime sum_iter($(FastStepRange(1, 1, 2^15)))
  # 8.234 μs (0 allocations: 0 bytes)
@btime sum_fast($(2^15))
  # 1.964 ns (0 allocations: 0 bytes)
@btime sum_iter($(1:2^15))
  # 2.213 ns (0 allocations: 0 bytes)

Julia's compiler is really great: it looks like it can eliminate the ... * sign(r.step) part when the step is a literal, making the loop as fast as over a UnitRange (see sum_fast($(2^15)) vs sum_iter($(1:2^15))).

@tkf (Member) commented Dec 5, 2018

Actually, my earlier claim that "length can be as efficient as current implementation" was wrong: length (and last etc.) will be slower, since they have to calculate the remainder (though still in O(1)). But if you only use the range for a loop, you don't pay that cost.
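For comparison, that O(1) length is just one integer division; a sketch assuming a nonempty range with positive step (`range_length` is a hypothetical helper; Base additionally handles signs, emptiness, and overflow):

```julia
# O(1) length of start:step:stop, assuming step > 0 and stop >= start.
# The integer division here is the remainder-style cost in question.
range_length(start, step, stop) = div(stop - start, step) + 1

range_length(1, 2, 8)  # 4, matching length(1:2:8)
```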

@KristofferC (Member, Author)

Yeah, sorry, I used the wrong term with O(1).

@laborg (Contributor) commented Oct 18, 2023

@KristofferC I think this can be closed:

julia> @btime f($x)
  28.446 ns (0 allocations: 0 bytes)
6.53950829267313

julia> @btime f2($x)
  34.791 ns (0 allocations: 0 bytes)
6.539508292673132

julia> @btime f3($x)
  35.128 ns (0 allocations: 0 bytes)
6.53950829267313
julia> @btime copy_second(x, y)  # was 1.157 ms
  525.732 μs (0 allocations: 0 bytes)

julia> @btime copy_second_while(x, y)  # was 356.281 μs
  539.890 μs (0 allocations: 0 bytes)

@vtjnash vtjnash closed this as completed Oct 18, 2023