
Could @inbounds communicate its (removed) invariants to LLVM? #39340

Open · yurivish opened this issue Jan 20, 2021 · 8 comments
Labels: arrays, performance, speculative

Comments

yurivish (Contributor) commented Jan 20, 2021

@mcabbott and I were investigating the performance effect of @inbounds annotations inside eachcol and found some puzzling behavior with a simple test function.

Using the following definitions

julia> myeachcol(A::AbstractVecOrMat) = ((@inbounds view(A, :, i)) for i in axes(A, 2));

julia> test(xs, eachcol) = sum(sum(x) for x in eachcol(xs));

julia> xs = rand(2, 10^7);

I measured the performance of test with eachcol and myeachcol, using both @time and @btime:

julia> using BenchmarkTools

julia> test(xs, eachcol); test(xs, myeachcol); # warmup

julia> @time test(xs, eachcol);
  0.035938 seconds (5 allocations: 112 bytes)

julia> @time test(xs, myeachcol);
  0.045531 seconds (5 allocations: 112 bytes)

julia> @btime test(xs, eachcol);
  26.292 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  32.997 ms (5 allocations: 112 bytes)

To my surprise, the function with the @inbounds annotation is slower than the Base function without it.

The generated assembly (output of @code_native) is the same on my Mac for the two functions, modulo line-number comments [edit: this part is not relevant, see @kimikage's comment below; the actual processing is done in mapfoldl_impl]:

[screenshot of identical @code_native output for the two functions omitted]

Here's a more complete set of statistics reported by @benchmark:

julia> @benchmark test(xs, eachcol)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  5
  --------------
  minimum time:     26.113 ms (0.00% GC)
  median time:      31.321 ms (0.00% GC)
  mean time:        29.851 ms (0.00% GC)
  maximum time:     33.744 ms (0.00% GC)
  --------------
  samples:          168
  evals/sample:     1

julia> @benchmark test(xs, myeachcol)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  5
  --------------
  minimum time:     33.330 ms (0.00% GC)
  median time:      33.932 ms (0.00% GC)
  mean time:        34.429 ms (0.00% GC)
  maximum time:     42.041 ms (0.00% GC)
  --------------
  samples:          146
  evals/sample:     1

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

The Base eachcol definition in Julia 1.5.3 in abstractarraymath.jl:479 is:

eachcol(A::AbstractVecOrMat) = (view(A, :, i) for i in axes(A, 2))

yurivish changed the title from "Performance difference with identical generated code" to "Performance difference with identical native code" on Jan 20, 2021

kimikage (Contributor) commented:

Since (my)eachcol returns a generator, the actual processing happens in mapfoldl_impl.
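
For reference, a quick way to confirm this from the REPL (a sketch; the internal method names and file locations are version-dependent):

# Sketch: confirm that summing a generator goes through Base's generic
# reduction machinery rather than anything defined by (my)eachcol itself.
using InteractiveUtils   # provides @which (already loaded in the REPL)

gen = (x for x in 1:3)
@which sum(gen)          # points at Base's generic sum method in reduce.jl
# From there the call chain is roughly sum -> mapreduce -> mapfoldl ->
# Base.mapfoldl_impl (internal; exact names vary across Julia versions).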

BTW, interestingly, my Windows machine gave a different result.

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> @btime test(xs, eachcol);
  33.650 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  30.425 ms (5 allocations: 112 bytes)

yurivish (Contributor, Author) commented Jan 21, 2021

> Since (my)eachcol returns a generator, the actual processing happens in mapfoldl_impl.

Doh, I should have seen that in the generated code. I'll rename the issue to be about the @inbounds performance difference.

> BTW, interestingly, my Windows machine gave a different result.

Interesting! As an additional data point, I just tested on a mid-2014 MacBook running macOS and see the same performance split, with myeachcol slower than eachcol.

yurivish changed the title from "Performance difference with identical native code" to "Code with @inbounds can be slower than code without it" on Jan 21, 2021

kimikage (Contributor) commented:

Debian on WSL2 on the same Windows machine mentioned above showed the same trend as your MacBook.

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> @btime test(xs, eachcol);
  28.689 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  36.112 ms (5 allocations: 112 bytes)

maleadt (Member) commented Jan 21, 2021

#26261

yurivish (Contributor, Author) commented Apr 7, 2021

Why was this issue closed?

I re-ran the measurements from the original post and the results still hold on Julia 1.6 on macOS:

julia> myeachcol(A::AbstractVecOrMat) = ((@inbounds view(A, :, i)) for i in axes(A, 2));

julia> test(xs, eachcol) = sum(sum(x) for x in eachcol(xs));

julia> xs = rand(2, 10^7);

julia> using BenchmarkTools

julia> test(xs, eachcol); test(xs, myeachcol); # warmup

julia>

julia> @time test(xs, eachcol);
  0.036498 seconds (5 allocations: 112 bytes)

julia> @time test(xs, myeachcol);
  0.039457 seconds (5 allocations: 112 bytes)

julia> @btime test(xs, eachcol);
  27.893 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  32.159 ms (5 allocations: 112 bytes)

mbauman (Member) commented Apr 7, 2021

So, you're effectively creating a nested for loop:

result = 0.0
for i in axes(A, 2)
    v = #= maybe @inbounds =# view(A, :, i)
    inner_result = 0.0
    for j in eachindex(v)
       inner_result += v[j]
    end
    result += inner_result
end

It's not terribly surprising that the bounds check performed when constructing the view in the outer loop is what lets Julia prove it's safe to remove the bounds checks in the inner one, so @inbounds on the view discards exactly that information. Indeed:

julia> xs = rand(2, 10^7);

julia> function f(A)
           result = 0.0
           for i in axes(A, 2)
               v = view(A, :, i)
               inner_result = 0.0
               for j in eachindex(v)
                   inner_result += v[j]
               end
               result += inner_result
           end
           return result
       end
f (generic function with 2 methods)

julia> function g(A)
           result = 0.0
           for i in axes(A, 2)
               v = @inbounds view(A, :, i)
               inner_result = 0.0
               for j in eachindex(v)
                   inner_result += v[j]
               end
               result += inner_result
           end
           return result
       end
g (generic function with 2 methods)

julia> @btime f($xs)
  13.590 ms (0 allocations: 0 bytes)
9.999553205385104e6

julia> @btime g($xs)
  18.410 ms (0 allocations: 0 bytes)
9.999553205385104e6
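
One way to see where the difference comes from is to compare the generated code for the two variants (a sketch; the exact IR depends on the Julia and LLVM versions and the host CPU):

# Sketch: dump the LLVM IR for both loops and compare how the inner
# indexing is lowered (e.g. whether the inner loop vectorizes).
using InteractiveUtils

@code_llvm debuginfo=:none f(xs)
@code_llvm debuginfo=:none g(xs)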

yurivish (Contributor, Author) commented Apr 7, 2021

Interesting, thanks for the explanation. It makes sense to me now that @inbounds can decrease performance, but I suspect this will surprise many users of @inbounds, since the intuition is that "giving the compiler more information" should never make things worse.

This issue was opened because someone noticed the slowdown and was surprised, and #26261 contains this quote from someone else who was also surprised:

> [...] decided I would try and sprinkle some of the new Julia magic speed dust on it such as @inbounds, and was surprised to find that it actually made the whole thing slower:

Should this issue stay open to track documentation of the existing behavior?
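
For readers looking for the practical takeaway, a minimal sketch (a hypothetical variant of @mbauman's example above, not an official recommendation): keep the bounds check on the view so the compiler retains the invariant, and apply @inbounds only to the inner access, which eachindex(v) already guarantees is in bounds.

function h(A)
    result = 0.0
    for i in axes(A, 2)
        v = view(A, :, i)                   # bounds check kept on purpose
        inner_result = 0.0
        for j in eachindex(v)
            @inbounds inner_result += v[j]  # safe: j comes from eachindex(v)
        end
        result += inner_result
    end
    return result
end

Whether this is any faster than the un-annotated f above depends on the Julia version and the array type; the point is only that it is the removal of the outer check, not @inbounds itself, that discards the invariant.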

mbauman changed the title from "Code with @inbounds can be slower than code without it" to "Could @inbounds communicate its (removed) invariants to LLVM?" on Apr 7, 2021
mbauman added the speculative label on Apr 7, 2021
mbauman reopened this on Apr 7, 2021

StefanKarpinski (Member) commented:

It seemed to me that @maleadt's comment indicated that this issue was fixed, so I closed it.
