
Could @inbounds communicate its (removed) invariants to LLVM? #39340

Open · yurivish opened this issue Jan 20, 2021 · 8 comments
Labels: arrays, performance, speculative

Comments

yurivish (Contributor) commented Jan 20, 2021

@mcabbott and I were investigating the performance effect of @inbounds annotations inside eachcol and found some puzzling behavior with a simple test function.

Using the following definitions

julia> myeachcol(A::AbstractVecOrMat) = ((@inbounds view(A, :, i)) for i in axes(A, 2));

julia> test(xs, eachcol) = sum(sum(x) for x in eachcol(xs));

julia> xs = rand(2, 10^7);

I measured the performance of test with eachcol and myeachcol, using both @time and @btime:

julia> using BenchmarkTools

julia> test(xs, eachcol); test(xs, myeachcol); # warmup

julia> @time test(xs, eachcol);
  0.035938 seconds (5 allocations: 112 bytes)

julia> @time test(xs, myeachcol);
  0.045531 seconds (5 allocations: 112 bytes)

julia> @btime test(xs, eachcol);
  26.292 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  32.997 ms (5 allocations: 112 bytes)

To my surprise, the function with the @inbounds annotation is slower than the Base function without it.

The generated assembly (output of @code_native) is the same on my Mac for the two functions, modulo line-number comments [edit: this part is not relevant, see @kimikage's comment below; the actual processing is done in mapfoldl_impl]:

[screenshot of identical @code_native output for the two functions omitted]

Here's a more complete set of statistics reported by @benchmark:

julia> @benchmark test(xs, eachcol)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  5
  --------------
  minimum time:     26.113 ms (0.00% GC)
  median time:      31.321 ms (0.00% GC)
  mean time:        29.851 ms (0.00% GC)
  maximum time:     33.744 ms (0.00% GC)
  --------------
  samples:          168
  evals/sample:     1

julia> @benchmark test(xs, myeachcol)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  5
  --------------
  minimum time:     33.330 ms (0.00% GC)
  median time:      33.932 ms (0.00% GC)
  mean time:        34.429 ms (0.00% GC)
  maximum time:     42.041 ms (0.00% GC)
  --------------
  samples:          146
  evals/sample:     1

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

The Base eachcol definition in Julia 1.5.3 in abstractarraymath.jl:479 is:

eachcol(A::AbstractVecOrMat) = (view(A, :, i) for i in axes(A, 2))

yurivish changed the title from "Performance difference with identical generated code" to "Performance difference with identical native code" on Jan 20, 2021

kimikage (Contributor) commented:

Since (my)eachcol returns a generator, the actual processing happens in mapfoldl_impl.
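
For reference, a quick way to confirm this from the REPL (a sketch; the internal method names and file locations are version-dependent):

# Sketch: confirm that summing a generator goes through Base's generic
# reduction machinery rather than anything defined by (my)eachcol itself.
using InteractiveUtils   # provides @which (already loaded in the REPL)

gen = (x for x in 1:3)
@which sum(gen)          # points at Base's generic sum method in reduce.jl
# From there the call chain is roughly sum -> mapreduce -> mapfoldl ->
# Base.mapfoldl_impl (internal; exact names vary across Julia versions).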

BTW, interestingly, my Windows machine gave a different result.

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> @btime test(xs, eachcol);
  33.650 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  30.425 ms (5 allocations: 112 bytes)

yurivish (Contributor, Author) commented Jan 21, 2021

> Since (my)eachcol returns a generator, the actual processing happens in mapfoldl_impl.

Doh, I should have seen that in the generated code. I'll rename the issue to be about the @inbounds performance difference.

> BTW, interestingly, my Windows machine gave a different result.

Interesting! As an additional data point, I just tested on a mid-2014 MacBook running macOS and see the same performance split, with myeachcol slower than eachcol.

yurivish changed the title from "Performance difference with identical native code" to "Code with @inbounds can be slower than code without it" on Jan 21, 2021

kimikage (Contributor) commented:

Debian on WSL2 on the same Windows machine mentioned above showed the same trend as your MacBook.

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> @btime test(xs, eachcol);
  28.689 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  36.112 ms (5 allocations: 112 bytes)

maleadt (Member) commented Jan 21, 2021

#26261

yurivish (Contributor, Author) commented Apr 7, 2021

Why was this issue closed?

I re-ran the measurements from the original post and the results still hold on Julia 1.6 on macOS:

julia> myeachcol(A::AbstractVecOrMat) = ((@inbounds view(A, :, i)) for i in axes(A, 2));

julia> test(xs, eachcol) = sum(sum(x) for x in eachcol(xs));

julia> xs = rand(2, 10^7);

julia> using BenchmarkTools

julia> test(xs, eachcol); test(xs, myeachcol); # warmup

julia>

julia> @time test(xs, eachcol);
  0.036498 seconds (5 allocations: 112 bytes)

julia> @time test(xs, myeachcol);
  0.039457 seconds (5 allocations: 112 bytes)

julia> @btime test(xs, eachcol);
  27.893 ms (5 allocations: 112 bytes)

julia> @btime test(xs, myeachcol);
  32.159 ms (5 allocations: 112 bytes)

mbauman (Member) commented Apr 7, 2021

So, you're effectively creating a nested for loop:

result = 0.0
for i in axes(A, 2)
    v = #= maybe @inbounds =# view(A, :, i)
    inner_result = 0.0
    for j in eachindex(v)
       inner_result += v[j]
    end
    result += inner_result
end

It's not terribly surprising that the bounds check performed when constructing the view in the outer loop is what lets Julia prove it's safe to remove the bounds checks in the inner one, so @inbounds on the view discards exactly that information. Indeed:

julia> xs = rand(2, 10^7);

julia> function f(A)
           result = 0.0
           for i in axes(A, 2)
               v = view(A, :, i)
               inner_result = 0.0
               for j in eachindex(v)
                   inner_result += v[j]
               end
               result += inner_result
           end
           return result
       end
f (generic function with 2 methods)

julia> function g(A)
           result = 0.0
           for i in axes(A, 2)
               v = @inbounds view(A, :, i)
               inner_result = 0.0
               for j in eachindex(v)
                   inner_result += v[j]
               end
               result += inner_result
           end
           return result
       end
g (generic function with 2 methods)

julia> @btime f($xs)
  13.590 ms (0 allocations: 0 bytes)
9.999553205385104e6

julia> @btime g($xs)
  18.410 ms (0 allocations: 0 bytes)
9.999553205385104e6
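
One way to see where the difference comes from is to compare the generated code for the two variants (a sketch; the exact IR depends on the Julia and LLVM versions and the host CPU):

# Sketch: dump the LLVM IR for both loops and compare how the inner
# indexing is lowered (e.g. whether the inner loop vectorizes).
using InteractiveUtils

@code_llvm debuginfo=:none f(xs)
@code_llvm debuginfo=:none g(xs)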

yurivish (Contributor, Author) commented Apr 7, 2021

Interesting, thanks for the explanation. It makes sense to me now that @inbounds can decrease performance, but I suspect this will surprise many users of @inbounds, since the intuition is that "giving the compiler more information" should never make things worse.

This issue was opened because someone noticed the slowdown and was surprised, and #26261 contains this quote from someone else who was also surprised:

> [...] decided I would try and sprinkle some of the new Julia magic speed dust on it such as @inbounds, and was surprised to find that it actually made the whole thing slower:

Should this issue stay open to track documentation of the existing behavior?
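
For readers looking for the practical takeaway, a minimal sketch (a hypothetical variant of @mbauman's example above, not an official recommendation): keep the bounds check on the view so the compiler retains the invariant, and apply @inbounds only to the inner access, which eachindex(v) already guarantees is in bounds.

function h(A)
    result = 0.0
    for i in axes(A, 2)
        v = view(A, :, i)                   # bounds check kept on purpose
        inner_result = 0.0
        for j in eachindex(v)
            @inbounds inner_result += v[j]  # safe: j comes from eachindex(v)
        end
        result += inner_result
    end
    return result
end

Whether this is any faster than the un-annotated f above depends on the Julia version and the array type; the point is only that it is the removal of the outer check, not @inbounds itself, that discards the invariant.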

mbauman changed the title from "Code with @inbounds can be slower than code without it" to "Could @inbounds communicate its (removed) invariants to LLVM?" on Apr 7, 2021
mbauman added the speculative label on Apr 7, 2021
mbauman reopened this on Apr 7, 2021

StefanKarpinski (Member) commented:

It seemed to me that @maleadt's comment indicated that this issue was fixed, so I closed it.
