
RFC: macros for disabling bounds checks #3268

Merged: 7 commits merged into master from jb/inbounds on Jul 5, 2013
Conversation

JeffBezanson
Member

Provides @inbounds expr and @boundscheck true|false expr. The former is just a shorter form of the latter.

The gains here are not huge; the best case is generic matmul, which is 2x faster. The vectorize benchmark in perf2 is 10% faster. But there are more places in Base that could benefit from @inbounds.

`@inbounds expr` allows bounds checks to be omitted for code syntactically
inside the argument expression.

adds @inbounds to matmul, comprehensions, and vectorized binary operators.
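
For illustration, a minimal sketch of the intended usage (mysum is just an example, not code from this patch):

function mysum(a::Vector{Float64})
    s = 0.0
    # indexing inside the @inbounds expression may skip its bounds checks
    @inbounds for i in 1:length(a)
        s += a[i]
    end
    return s
end

# longer equivalent of the loop annotation above: @boundscheck false <expr>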
@ViralBShah
Member

A 2x gain is nothing to sneeze at!

@GlenHertz
Contributor

+1. But I think @noboundscheck is a clearer name, since I'm unfamiliar with the term "inbounds". Maybe it is just me.

@johnmyleswhite
Member

@inbounds is basically short for @assert_inbounds, which might help to convey the fact that this macro won't turn off bounds checking, but will simply tell the compiler that turning off bounds checking is allowed.
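
For example (a hypothetical snippet, only to illustrate the semantics): since @inbounds is a promise rather than a switch, an invalid index inside it may silently touch the wrong memory instead of raising a BoundsError.

function unsafe_first(a::Vector{Float64})
    # the programmer asserts index 1 is valid; if a is empty, the check may be skipped
    # and the behavior is undefined rather than an error
    @inbounds x = a[1]
    return x
end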

@ViralBShah
Member

Is this ready to be merged in? I am keen to implement this in many of the sparse kernels.

@JeffBezanson JeffBezanson mentioned this pull request Jun 18, 2013
65 tasks
@ViralBShah ViralBShah mentioned this pull request Jun 23, 2013
JeffBezanson added a commit that referenced this pull request Jul 5, 2013
RFC: macros for disabling bounds checks
@JeffBezanson JeffBezanson merged commit c575c41 into master Jul 5, 2013
@ViralBShah
Member

Could you add some documentation to the manual - or should we have a separate doc issue?

@lindahua
Contributor

lindahua commented Jul 5, 2013

Great!!

@lindahua
Contributor

lindahua commented Jul 5, 2013

I did a simple benchmark for this, as follows:

using NumericExtensions   # provides unsafe_view

# baseline: plain loop, bounds checked on every access
function add0!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    for i in 1 : length(x)
        z[i] = x[i] + y[i]
    end
end

# same loop, with bounds checks elided via @inbounds
function add1!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    for i in 1 : length(x)
        @inbounds z[i] = x[i] + y[i]
    end
end

# same loop over unsafe views: raw pointer indexing, no bounds checks
function add2!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    xv = unsafe_view(x)
    yv = unsafe_view(y)
    zv = unsafe_view(z)
    for i in 1 : length(x)
        zv[i] = xv[i] + yv[i]
    end
end

n = 10^3
x = rand(n)
y = rand(n)
z = zeros(n)

# warm-up calls, so JIT compilation is excluded from the timings
add0!(x, y, z)
add1!(x, y, z)
add2!(x, y, z)

@time for i in 1 : 10^4 add0!(x, y, z) end
@time for i in 1 : 10^4 add1!(x, y, z) end
@time for i in 1 : 10^4 add2!(x, y, z) end

Results:

elapsed time: 0.015487923 seconds   # without inbounds
elapsed time: 0.009834733 seconds   # use inbounds
elapsed time: 0.006274827 seconds   # use unsafe_view

@inbounds of course results in a considerable improvement (about 60% faster than without it). What I am still curious about is why there remains such a gap between @inbounds and unsafe_view (which simply does pointer arithmetic and an unsafe_load).
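
For context, unsafe_view returns a lightweight view object that indexes straight through the array's raw data pointer. A rough sketch of the idea (not NumericExtensions' actual implementation; the type and function names here are illustrative):

import Base.getindex
import Base.setindex!

immutable UnsafeVecSketch
    ptr::Ptr{Float64}   # raw data pointer; the caller must keep the parent array alive
end

unsafe_vec_sketch(a::Vector{Float64}) = UnsafeVecSketch(pointer(a))

# unchecked indexing: plain pointer arithmetic plus unsafe_load/unsafe_store!, no bounds check
getindex(v::UnsafeVecSketch, i::Int) = unsafe_load(v.ptr, i)
setindex!(v::UnsafeVecSketch, x::Float64, i::Int) = unsafe_store!(v.ptr, x, i)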

@lindahua
Contributor

lindahua commented Jul 5, 2013

Here is a comparison of the disassembled loop bodies (LLVM IR):

Using @inbounds:

  %"#s95.03" = phi i64 [ 1, %if.lr.ph ], [ %26, %if ]
  %13 = add i64 %"#s95.03", -1, !dbg !6370
  %14 = load %jl_value_t** %8, !dbg !6370
  %15 = bitcast %jl_value_t* %14 to double*, !dbg !6370
  %16 = getelementptr double* %15, i64 %13, !dbg !6370
  %17 = load double* %16, !dbg !6370
  %18 = load %jl_value_t** %10, !dbg !6370
  %19 = bitcast %jl_value_t* %18 to double*, !dbg !6370
  %20 = getelementptr double* %19, i64 %13, !dbg !6370
  %21 = load double* %20, !dbg !6370
  %22 = fadd double %17, %21, !dbg !6370
  %23 = load %jl_value_t** %12, !dbg !6370
  %24 = bitcast %jl_value_t* %23 to double*, !dbg !6370
  %25 = getelementptr double* %24, i64 %13, !dbg !6370
  store double %22, double* %25, !dbg !6370
  %26 = add i64 %"#s95.03", 1, !dbg !6371
  %27 = icmp sgt i64 %26, %6, !dbg !6363
  br i1 %27, label %L2, label %if, !dbg !6363

Using unsafe_view:

  %"#s95.03" = phi i64 [ %26, %if ], [ 1, %top ]
  %19 = add i64 %"#s95.03", -1, !dbg !6400
  %20 = getelementptr double* %6, i64 %19, !dbg !6400
  %21 = load double* %20, !dbg !6400
  %22 = getelementptr double* %10, i64 %19, !dbg !6400
  %23 = load double* %22, !dbg !6400
  %24 = fadd double %21, %23, !dbg !6400
  %25 = getelementptr double* %14, i64 %19, !dbg !6400
  store double %24, double* %25, !dbg !6400
  %26 = add i64 %"#s95.03", 1, !dbg !6400
  %27 = icmp sgt i64 %26, %17, !dbg !6399
  br i1 %27, label %L2, label %if, !dbg !6399

They are pretty similar, except that the @inbounds version has several additional statements that load the array pointers and cast them between jl_value_t* and double*. If these loads and casts can be elided, we get a very tight and efficient loop.

@StefanKarpinski
Member

A lot of those LLVM instructions are no-ops; maybe look at the actual machine code generated?

@lindahua
Contributor

lindahua commented Jul 5, 2013

Here is the generated assembly (for conciseness, only the loop body is shown):

The @inbounds version:

mov RAX, QWORD PTR [RDI + 8]
vmovsd  XMM0, QWORD PTR [RAX + 8*RCX - 8]
mov RAX, QWORD PTR [RSI + 8]
vaddsd  XMM0, XMM0, QWORD PTR [RAX + 8*RCX - 8]
mov RAX, QWORD PTR [RDX + 8]
vmovsd  QWORD PTR [RAX + 8*RCX - 8], XMM0
inc RCX
cmp RCX, R8

The unsafe_view version:

vmovsd  XMM0, QWORD PTR [RSI + 8*RDI - 8]
vaddsd  XMM0, XMM0, QWORD PTR [RDX + 8*RDI - 8]
vmovsd  QWORD PTR [RCX + 8*RDI - 8], XMM0
inc RDI
cmp RDI, RAX

Obviously, the @inbounds version generates three additional mov instructions that reload the array pointers into RAX, which is unnecessary; I suspect these come from the pointer load/cast instructions in the LLVM IR above.

@ViralBShah
Member

This should really not matter, but what if you add @inbounds to the unsafe_view version?

@lindahua
Contributor

lindahua commented Jul 5, 2013

unsafe_view performs no bounds checking even without @inbounds.

Those mov statements actually use additional CPU cycles, which explains the gap here.

@lindahua
Contributor

lindahua commented Jul 5, 2013

Thinking about this again, the two versions actually work differently:

For the @inbounds version, the pointer (the base address of the array) is loaded from memory into a register (via a mov instruction) every time an element is accessed.

For the unsafe_view version, the pointer is kept in a register, so there is no need to reload it at each iteration.

Loading a pointer variable into a register is a cheap operation, but it still introduces overhead. In a tight loop that executes only a handful of instructions per iteration, this overhead becomes significant; in the element-wise addition example above, it slows the overall loop down by about 50%.
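
Put differently, the gap is about where the pointer lives. Hoisting the data pointers out of the loop by hand gives essentially the unsafe_view behavior; a hedged sketch (add3! is hypothetical, and unsafe: the arrays must not be resized or garbage-collected while the raw pointers are in use):

function add3!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    # load each array's data pointer once, before the loop, so it can stay in a register
    px = pointer(x); py = pointer(y); pz = pointer(z)
    for i in 1 : length(x)
        unsafe_store!(pz, unsafe_load(px, i) + unsafe_load(py, i), i)
    end
end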

@ViralBShah
Member

Thanks. I misread the original message as saying that unsafe_view was slower than @inbounds.

@ViralBShah
Member

It would also be nice to have a command-line option that forces all bounds checks on, which could be the default for make debug.

@JeffBezanson
Member Author

This is on the list in #3440. As for the remaining gap: the code generator does not have a sufficiently good understanding of when a vector might change size, so it reloads the data pointer too often.

@StefanKarpinski
Member

I suspect the reload on every iteration is due to vectors being growable, and therefore movable. If you try the comparison with matrices instead of vectors, I suspect the difference might go away.

@lindahua
Contributor

lindahua commented Jul 5, 2013

Yes, the difference goes away when a Matrix is used; the generated ASM becomes the same in that case too. (A minimal sketch of the matrix variant is below.)
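
The matrix variant is just the same loop over Matrix arguments (the name add1_mat! is illustrative):

function add1_mat!(x::Matrix{Float64}, y::Matrix{Float64}, z::Matrix{Float64})
    # a Matrix cannot be resized, so its data pointer can stay in a register across iterations
    for i in 1 : length(x)
        @inbounds z[i] = x[i] + y[i]
    end
end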

Can this be tackled?

@JeffBezanson
Member Author

Yes I know how to fix this.

@StefanKarpinski
Member

What's the plan? It seems like this entire class of problems would go away if you could tell LLVM that only this thread can change the location of the vector. We know that function calls in this thread might change the address of a vector, but in this case there are no function calls, so if LLVM knew the reload is only necessary when this thread changes the vector's location, it could eliminate the reloads.

@JeffBezanson
Member Author

What it needs is aliasing information, or for me to hoist the loads to where we know they are needed and then rely on dead code elimination. There is also the easy case of vectors that don't escape (in the strictest possible sense of never being passed anywhere), which can be handled the same way I handle N-d arrays.

@StefanKarpinski StefanKarpinski deleted the jb/inbounds branch November 16, 2013 16:43