
RFC: macros for disabling bounds checks #3268

Merged: 7 commits merged into master from jb/inbounds on Jul 5, 2013
Conversation

JeffBezanson
Member

Provides @inbounds expr and @boundscheck true|false expr. The former is just a shorter form of the latter.

The gains here are not huge; the best case is generic matmul, which is 2x faster. The vectorize benchmark in perf2 is 10% faster. But there are more places in Base that could benefit from @inbounds.

`@inbounds expr` allows bounds checks to be omitted for code syntactically
inside the argument expression.

adds @inbounds to matmul, comprehensions, and vectorized binary operators.
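
For illustration, a minimal sketch of the intended usage (mysum is just an example, not code from this patch):

function mysum(a::Vector{Float64})
    s = 0.0
    # indexing inside the @inbounds expression may skip its bounds checks
    @inbounds for i in 1:length(a)
        s += a[i]
    end
    return s
end

# longer equivalent of the loop annotation above: @boundscheck false <expr>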
@ViralBShah
Member

A 2x gain is nothing to sneeze at!

@GlenHertz
Contributor

+1. But I think @noboundscheck is a clearer name, since I'm unfamiliar with the term "inbounds". Maybe it is just me.

@johnmyleswhite
Member

@inbounds is basically short for @assert_inbounds, which might help to convey the fact that this macro won't turn off bounds checking, but will simply tell the compiler that turning off bounds checking is allowed.
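
For example (a hypothetical snippet, only to illustrate the semantics): since @inbounds is a promise rather than a switch, an invalid index inside it may silently touch the wrong memory instead of raising a BoundsError.

function unsafe_first(a::Vector{Float64})
    # the programmer asserts index 1 is valid; if a is empty, the check may be skipped
    # and the behavior is undefined rather than an error
    @inbounds x = a[1]
    return x
end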

@ViralBShah
Member

Is this ready to be merged in? I am keen to implement this in many of the sparse kernels.

@JeffBezanson JeffBezanson mentioned this pull request Jun 18, 2013
65 tasks
@ViralBShah ViralBShah mentioned this pull request Jun 23, 2013
JeffBezanson added a commit that referenced this pull request Jul 5, 2013
RFC: macros for disabling bounds checks
@JeffBezanson JeffBezanson merged commit c575c41 into master Jul 5, 2013
@ViralBShah
Member

Could you add some documentation to the manual - or should we have a separate doc issue?

@lindahua
Contributor

lindahua commented Jul 5, 2013

Great!!

@lindahua
Contributor

lindahua commented Jul 5, 2013

I did a simple benchmark for this, as follows:

using NumericExtensions   # provides unsafe_view

# baseline: plain loop, bounds checked on every access
function add0!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    for i in 1 : length(x)
        z[i] = x[i] + y[i]
    end
end

# same loop, with bounds checks elided via @inbounds
function add1!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    for i in 1 : length(x)
        @inbounds z[i] = x[i] + y[i]
    end
end

# same loop over unsafe views: raw pointer indexing, no bounds checks
function add2!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    xv = unsafe_view(x)
    yv = unsafe_view(y)
    zv = unsafe_view(z)
    for i in 1 : length(x)
        zv[i] = xv[i] + yv[i]
    end
end

n = 10^3
x = rand(n)
y = rand(n)
z = zeros(n)

# warm-up calls, so JIT compilation is excluded from the timings
add0!(x, y, z)
add1!(x, y, z)
add2!(x, y, z)

@time for i in 1 : 10^4 add0!(x, y, z) end
@time for i in 1 : 10^4 add1!(x, y, z) end
@time for i in 1 : 10^4 add2!(x, y, z) end

Results:

elapsed time: 0.015487923 seconds   # without inbounds
elapsed time: 0.009834733 seconds   # use inbounds
elapsed time: 0.006274827 seconds   # use unsafe_view

@inbounds of course results in a considerable improvement (about 60% faster than without it). What I am still curious about is why there remains such a gap between @inbounds and unsafe_view (which simply does pointer arithmetic and an unsafe_load).
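
For context, unsafe_view returns a lightweight view object that indexes straight through the array's raw data pointer. A rough sketch of the idea (not NumericExtensions' actual implementation; the type and function names here are illustrative):

import Base.getindex
import Base.setindex!

immutable UnsafeVecSketch
    ptr::Ptr{Float64}   # raw data pointer; the caller must keep the parent array alive
end

unsafe_vec_sketch(a::Vector{Float64}) = UnsafeVecSketch(pointer(a))

# unchecked indexing: plain pointer arithmetic plus unsafe_load/unsafe_store!, no bounds check
getindex(v::UnsafeVecSketch, i::Int) = unsafe_load(v.ptr, i)
setindex!(v::UnsafeVecSketch, x::Float64, i::Int) = unsafe_store!(v.ptr, x, i)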

@lindahua
Contributor

lindahua commented Jul 5, 2013

Here is a comparison of the disassembled loop bodies (LLVM IR):

Using @inbounds:

  %"#s95.03" = phi i64 [ 1, %if.lr.ph ], [ %26, %if ]
  %13 = add i64 %"#s95.03", -1, !dbg !6370
  %14 = load %jl_value_t** %8, !dbg !6370
  %15 = bitcast %jl_value_t* %14 to double*, !dbg !6370
  %16 = getelementptr double* %15, i64 %13, !dbg !6370
  %17 = load double* %16, !dbg !6370
  %18 = load %jl_value_t** %10, !dbg !6370
  %19 = bitcast %jl_value_t* %18 to double*, !dbg !6370
  %20 = getelementptr double* %19, i64 %13, !dbg !6370
  %21 = load double* %20, !dbg !6370
  %22 = fadd double %17, %21, !dbg !6370
  %23 = load %jl_value_t** %12, !dbg !6370
  %24 = bitcast %jl_value_t* %23 to double*, !dbg !6370
  %25 = getelementptr double* %24, i64 %13, !dbg !6370
  store double %22, double* %25, !dbg !6370
  %26 = add i64 %"#s95.03", 1, !dbg !6371
  %27 = icmp sgt i64 %26, %6, !dbg !6363
  br i1 %27, label %L2, label %if, !dbg !6363

Using unsafe_view:

  %"#s95.03" = phi i64 [ %26, %if ], [ 1, %top ]
  %19 = add i64 %"#s95.03", -1, !dbg !6400
  %20 = getelementptr double* %6, i64 %19, !dbg !6400
  %21 = load double* %20, !dbg !6400
  %22 = getelementptr double* %10, i64 %19, !dbg !6400
  %23 = load double* %22, !dbg !6400
  %24 = fadd double %21, %23, !dbg !6400
  %25 = getelementptr double* %14, i64 %19, !dbg !6400
  store double %24, double* %25, !dbg !6400
  %26 = add i64 %"#s95.03", 1, !dbg !6400
  %27 = icmp sgt i64 %26, %17, !dbg !6399
  br i1 %27, label %L2, label %if, !dbg !6399

They are pretty similar, except that the @inbounds version has several additional statements that load the array pointers and cast them between jl_value_t* and double*. If these loads and casts can be elided, we get a very tight and efficient loop.

@StefanKarpinski
Member

A lot of those LLVM instructions are no-ops; maybe look at the actual machine code generated?

@lindahua
Contributor

lindahua commented Jul 5, 2013

Here is the generated assembly (for conciseness, only the loop body is shown):

The @inbounds version:

mov RAX, QWORD PTR [RDI + 8]
vmovsd  XMM0, QWORD PTR [RAX + 8*RCX - 8]
mov RAX, QWORD PTR [RSI + 8]
vaddsd  XMM0, XMM0, QWORD PTR [RAX + 8*RCX - 8]
mov RAX, QWORD PTR [RDX + 8]
vmovsd  QWORD PTR [RAX + 8*RCX - 8], XMM0
inc RCX
cmp RCX, R8

The unsafe_view version:

vmovsd  XMM0, QWORD PTR [RSI + 8*RDI - 8]
vaddsd  XMM0, XMM0, QWORD PTR [RDX + 8*RDI - 8]
vmovsd  QWORD PTR [RCX + 8*RDI - 8], XMM0
inc RDI
cmp RDI, RAX

Obviously, the @inbounds version generates three additional mov instructions that reload the array pointers into RAX, which is unnecessary; I suspect these come from the pointer load/cast instructions in the LLVM IR above.

@ViralBShah
Member

This should really not matter, but what if you add @inbounds to the unsafe_view version?

@lindahua
Contributor

lindahua commented Jul 5, 2013

unsafe_view performs no bounds checking even without @inbounds.

Those mov statements actually use additional CPU cycles, which explains the gap here.

@lindahua
Contributor

lindahua commented Jul 5, 2013

Thinking about this again, the two versions actually work differently:

For the @inbounds version, the pointer (the base address of the array) is loaded from memory into a register (via a mov instruction) every time an element is accessed.

For the unsafe_view version, the pointer is kept in a register, so there is no need to reload it at each iteration.

Loading a pointer variable into a register is a cheap operation, but it still introduces overhead. In a tight loop that executes only a handful of instructions per iteration, this overhead becomes significant; in the element-wise addition example above, it slows the overall loop down by about 50%.
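
Put differently, the gap is about where the pointer lives. Hoisting the data pointers out of the loop by hand gives essentially the unsafe_view behavior; a hedged sketch (add3! is hypothetical, and unsafe: the arrays must not be resized or garbage-collected while the raw pointers are in use):

function add3!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    # load each array's data pointer once, before the loop, so it can stay in a register
    px = pointer(x); py = pointer(y); pz = pointer(z)
    for i in 1 : length(x)
        unsafe_store!(pz, unsafe_load(px, i) + unsafe_load(py, i), i)
    end
end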

@ViralBShah
Member

Thanks. I misread the original message as saying that unsafe_view was slower than @inbounds.

@ViralBShah
Member

It would also be nice to have a command-line option that forces all bounds checks on, which could be the default for make debug.

@JeffBezanson
Member Author

This is on the list in #3440. As for the remaining gap: the code generator does not have a sufficiently good understanding of when a vector might change size, so it reloads the data pointer too often.

@StefanKarpinski
Member

I suspect the reload on every iteration is due to vectors being growable, and therefore movable. If you try the comparison with matrices instead of vectors, I suspect the difference might go away.

@lindahua
Contributor

lindahua commented Jul 5, 2013

Yes, the difference goes away when a Matrix is used; the generated ASM becomes the same in that case too. (A minimal sketch of the matrix variant is below.)
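
The matrix variant is just the same loop over Matrix arguments (the name add1_mat! is illustrative):

function add1_mat!(x::Matrix{Float64}, y::Matrix{Float64}, z::Matrix{Float64})
    # a Matrix cannot be resized, so its data pointer can stay in a register across iterations
    for i in 1 : length(x)
        @inbounds z[i] = x[i] + y[i]
    end
end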

Can this be tackled?

@JeffBezanson
Member Author

Yes I know how to fix this.

@StefanKarpinski
Member

What's the plan? It seems like this entire class of problems would go away if you could tell LLVM that only this thread can change the location of the vector. We know that function calls in this thread might change the address of a vector, but in this case there are no function calls, so if LLVM knew the reload is only necessary when this thread changes the vector's location, it could eliminate the reloads.

@JeffBezanson
Member Author

What it needs is aliasing information, or for me to hoist the loads to where we know they are needed and then rely on dead code elimination. There is also the easy case of vectors that don't escape (in the strictest possible sense of never being passed anywhere), which can be handled the same way I handle N-d arrays.

@StefanKarpinski StefanKarpinski deleted the jb/inbounds branch November 16, 2013 16:43