RFC: macros for disabling bounds checks #3268
Conversation
`@inbounds expr` allows bounds checks to be omitted for code syntactically inside the argument expression. Adds `@inbounds` to matmul, comprehensions, and vectorized binary operators.
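For illustration, a minimal sketch of how the annotation might be applied; the function below is illustrative and not part of this PR:

# Sketch (illustrative name): elide bounds checks for all indexing that
# appears syntactically inside the annotated expression.
function axpy_sketch!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @inbounds for i in 1 : length(x)
        y[i] += a * x[i]    # no bounds checks emitted for these accesses
    end
    return y
end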
A 2x gain is nothing to sneeze at!
+1. But I think @noboundscheck is a clearer name, since I'm unfamiliar with "inbounds". Maybe it is just me.
Conflicts: src/alloc.c src/codegen.cpp src/jltypes.c
Is this ready to be merged in? I am keen to implement this in many of the sparse kernels.
Could you add some documentation to the manual, or should we have a separate doc issue?
Great!!
I did a simple benchmark for this, as follows:

using NumericExtensions

function add0!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    for i in 1 : length(x)
        z[i] = x[i] + y[i]
    end
end

function add1!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    for i in 1 : length(x)
        @inbounds z[i] = x[i] + y[i]
    end
end

function add2!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    xv = unsafe_view(x)
    yv = unsafe_view(y)
    zv = unsafe_view(z)
    for i in 1 : length(x)
        zv[i] = xv[i] + yv[i]
    end
end

n = 10^3
x = rand(n)
y = rand(n)
z = zeros(n)

add0!(x, y, z)
add1!(x, y, z)
add2!(x, y, z)

@time for i in 1 : 10^4 add0!(x, y, z) end
@time for i in 1 : 10^4 add1!(x, y, z) end
@time for i in 1 : 10^4 add2!(x, y, z) end

Results:
Here is a comparison of the disassembled loops, using @inbounds versus using unsafe_view. They are pretty similar, except that the @inbounds version contains a few extra instructions.
A lot of those LLVM instructions are no-ops; maybe look at the actual machine code generated?
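For reference, a sketch of how both listings can be obtained with the standard reflection functions, applied to the benchmark functions defined above (assuming these functions are available in the Julia build at hand):

# Inspect the LLVM IR and the generated machine code for the @inbounds version.
argtypes = (Vector{Float64}, Vector{Float64}, Vector{Float64})
code_llvm(add1!, argtypes)     # LLVM IR
code_native(add1!, argtypes)   # native assembly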
Here is the ASM code (for conciseness, only the loop body is shown).

The @inbounds version:

mov RAX, QWORD PTR [RDI + 8]
vmovsd XMM0, QWORD PTR [RAX + 8*RCX - 8]
mov RAX, QWORD PTR [RSI + 8]
vaddsd XMM0, XMM0, QWORD PTR [RAX + 8*RCX - 8]
mov RAX, QWORD PTR [RDX + 8]
vmovsd QWORD PTR [RAX + 8*RCX - 8], XMM0
inc RCX
cmp RCX, R8

The unsafe_view version:

vmovsd XMM0, QWORD PTR [RSI + 8*RDI - 8]
vaddsd XMM0, XMM0, QWORD PTR [RDX + 8*RDI - 8]
vmovsd QWORD PTR [RCX + 8*RDI - 8], XMM0
inc RDI
cmp RDI, RAX

Obviously, the latter loop is much tighter.
This should really not matter, but what if you add …
Those …
Think about this again. The ways these two versions work are different. For the @inbounds version, the array's data pointer is reloaded from the array structure on every iteration (the extra mov instructions above). For the unsafe_view version, the data pointers are loaded into registers once, before the loop. Loading a pointer variable into a register is a cheap operation, but it still introduces overhead. In a tight loop where only a couple of instructions are executed per iteration, such overhead can become significant. In the example above (element-wise addition), it slows down the overall performance by 50%.
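To make the distinction concrete, here is a minimal sketch of manually hoisting the data pointers out of the loop, which is roughly what unsafe_view achieves (add3! is an illustrative name, not from this thread):

# Sketch: hoist the data pointers out of the loop so they are loaded only once.
# x, y, z stay rooted as function arguments, so the raw pointers remain valid here.
function add3!(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    px = pointer(x)
    py = pointer(y)
    pz = pointer(z)
    for i in 1 : length(x)
        unsafe_store!(pz, unsafe_load(px, i) + unsafe_load(py, i), i)
    end
    return z
end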
Thanks. I read the original message incorrectly.
It would also be nice to have a command line option that enables all bounds checks, which could be the case for …
This is on the list in #3440. The reason is that the code generator does not have a sufficiently good understanding of when a vector might change size, and so is reloading its data pointer too often.
I suspect the reload on every iteration is due to vectors being growable, and therefore movable. If you try the comparison for matrices instead of vectors, I think the difference might go away.
Yes, the difference goes away when matrices are used instead of vectors. Can this be tackled?
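For reference, a sketch of the matrix variant of the earlier benchmark (the function name is illustrative); since a Matrix cannot be resized, its data pointer cannot move, which is why the reloads disappear:

# Same element-wise addition, but over matrices; linear indexing works for Matrix too.
function addm!(x::Matrix{Float64}, y::Matrix{Float64}, z::Matrix{Float64})
    for i in 1 : length(x)
        @inbounds z[i] = x[i] + y[i]
    end
    return z
end

n = 10^3
xm = rand(n, 1)
ym = rand(n, 1)
zm = zeros(n, 1)
@time for i in 1 : 10^4 addm!(xm, ym, zm) end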
Yes, I know how to fix this.
What's the plan? It seems like this entire class of problems would go away if you could communicate to LLVM that only this thread might change the location of the vector. We know that function calls in this thread might change the address of a vector, but in this case there are no function calls, so if LLVM knew that the reload was only necessary if this thread changed the location of the vector, then it could eliminate the reloads.
What it needs is aliasing information, or for me to hoist the loads to where we know they are needed and then rely on dead code elimination. There is also the easy case of vectors that don't escape (in the strictest possible sense of never being passed anywhere), which can be handled the same way I handle N-d arrays.
Provides `@inbounds expr` and `@boundscheck true|false expr`. The former is just a shorter form of the latter. The gains here are not huge; the best case is generic matmul, which is 2x faster. The vectoriz benchmark in perf2 is 10% faster. But there are more places in Base that could benefit from `inbounds`.
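As a rough illustration of the two forms described above, a sketch assuming the two-argument @boundscheck syntax this PR introduces (names and values are illustrative; later versions of the macro may differ):

# Equivalent ways to control bounds checking for a single expression.
x = zeros(10)
@inbounds x[1] = 1.0               # shorthand form
@boundscheck false x[2] = 2.0      # explicit form: disable the check
@boundscheck true  x[3] = 3.0      # explicit form: keep the check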