
Checking sizes on array indices #996

Closed
timholy opened this issue Jun 30, 2012 · 25 comments
Labels: needs decision (A decision on this change is needed)

@timholy
Member

timholy commented Jun 30, 2012

Currently the following is allowed in Julia:

julia> A = reshape(1:15,3,5)
3x5 Int64 Array:
 1  4  7  10  13
 2  5  8  11  14
 3  6  9  12  15

julia> A[4,2]
7

julia> X = zeros(1,12);

julia> X[2,2] = 5
5

julia> X
1x12 Float64 Array:
 0.0  0.0  5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

That's because the check only ensures that no invalid memory will be accessed; the individual indices are not checked against their corresponding dimensions. This seems reasonable, but it's a little unexpected (especially compared to Matlab) and could make it harder to notice bugs in algorithms.

The downside of more checking, of course, is that there is a performance hit.

I see three options:

  1. Leave things as they are
  2. Introduce checking to all ref/assign functions for arrays
  3. Do both: make ref/assign check by default, but also create ref_unsafe and assign_unsafe variants that don't check (see the sketch after this list). Also define a macro @disable_arrayind_check that one can use to wrap a function definition to get more performance out of well-tested code. That macro would "just" change all uses of ref/assign on arrays to their unsafe versions (not sure how easy this would be to write).
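
To make option 3 concrete, here's a minimal sketch of the checked/unchecked split for the simplest one-dimensional case (hypothetical names taken from the list above; ref_unsafe does not exist, and the real ref/assign methods are more general):

# Sketch only: ref_unsafe stands in for the raw element access with no index
# validation (arrayref is the existing builtin mentioned later in this thread).
ref_unsafe(A::Vector, i::Int) = arrayref(A, i)

# The default ref validates the index first, then defers to the unsafe variant.
function ref(A::Vector, i::Int)
    if i < 1 || i > length(A)
        error("ref: index $i out of range")
    end
    return ref_unsafe(A, i)
end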

Obviously the third is the most ambitious. Given the new array ref/assign code that just got checked in, I'm also going to be testing whether we can yank some of the specialized definitions (particularly abundant for assign). If so, that will make this issue easier to address.

Or perhaps there are other ways?

@timholy
Member Author

timholy commented Jun 30, 2012

Also related: issue #911.

@ViralBShah
Member

We should certainly introduce bounds checking in all cases. It would be great if we can do it in a way that allows the compiler to remove these checks when possible. I am not excited about the macro for unsafe variants.

@timholy
Member Author

timholy commented Jun 30, 2012

I don't at the moment see how the compiler can decide that, but let me help make the discussion more concrete with an example. Suppose we have a function:

check_bounds(dims::Dims, I::Indices...)

which does the obvious thing (loop over I[i][j] and test whether it lies between 1 and dims[i]). For the next two code snippets, let's also take A[i,j] to mean an access that does not check bounds. Then, for something like ref, this would be a bad way to do it:

for i = ind1
    for j = ind2
        check_bounds(size(A), i, j)
        X[storeind] = A[i,j]
        storeind += 1
    end
end

It's bad because each index vector has length m or n, yet check_bounds gets called m*n times. It would be much better to check each index separately in a loop at the beginning:

check_bounds(size(A,1), ind1)
check_bounds(size(A,2), ind2)
# Now do the loop, skipping the bounds check

In the case when both m and n are much bigger than 1, the overhead of the bounds check is negligible by comparison to the operation inside the double loop.
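
For concreteness, here is a minimal sketch of such a check_bounds (hypothetical implementation matching the signature above; not an existing function):

# Sketch only: verify that every index in each index vector lies within the
# corresponding dimension, throwing an error on the first violation.
function check_bounds(dims::Dims, I...)
    for d = 1:length(I)
        for i in I[d]
            if i < 1 || i > dims[d]
                error("index $i out of range for dimension $d of size $(dims[d])")
            end
        end
    end
end

# Convenience method for checking one dimension at a time, as in the calls above.
check_bounds(d::Int, ind) = check_bounds((d,), ind)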

(Actually, are we in effect doing the first version now? Does arrayref check its index on each access?)

The really tough case is an algorithm that does lots of scalar accesses, e.g., a matrix routine with lots of A[i,j] internally. If we want A[i,j] to be checked when typed at the command line, the same statements will also be checked inside the function every single time they run. Maybe the compiler can do something about that, but not being a compiler guy, that's out of my league to answer; I would naively guess it is very hard.

The idea behind a macro is to make it easy to write your function with full-checking (checking is turned on by default for A[i,j]). Once you're convinced it's bug-free, wrap it in @disable_arrayind_check begin ... end to get slightly higher performance. Not that I'm saying this is gorgeous, but I'm having a hard time thinking of a better way. Short of compiler magic (which there may well be), that is.

Libraries like Eigen provide an A(i,j) that checks its bounds and a ref(A,i,j) that does not. There's also a compiler flag to turn all A(i,j) into ref(A,i,j). One nice thing about the macro is that we'd be able to leave checking on in general and disable it only for the parts we're confident about.

@ViralBShah
Member

It is worth pointing out that bounds checks are also inserted by the compiler in arrayref and arrayset.

https://github.com/JuliaLang/julia/blob/master/src/codegen.cpp#L847
https://github.com/JuliaLang/julia/blob/master/src/codegen.cpp#L891

These checks will catch certain types of out of bounds issues. Compilers do eliminate bounds checks, but this probably needs more information at compile time.

I just don't like the idea of a large program where you insert @disable_bounds_check at the beginning and an end statement at the end. Since this is likely to interact with codegen at some later point anyway, a compiler switch may be a better way to go. That said, I am not religious about these views and am willing to try out a few different options.

In any case, we should certainly implement all the necessary safety checks. How we enable/disable them is something we can continue exploring and improving over time.


@timholy
Member Author

timholy commented Jun 30, 2012

Once you get down to codegen, you are beyond my knowledge of Julia. I'll be glad to hear more about how this works (and study it myself of course, but this will take time).

@StefanKarpinski
Member

@JeffBezanson, is there any way we could leverage the type system to help eliminate bounds checks? Like having invisible "dimension types" associated with each of an array object's indices and letting type inference pass that information along? That gets awfully close to having array dimension sizes in the types though.

@ViralBShah
Member

Are dimensions likely to always be available when type inference runs? Isn't this more like a compiler pass than anything specific to type inference?

@JeffBezanson has also said that we need a few other related things for array-indexing performance: hoisting some of the metadata access, and inlining or optimizing the assign() call. This is of course not directly related to this issue, but presumably we should plan things out so that we can go all the way.


@timholy
Member Author

timholy commented Jun 30, 2012

And even for direct copies, issue #190 still remains to be done. But 90% of our performance problems got cured this morning. I didn't want to add anything, like bounds-checking, that would seem to reduce the benefit!

@timholy
Member Author

timholy commented Jul 2, 2012

There's also this in array.c:

JL_CALLABLE(jl_f_arrayref)
{
    JL_NARGS(arrayref, 2, 2);
    JL_TYPECHK(arrayref, array, args[0]);
    JL_TYPECHK(arrayref, long, args[1]);
    jl_array_t *a = (jl_array_t*)args[0];
    size_t i = jl_unbox_long(args[1])-1;
    if (i >= a->length) {
        jl_errorf("ref array[%d]: index out of range", i+1);
    }
    return jl_arrayref(a, i);
}

This would seem to suggest that bounds are being checked on each access to an element of an array.

@ViralBShah
Member

As I understand it, either this version gets called or, if types are inferred, the fast codegen version gets called. Either way, there is always a check in the current situation.


@StefanKarpinski
Member

Yes, that's the run-time version, as in the one that gets used when fast code-gen doesn't occur.

@timholy
Member Author

timholy commented Jul 2, 2012

Gotcha. Quite informative. Would love a doc on "julia internals" someday (I know you have better things to do at the moment).

@JeffBezanson
Member

I will extend arrayref and arrayset to accept all indexes, and then code generation can emit in-line bounds checks for each dimension. check_bounds could be an intrinsic, which would allow it to be disabled by the same future compiler switch that disables all other bounds checks.
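
As a rough illustration (written in Julia for readability, with a hypothetical name; the real checks would be emitted inline by codegen), the per-dimension checks for a three-index access would amount to:

# Sketch of what inline per-dimension bounds checks for arrayref(A, i, j, k)
# would do, followed by the usual column-major linear offset computation.
function checked_arrayref(A::Array, i::Int, j::Int, k::Int)
    1 <= i <= size(A, 1) || error("index $i out of range for dimension 1")
    1 <= j <= size(A, 2) || error("index $j out of range for dimension 2")
    1 <= k <= size(A, 3) || error("index $k out of range for dimension 3")
    return A[i + size(A, 1)*((j - 1) + size(A, 2)*(k - 1))]
end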

@timholy
Member Author

timholy commented Jul 10, 2012

Is this for all integer indices, i.e., A[i,j,k]? If so, that will reverse all the progress we've made in multidimensional performance (#765). The whole point of that work is that you don't want to recompute i + size(A,1)*(j-1 + (k-1)*size(A,2)) each time you access one element of an array in a patterned way.
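
For example, patterned access can hoist that arithmetic so it is done once per column rather than once per element (hypothetical helper, purely to illustrate the point):

# Sketch only: copy column A[:, j, k] via a single precomputed offset,
# using cheap linear indexing in the hot loop.
function copy_column!(dest::Vector, A::Array, j::Int, k::Int)
    offset = size(A, 1)*((j - 1) + size(A, 2)*(k - 1))   # computed once
    for i = 1:size(A, 1)
        dest[i] = A[offset + i]
    end
    return dest
end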

@JeffBezanson
Member

You got that performance by using only linear indexing, right? In general, when the expression A[i,j,k] occurs in code you have to do all the work each time. Your indexing functions would do custom bounds checking using check_bounds.

@timholy
Member Author

timholy commented Jul 10, 2012

Not at all. You check each index upon entry to ref, before you do any looping over any of the indices. That way it's O(dN) and not O(N^d). See the third post in this thread.

@timholy
Member Author

timholy commented Jul 10, 2012

But when you are not accessing in a patterned way, but instead accessing individual elements "at random", then of course you do have to do a check on each index, each time. That's when bounds-checking gets nasty for performance, and why it would be nice to turn off in a well-vetted algorithm.

@JeffBezanson
Member

I did say "in general". In other words, when code contains A[i,j,k] and we have no other information.
This is why scalar ref and assign must generate bounds checks, which could be disabled by a compiler switch and/or an @inbounds macro. Of course, the general case is not the common case!

I also believe with some code generator improvements on our end LLVM could hoist some of the operations out of loops.

Code like your fancy indexing code is a different matter --- it uses linear indexing, so checking all dimensions has to be done in a custom way, as you describe. Using an intrinsic would allow those checks to also be disabled by the compiler switch.

So we need to do a couple different things: add the checks to arrayref(a, i, j, k, ...), add the macro and compiler switch, and address special library functions in custom ways.
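
Hypothetical usage of those pieces once they exist (the macro name is one used elsewhere in this thread; neither it nor the compiler switch is implemented yet) might look like:

# Sketch only: scalar accesses are bounds-checked by default; inside the
# proposed macro the generated checks would be elided for inline code.
function sumsq(A::Matrix{Float64})
    s = 0.0
    @disable_bounds_check begin
        for j = 1:size(A, 2), i = 1:size(A, 1)
            s += A[i, j]^2
        end
    end
    return s
end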

@timholy
Member Author

timholy commented Jul 10, 2012

That sounds like a plan. I can do the library functions (I'm sure) and macro (I think), if you can do the compiler end. Will there be an arrayref_unsafe that I should call explicitly if I've already handled all the checking? Or is there compiler magic to just know it doesn't need to check when called from certain functions?

@JeffBezanson
Member

The macro can't be implemented as a simple code rewrite, since macros can only see surface syntax. You can't even distinguish an array access from, say, a dictionary lookup. So, all we can do is surround the macro argument in special to-be-added forms bounds_check_off and bounds_check_pop_state. Codegen can then disable checks for code between those. It will only affect inline code, but I think it's the best we can do.

@timholy
Member Author

timholy commented Jul 10, 2012

Hmm, thanks for pointing that out, I was just beginning to wonder how I was going to do that.

So just to make sure we're on the same page: I'll rename the large majority of our ref/assign functions to ref_unsafe and assign_unsafe, then write ref and assign with the same arguments that do the bounds checking and call the unsafe variants for the actual work. The macro itself is simply:

macro disable_bounds_check(ex)
    quote
        bounds_check_off
        $ex
        bounds_check_pop_state
    end
end

which either one of us can write, but I'd let you do it just so you can pick names you like.

@JeffBezanson
Member

Oh god no! Actually having two versions of every indexing function would be horrible. A bit of performance is not worth that kind of nightmare. I will take care of the macro.

@timholy
Member Author

timholy commented Jul 10, 2012

Ah, good, thanks for clarifying---you're saying the macro can work even on the core functions to avoid the explicit call to check_bounds, and not just on A[i,j,k]?

So the only part for me to do is write check_bounds into the existing ref/assign functions, and you'll take care of all the hard stuff? :-)

@timholy
Member Author

timholy commented Jul 10, 2012

In other words, I guess I was wondering whether all library functions can be inlined; make_loop_nest and friends aren't tiny. Or is the notion that we won't worry about turning off explicit calls to check_bounds, because the overhead (in the typical case) is small?

@timholy
Member Author

timholy commented Jul 10, 2012

...of course, make_loop_nest doesn't actually get called except the first time... anyway, as you can tell, I'm a little murky on the deeper details! I'll just start writing the bounds-checked version and let you correct me if I'm wrong.
