translate-c: handle negative array indices #8589

ehaas · 2021-04-21T07:26:48Z

Take advantage of wrapping pointer arithmetic to enable negative array
subscripts.

LemonBoy · 2021-04-21T08:00:36Z

wrapping pointer arithmetic

Let me CC @SpexGuy here, what does the specification say about pointer arithmetic?
#1918 suggests the wrap-around is unintended and should be fixed (does the + vs %+ distinction apply here?).

From a LLVM point of view we're codegen'ing pointer arithmetic using inbound GEPs and that's quite dangerous as we're potentially violating more than a single requirement:

If the inbounds keyword is present, the result value of the getelementptr is a poison value if one of the following rules is violated:

    The base pointer has an in bounds address of an allocated object, which means that it points into an allocated object, or to its end. The only in bounds address for a null pointer in the default address-space is the null pointer itself.
    If the type of an index is larger than the pointer index type, the truncation to the pointer index type preserves the signed value.
    The multiplication of an index by the type size does not wrap the pointer index type in a signed sense (nsw).
    The successive addition of offsets (without adding the base address) does not wrap the pointer index type in a signed sense (nsw).
    The successive addition of the current address, interpreted as an unsigned number, and an offset, interpreted as a signed number, does not wrap the unsigned address space and remains in bounds of the allocated object. As a corollary, if the added offset is non-negative, the addition does not wrap in an unsigned sense (nuw).
    In cases where the base is a vector of pointers, the inbounds keyword applies to each of the computations element-wise.

We'd be on the safer side by codegen'ing the non-wrapping arithmetic using non-inbounds GEPs and the wrapping one using ptrtoint + inttoptr, the latter is also "safer" from a provenance point of view (but comes at a higher cost, this kind of int<->ptr conversion inhibits some optimizations).

SpexGuy · 2021-04-21T14:46:40Z

Good question. We haven't really thought about this yet. I expect that in the end either +% on pointers will be unsupported or it will do a ptr-int-ptr conversion and allow breaking provenance. In the meantime I would recommend implementing array access the way C does: (ptr + offset).*.

ehaas · 2021-04-21T21:07:16Z

(ptr + offset).* won't work currently if offset is signed (e.g. error: expected type 'usize', found 'c_int' note: unsigned 64-bit int cannot represent all possible signed 32-bit values). I just want to double check that it's safe to cast the offset with @bitCast(usize, @intCast(isize, offset)) before doing the addition? (this is already what we're doing for regular pointer arithmetic - see #8139). The alternative would be to check the value at runtime, absCast it, and add/subtract as appropriate.

SpexGuy · 2021-04-21T21:59:51Z

Oh, my bad, yeah, looks like ptr + isize is a compile error. This is a flaw in the language design, it means there are only two safe ways to do what you want:

// 1. check the sign, good for alias analysis, bad for everything else
res = if (offset >= 0) (ptr + @intCast(usize, offset)) else (ptr - @intCast(usize, -offset));
// 2. use integers, bad for alias analysis, good for everything else
res = @intToPtr(*T, @ptrToInt(ptr) +% @bitCast(usize, offset) * @sizeOf(T));

I'll bring this up at a design meeting, it's the same problem that prevents us from adding a signed value to an unsigned value without either giving up unsigned range or giving up overflow checks.

ehaas · 2021-04-21T22:42:50Z

Cool, I can do option 1 for now. Should I also update the pointer arithmetic code from my previously-linked PR to use the same approach?

SpexGuy · 2021-04-21T22:59:13Z

Yeah, that's probably safer until we figure out what +% actually should mean on pointers.

SpexGuy · 2021-04-22T01:35:01Z

We talked about this at the design meeting, and decided

[*]T + isize should be allowed, and overflow is checked UB if the result doesn't fit in usize
[*]T +% anything should be a compile error

So the long term path here is to use +, but in the short term do what works.

LemonBoy · 2021-04-22T07:43:26Z

should be allowed, and overflow is checked UB if the result doesn't fit in usize

I take this is also the case for a RHS of type usize and also for underflow.

LemonBoy · 2021-04-23T07:00:02Z

There's another problem of little practical interest that I've stumbled into while implementing the safety checks for pointer arithmetic.
As said before we're modeling every pointer access with an inbounds gep and, upon closer reading of the inbounds rules, we're squarely in the undefined-behaviour camp: the gep RHS (sum of all the indices multiplied by the base element size, the LHS is the base pointer) is treated as a signed value therefore limiting the range!
At the moment I have implemented overflow checks following this and the wrap-around rule, but it may make sense to either restrict the RHS to isize only or to drop the inbounds annotation and take the possible performance hit.

andrewrk · 2021-04-23T07:19:30Z

Hmmm. If we used isize for the pointer arithmetic type, then adding ptr + byte_slice.len would be a compile error. That's really unfortunate. And I don't think we can chop a bit off our usizes.

Maybe it makes sense to not do inbounds for pointer arithmetic. We would still be using inbounds for everything else.

I need to think about this for a while. Feel free to propose ideas.

SpexGuy · 2021-04-23T08:24:38Z

Would it make sense to lower pointer addition with a usize as bitCast to isize + add?

LemonBoy · 2021-04-23T10:00:33Z

There's a bit more than this, the use of inbound GEPs limits the addressable memory to the lower half and so array indices, .len are u<address-space-width - 1>.

RogierBrussee · 2021-04-23T10:56:18Z

Without modular aka wrapping pointer arithmetic, how are you going to translate

extern char mem[4];

int main(){
char* p = mem + 2;

//The hardware just does modular aka wrapping arithmetic 
//and that is therefore what the C compiler does 
char* q = p +  ~(uintptr_t)0;
assert (p - 1 == q);

//perfectly unproblematic because q  == &mem[1]

q[0] = 'a';
return 0;

}

(https://godbolt.org/z/f1TY9d88x)

Of course this is a contrived example but someday people will write something like

(char*)p + (offset(struct foo, baz) - offsetof(struct foo, bar))

which (since all arithmetic with unsigned integers in C is modular aka wrapping and the result of offsetof is an unsigned size_t) means, oops, the result of the the subtraction is a very big unsigned number.

Maybe we should pedantically have written

((char*)p + offset(struct foo, baz)) - offsetof(struct foo, bar)

or

(char*)p + (diff_t) ( offset(struct foo, baz) - offsetof(struct foo, bar)).

but the hardware only knowns about modular aka wrapping arithmetic so the difference is harmless.

To define +% for pointers and unsigned integers, if

ptr: [*]T;
uoffset : uXX

then set

ptr +% uoffset == @intToPtr([*]T, @ptrToInt(ptr) +% @as(usize, offset) % @sizeof(T));
ptr -% uoffset == @intToPtr([]T, @ptrToInt(ptr) -% @as(usize, offset) *% @sizeof(T));

while in the signed case

ptr: [*]T ;
soffset : iXX

ptr +% soffset = ptr +% @bitcast(usize, @as(isize, offset))
ptr -% soffset = ptr -% @bitcast(usize, @as(isize, offset)).

and a modular pointer difference

ptr1, ptr2: [*]T;

ptr1 -% ptr2 == @divExact(@as(isize, @ptrToInt(ptr1) -% @ptrToInt(ptr2)), @sizeof(T)) //preserve sign

Defined in this way the expected associativity properties hold for n,m:isize or n, m:usize

ptr +% (n +% m) == (ptr +% n) +% m
ptr +% (n -% m) == (ptr +% n) -% m
ptr -% (n +% m) == (ptr -% n) -% m
ptr -% (n -% m) == (ptr -% n) +% m

Since provided no overflow occurs we have for signed integers @as(isize, n + m) == @as(isize, n) + @as(isize,m ) (and similar for smaller unsigned integers with usize) we have for all integer types and provided no overflow occurs

ptr +% (n + m) == (ptr +% n) +% m

(Note that in terms of modular aka wrapping arithmetic, i.e. if we reduce the offset mod 2^{log2(@Bitsize)} and if there existed modular (aka wrapping) integer type msize (#7512) then one could simply define

[*T] : ptr
offset: msize

ptr +% offset == @intToPtr([*]T, @as(msize, @ptrToInt(ptr)) +% offset *% @sizeof(T))

and all the integer casting in the definitions above just amount to @as(msize, ) (i.e. reduction mod 2^{log2(@Bitsize)}).

Arguably, if you want to stress that pointer arithmetic is an intrinsically dangerous operation that you should try to avoid if you can, but that makes sense on the low, close to the hardware level, only having modular aka wrapping pointer arithmetic +% (and -%, and pointer difference -% with values in msize, or in its absence isize) may be the sane way to go.

ehaas · 2021-04-27T07:11:42Z

The generated code is pretty wild but it works now (on my machine, we'll see what CI says). Hopefully with less undefined behavior.

A rather complicated workaround for handling signed array subscripts. Once `[*]T + isize` is allowed, this can be removed. Fixes ziglang#8556

ziglang#8589 introduced correct handling of signed (possibly negative) array access of pointers. Since unadorned integer literals in C are signed, this resulted in inefficient generated code when indexing a pointer by a non-negative integer literal.

#8589 introduced correct handling of signed (possibly negative) array access of pointers. Since unadorned integer literals in C are signed, this resulted in inefficient generated code when indexing a pointer by a non-negative integer literal.

ehaas marked this pull request as draft April 21, 2021 15:13

ehaas force-pushed the translate-c-negative-array-index branch from b0c35f5 to 780ed01 Compare April 27, 2021 07:09

ehaas marked this pull request as ready for review April 27, 2021 07:10

ehaas force-pushed the translate-c-negative-array-index branch 2 times, most recently from 157e5d3 to 33535d5 Compare May 10, 2021 21:17

Vexu added the translate-c C to Zig source translation feature (@cImport) label May 31, 2021

translate-c: handle signed array subscripts

c6f0b24

A rather complicated workaround for handling signed array subscripts. Once `[*]T + isize` is allowed, this can be removed. Fixes ziglang#8556

andrewrk force-pushed the translate-c-negative-array-index branch from 33535d5 to c6f0b24 Compare July 28, 2021 21:45

andrewrk merged commit 8028e46 into ziglang:master Jul 28, 2021

ehaas deleted the translate-c-negative-array-index branch July 28, 2021 23:41

ehaas mentioned this pull request Aug 2, 2021

translate-c: better codegen for pointer index by int literal #9508

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

translate-c: handle negative array indices #8589

translate-c: handle negative array indices #8589

ehaas commented Apr 21, 2021

LemonBoy commented Apr 21, 2021

SpexGuy commented Apr 21, 2021

ehaas commented Apr 21, 2021

SpexGuy commented Apr 21, 2021

ehaas commented Apr 21, 2021

SpexGuy commented Apr 21, 2021

SpexGuy commented Apr 22, 2021

LemonBoy commented Apr 22, 2021

LemonBoy commented Apr 23, 2021

andrewrk commented Apr 23, 2021

SpexGuy commented Apr 23, 2021

LemonBoy commented Apr 23, 2021

RogierBrussee commented Apr 23, 2021

ehaas commented Apr 27, 2021

translate-c: handle negative array indices #8589

translate-c: handle negative array indices #8589

Conversation

ehaas commented Apr 21, 2021

LemonBoy commented Apr 21, 2021

SpexGuy commented Apr 21, 2021

ehaas commented Apr 21, 2021

SpexGuy commented Apr 21, 2021

ehaas commented Apr 21, 2021

SpexGuy commented Apr 21, 2021

SpexGuy commented Apr 22, 2021

LemonBoy commented Apr 22, 2021

LemonBoy commented Apr 23, 2021

andrewrk commented Apr 23, 2021

SpexGuy commented Apr 23, 2021

LemonBoy commented Apr 23, 2021

RogierBrussee commented Apr 23, 2021

ehaas commented Apr 27, 2021