Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

translate-c: handle negative array indices #8589

Merged
merged 1 commit into from
Jul 28, 2021

Conversation

ehaas
Copy link
Contributor

@ehaas ehaas commented Apr 21, 2021

Take advantage of wrapping pointer arithmetic to enable negative array
subscripts.

Fixes #8556

@LemonBoy
Copy link
Contributor

wrapping pointer arithmetic

Let me CC @SpexGuy here, what does the specification say about pointer arithmetic?
#1918 suggests the wrap-around is unintended and should be fixed (does the + vs %+ distinction apply here?).

From a LLVM point of view we're codegen'ing pointer arithmetic using inbound GEPs and that's quite dangerous as we're potentially violating more than a single requirement:

If the inbounds keyword is present, the result value of the getelementptr is a poison value if one of the following rules is violated:

    The base pointer has an in bounds address of an allocated object, which means that it points into an allocated object, or to its end. The only in bounds address for a null pointer in the default address-space is the null pointer itself.
    If the type of an index is larger than the pointer index type, the truncation to the pointer index type preserves the signed value.
    The multiplication of an index by the type size does not wrap the pointer index type in a signed sense (nsw).
    The successive addition of offsets (without adding the base address) does not wrap the pointer index type in a signed sense (nsw).
    The successive addition of the current address, interpreted as an unsigned number, and an offset, interpreted as a signed number, does not wrap the unsigned address space and remains in bounds of the allocated object. As a corollary, if the added offset is non-negative, the addition does not wrap in an unsigned sense (nuw).
    In cases where the base is a vector of pointers, the inbounds keyword applies to each of the computations element-wise.

We'd be on the safer side by codegen'ing the non-wrapping arithmetic using non-inbounds GEPs and the wrapping one using ptrtoint + inttoptr, the latter is also "safer" from a provenance point of view (but comes at a higher cost, this kind of int<->ptr conversion inhibits some optimizations).

@SpexGuy
Copy link
Contributor

SpexGuy commented Apr 21, 2021

Good question. We haven't really thought about this yet. I expect that in the end either +% on pointers will be unsupported or it will do a ptr-int-ptr conversion and allow breaking provenance. In the meantime I would recommend implementing array access the way C does: (ptr + offset).*.

@ehaas ehaas marked this pull request as draft April 21, 2021 15:13
@ehaas
Copy link
Contributor Author

ehaas commented Apr 21, 2021

(ptr + offset).* won't work currently if offset is signed (e.g. error: expected type 'usize', found 'c_int' note: unsigned 64-bit int cannot represent all possible signed 32-bit values). I just want to double check that it's safe to cast the offset with @bitCast(usize, @intCast(isize, offset)) before doing the addition? (this is already what we're doing for regular pointer arithmetic - see #8139). The alternative would be to check the value at runtime, absCast it, and add/subtract as appropriate.

@SpexGuy
Copy link
Contributor

SpexGuy commented Apr 21, 2021

Oh, my bad, yeah, looks like ptr + isize is a compile error. This is a flaw in the language design, it means there are only two safe ways to do what you want:

// 1. check the sign, good for alias analysis, bad for everything else
res = if (offset >= 0) (ptr + @intCast(usize, offset)) else (ptr - @intCast(usize, -offset));
// 2. use integers, bad for alias analysis, good for everything else
res = @intToPtr(*T, @ptrToInt(ptr) +% @bitCast(usize, offset) * @sizeOf(T));

I'll bring this up at a design meeting, it's the same problem that prevents us from adding a signed value to an unsigned value without either giving up unsigned range or giving up overflow checks.

@ehaas
Copy link
Contributor Author

ehaas commented Apr 21, 2021

Cool, I can do option 1 for now. Should I also update the pointer arithmetic code from my previously-linked PR to use the same approach?

@SpexGuy
Copy link
Contributor

SpexGuy commented Apr 21, 2021

Yeah, that's probably safer until we figure out what +% actually should mean on pointers.

@SpexGuy
Copy link
Contributor

SpexGuy commented Apr 22, 2021

We talked about this at the design meeting, and decided

  • [*]T + isize should be allowed, and overflow is checked UB if the result doesn't fit in usize
  • [*]T +% anything should be a compile error

So the long term path here is to use +, but in the short term do what works.

@LemonBoy
Copy link
Contributor

should be allowed, and overflow is checked UB if the result doesn't fit in usize

I take this is also the case for a RHS of type usize and also for underflow.

@LemonBoy
Copy link
Contributor

There's another problem of little practical interest that I've stumbled into while implementing the safety checks for pointer arithmetic.
As said before we're modeling every pointer access with an inbounds gep and, upon closer reading of the inbounds rules, we're squarely in the undefined-behaviour camp: the gep RHS (sum of all the indices multiplied by the base element size, the LHS is the base pointer) is treated as a signed value therefore limiting the range!
At the moment I have implemented overflow checks following this and the wrap-around rule, but it may make sense to either restrict the RHS to isize only or to drop the inbounds annotation and take the possible performance hit.

@andrewrk
Copy link
Member

Hmmm. If we used isize for the pointer arithmetic type, then adding ptr + byte_slice.len would be a compile error. That's really unfortunate. And I don't think we can chop a bit off our usizes.

Maybe it makes sense to not do inbounds for pointer arithmetic. We would still be using inbounds for everything else.

I need to think about this for a while. Feel free to propose ideas.

@SpexGuy
Copy link
Contributor

SpexGuy commented Apr 23, 2021

Would it make sense to lower pointer addition with a usize as bitCast to isize + add?

@LemonBoy
Copy link
Contributor

There's a bit more than this, the use of inbound GEPs limits the addressable memory to the lower half and so array indices, .len are u<address-space-width - 1>.

@RogierBrussee
Copy link

Without modular aka wrapping pointer arithmetic, how are you going to translate

extern char mem[4];

int main(){
char* p = mem + 2;

//The hardware just does modular aka wrapping arithmetic 
//and that is therefore what the C compiler does 
char* q = p +  ~(uintptr_t)0;
assert (p - 1 == q);

//perfectly unproblematic because q  == &mem[1]

q[0] = 'a';
return 0;

}

(https://godbolt.org/z/f1TY9d88x)

Of course this is a contrived example but someday people will write something like

(char*)p + (offset(struct foo, baz) - offsetof(struct foo, bar))

which (since all arithmetic with unsigned integers in C is modular aka wrapping and the result of offsetof is an unsigned size_t) means, oops, the result of the the subtraction is a very big unsigned number.

Maybe we should pedantically have written

((char*)p + offset(struct foo, baz)) - offsetof(struct foo, bar)

or

(char*)p + (diff_t) ( offset(struct foo, baz) - offsetof(struct foo, bar)).

but the hardware only knowns about modular aka wrapping arithmetic so the difference is harmless.

To define +% for pointers and unsigned integers, if

ptr: [*]T;
uoffset : uXX

then set

ptr +% uoffset == @intToPtr([*]T, @ptrToInt(ptr) +% @as(usize, offset) % @sizeof(T));
ptr -% uoffset == @intToPtr([
]T, @ptrToInt(ptr) -% @as(usize, offset) *% @sizeof(T));

while in the signed case

ptr: [*]T ;
soffset : iXX

ptr +% soffset = ptr +% @bitcast(usize, @as(isize, offset))
ptr -% soffset = ptr -% @bitcast(usize, @as(isize, offset)).

and a modular pointer difference

ptr1, ptr2: [*]T;

ptr1 -% ptr2 == @divExact(@as(isize, @ptrToInt(ptr1) -% @ptrToInt(ptr2)), @sizeof(T)) //preserve sign

Defined in this way the expected associativity properties hold for n,m:isize or n, m:usize

ptr +% (n +% m) == (ptr +% n) +% m
ptr +% (n -% m) == (ptr +% n) -% m
ptr -% (n +% m) == (ptr -% n) -% m
ptr -% (n -% m) == (ptr -% n) +% m

Since provided no overflow occurs we have for signed integers @as(isize, n + m) == @as(isize, n) + @as(isize,m ) (and similar for smaller unsigned integers with usize) we have for all integer types and provided no overflow occurs

ptr +% (n + m) == (ptr +% n) +% m

(Note that in terms of modular aka wrapping arithmetic, i.e. if we reduce the offset mod 2^{log2(@Bitsize)} and if there existed modular (aka wrapping) integer type msize (#7512) then one could simply define

[*T] : ptr
offset: msize

ptr +% offset == @intToPtr([*]T, @as(msize, @ptrToInt(ptr)) +% offset *% @sizeof(T))

and all the integer casting in the definitions above just amount to @as(msize, ) (i.e. reduction mod 2^{log2(@Bitsize)}).

Arguably, if you want to stress that pointer arithmetic is an intrinsically dangerous operation that you should try to avoid if you can, but that makes sense on the low, close to the hardware level, only having modular aka wrapping pointer arithmetic +% (and -%, and pointer difference -% with values in msize, or in its absence isize) may be the sane way to go.

@ehaas ehaas force-pushed the translate-c-negative-array-index branch from b0c35f5 to 780ed01 Compare April 27, 2021 07:09
@ehaas ehaas marked this pull request as ready for review April 27, 2021 07:10
@ehaas
Copy link
Contributor Author

ehaas commented Apr 27, 2021

The generated code is pretty wild but it works now (on my machine, we'll see what CI says). Hopefully with less undefined behavior.

@ehaas ehaas force-pushed the translate-c-negative-array-index branch 2 times, most recently from 157e5d3 to 33535d5 Compare May 10, 2021 21:17
@Vexu Vexu added the translate-c C to Zig source translation feature (@cImport) label May 31, 2021
A rather complicated workaround for handling signed array subscripts.
Once `[*]T + isize` is allowed, this can be removed.

Fixes ziglang#8556
@andrewrk andrewrk force-pushed the translate-c-negative-array-index branch from 33535d5 to c6f0b24 Compare July 28, 2021 21:45
@andrewrk andrewrk merged commit 8028e46 into ziglang:master Jul 28, 2021
@ehaas ehaas deleted the translate-c-negative-array-index branch July 28, 2021 23:41
ehaas added a commit to ehaas/zig that referenced this pull request Aug 2, 2021
ziglang#8589 introduced correct handling of signed (possibly negative) array access
of pointers. Since unadorned integer literals in C are signed, this resulted
in inefficient generated code when indexing a pointer by a non-negative
integer literal.
Vexu pushed a commit that referenced this pull request Aug 3, 2021
#8589 introduced correct handling of signed (possibly negative) array access
of pointers. Since unadorned integer literals in C are signed, this resulted
in inefficient generated code when indexing a pointer by a non-negative
integer literal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
translate-c C to Zig source translation feature (@cImport)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

translate-c: negative array indices don't work
6 participants