
When are things guaranteed to have unique addresses? #206

Open · comex opened this issue Sep 20, 2019 · 21 comments
Labels
A-pointer-equality Topic: Related to questions of pointer equality/identity C-open-question Category: An open question that we should revisit S-pending-design Status: Resolving this issue requires addressing some open design questions

Comments

@comex commented Sep 20, 2019

References to constants are not guaranteed to have unique addresses:

assert!(&2 as *const i32 != &2 as *const i32); // fails

Since consts are just aliases, the same holds for those:

const A: i32 = 2;
const B: i32 = 2;
assert!(&A as *const i32 != &B as *const i32); // fails

What about statics? static variables with interior mutability (and static mut variables) obviously must have unique addresses, but what about ones without?

static A: i32 = 2;
static B: i32 = 2;
assert!(&A as *const i32 != &B as *const i32); // passes

And local variables? (Assuming that both variables are alive at the point of comparison, since obviously variables that have fallen out of scope can have their addresses reused.)

let a = 2;
let b = 2;
assert!(&a as *const i32 != &b as *const i32); // passes

Currently, rustc seems to produce unique addresses in both cases. But @gnzlbg is under the impression that multiple local variables are not guaranteed to have distinct addresses.

Address uniqueness can be a useful property, e.g. if you want a unique 'sentinel' value to assign to a pointer variable. On the other hand, I'd say Rust usually avoids giving much significance to something being a variable as opposed to an expression.
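As a concrete sketch of the sentinel use case (the names here are illustrative, not from any particular crate): the address of a private static can serve as a value that is known not to collide with any real data pointer, assuming statics do get unique addresses, which is exactly what this issue asks about.

```rust
// Illustrative sketch of the 'sentinel' pattern; relies on the (debated here)
// assumption that a static's address is distinct from every other live allocation.
static SENTINEL: u8 = 0;

fn sentinel() -> *const u8 {
    &SENTINEL
}

fn main() {
    let data: u8 = 42;
    let mut slot: *const u8 = sentinel(); // "no value yet"
    assert!(slot == sentinel());
    slot = &data; // now points at real data
    assert!(slot != sentinel());
    println!("ok");
}
```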

A related issue is #15, which is about whether the address of something can change over time.

Compared to C and C++

In C, rvalues are not implicitly bound to addresses unless assigned to a variable (or a C99 compound literal). C appears to guarantee that distinct variables have distinct addresses.

In C++, rvalues can be implicitly bound to const references, which gives them an address: this is "temporary materialization" and creates a "temporary object". Like C's, the C++ spec guarantees that distinct "objects" "compare unequal", so I think this assertion is guaranteed to pass (not sure though):

#include <assert.h>
void foo(const int &a, const int &b) {
    assert(&a != &b);
}

int main() {
    foo(2, 2);
}

In practice, this means that the compiler always stores a copy of the constant on the stack and takes the address of that, rather than directly referencing a static allocation.

@gnzlbg (Contributor) commented Sep 21, 2019

But @gnzlbg is under the impression that multiple local variables are not guaranteed to have distinct addresses.

FWIW I'm under that impression because I can't find such a guarantee being written down anywhere (I looked in the book, the reference, and the nomicon). In this case,

let mut a = 2;
let mut b = 2;
assert_ne!(&mut a as *mut _ as usize, &mut b as *mut _ as usize); // PASS

we do guarantee that the assert never fails because we guarantee that two aliasing &mut T cannot be created in safe Rust code. However, in this case:

let a: Freeze = val;
let b: Freeze = val;
assert_ne!(&a as *const _ as usize, &b as *const _ as usize);

I don't think it would be unsound for a and b to have the same observable addresses.

If we allow that, then as long as an allocation cannot be modified (e.g. because it is immutable and does not contain an UnsafeCell), all allocations with the same value can always be merged, regardless of whether the program might try to observe their addresses.

I have no clue whether this is worth doing, but in general I find code that relies on the relative addresses of let bindings on the stack to be brittle anyway.

Address uniqueness can be a useful property, e.g. if you want a unique 'sentinel' value to assign to a pointer variable

Just keep in mind that the address is only unique while the let binding is alive, e.g.,

pub fn foo() -> bool { baz(bar()) }
pub fn bar() -> *const i32 {
    let x = 42;
    &x as *const _ // this address is only unique until the end of bar!
}
pub fn baz(p: *const i32) -> bool {
    let x = 42;
    &x as *const i32 as usize != p as usize
    // ^ this can be false, since the two addresses are not guaranteed to differ
}

References to constants are not guaranteed to have unique addresses:

Note that there are no references to constants: &CONST is just sugar for let x = CONST; &x, and &mut CONST for let mut x = CONST; &mut x, so the behavior that you are observing for those is the same as the corresponding let bindings.
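This desugaring can be observed directly. The sketch below only shows what current rustc happens to do (both temporaries may be promoted to the same static allocation); none of this is a stated guarantee.

```rust
// &C behaves like taking the address of a fresh temporary, so two uses of it
// may or may not compare equal. With current rustc, promotion often makes them
// equal, but that is an observation, not a guarantee.
const C: i32 = 2;

fn main() {
    let p = &C as *const i32;
    let q = &C as *const i32;
    // Unspecified either way:
    println!("same address: {}", p == q);
}
```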

@gnzlbg (Contributor) commented Sep 21, 2019

Compared to C and C++

One main difference from C and C++ is that they do not have zero-sized types (although C++ has EBO / [[no_unique_address]]). In Rust, [ZST; N] already creates N objects that all have the same address. So this (playground):

#[derive(Copy, Clone)] struct ZST;
fn main() { 
    let mut xs = [ZST; 2];
    assert_eq!(
        &mut xs[0] as *mut _ as usize, 
        &mut xs[1] as *mut _ as usize
    );
}

passes. I don't see why this couldn't happen for multiple let bindings on the stack to ZSTs.
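The let-binding variant can be observed too. The sketch below deliberately does not assert either outcome, because whether two ZST locals share an address is exactly what this issue leaves open:

```rust
// Observing (not asserting) whether two ZST locals share an address.
// The result is unspecified either way; don't rely on it.
struct Zst;

fn main() {
    let a = Zst;
    let b = Zst;
    let same = (&a as *const Zst) == (&b as *const Zst);
    println!("same address: {}", same); // may print true or false
}
```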

@RalfJung (Member) commented Oct 9, 2019

The entire purpose of static is for them to be items with a fixed stable address in memory, so I think that's pretty much a guarantee. By this I mean that every static is its own disjoint "allocated object" with the size given by its type. That means that taking the address of the same static twice during execution will give the same address. However, distinct statics can still have the same address if one of them is a ZST.

For let, I would also argue that these refer to distinct stack slots, so here:

let a: Freeze = val;
let b: Freeze = val;
assert_ne!(&a as *const _ as usize, &b as *const _ as usize);

we guarantee that addresses are different if the type has a size of at least 1. I find it hard to imagine a semantics that lets us overlap their storage here.

let-bound variables are also separate "allocated objects", but what makes this tricky is their lifetime (as in the time when they get allocated and freed, which has little to do with lifetimes in Rust's type system). Separate allocated objects are only guaranteed to have distinct addresses if their size is non-zero and their lifetimes overlap. let-bound variables might be live for shorter than the duration of the function call, and moreover a let-bound variable might switch between live and non-live any number of times within a function call, becoming a new allocated object (with a possibly distinct address) each time. This is hard to specify in the surface language (and we might not want to commit to all the details), but we can be fairly precise at the MIR level: StorageLive allocates the object for a local, StorageDead deletes it, and we might get a StorageRefresh as well, which semantically is just sugar for StorageDead; StorageLive, and thus means the allocated object can move to a new location.

In Rust, [ZST; N] already creates N objects all having the same address.

Operationally, [ZST; N] really is just a NOP, so I wouldn't say that any objects are being created here. Rust doesn't really have a notion of "object" other than "allocated object", and let x = [ZST; N] just creates one allocated object of size 0 (no matter the N).

@Diggsey commented Oct 10, 2019

Address uniqueness can be a useful property, e.g. if you want a unique 'sentinel' value to assign to a pointer variable. On the other hand, I'd say Rust usually avoids giving much significance to something being a variable as opposed to an expression.

Are there any other uses of address uniqueness? Maybe instead of coming up with complicated rules for when addresses are unique and then being limited by those rules for backwards compatibility, we could have an attribute or something to opt-in to a variable having a unique address?

@RalfJung (Member)

I think the rules will only become even more complicated if we try to relax them. Remember, Rust is not specified axiomatically by saying "these properties hold for all program executions"; there is an Abstract Machine with an operational specification -- something you can put into an interpreter -- that explains all Rust behavior. So you'd have to propose some mechanism e.g. in Miri to actually observe overlapping addresses for such variables.

@RalfJung RalfJung added C-open-question Category: An open question that we should revisit A-memory Topic: Related to memory accesses labels Oct 10, 2019
@JakobDegen (Contributor) commented Mar 28, 2022

I was thinking about this the other day, and have some thoughts. For now, I am going to assume two things: 1) local variables have stable addresses (not that the alternative might not be just as interesting), and 2) a strict provenance model like the one @Gankra proposes. If we allow some int-to-pointer casts, some of these options may fall away.

First things first: I believe there is absolutely no need for distinct allocations to have distinct addresses - we have provenance to disambiguate. As I understand SB, the "provenance" of a pointer is an integer that identifies an item in the borrow stacks, and these integers never repeat. Consequently, there should be no problem saying that "when a pointer is dereferenced, we search all the bytes that have the same address as the pointer, and see if there is an item in any of the borrow stacks that makes the access legal. There can be at most one, since no pointer can have provenance to more than one allocation." This might feel a little surprising, but I don't think it's as bad as it initially sounds; it might even be a way to drive home the "memory is not flat" point. (This also plays nicely with the mental model that Gankra proposed for memory, where it's a two-dimensional grid of address x provenance.)

Now the probably more difficult question: What do we want to guarantee? I am for now only going to think about stack local variables; there might be interesting (different) arguments for other categories of allocations. I see at least a few possibilities, but am completely undecided myself.

Because I'll be talking about some optimizations, I'll need to differentiate between "live range in the abstract machine" and "live range as reported by compiler analyses." I'll refer to the first as "scope" and the second as "liveness."

We do not guarantee that simultaneously in scope locals have distinct addresses

This has the benefit of enabling optimizations. The stack slot in this code cannot be re-used:

let mut x = 5;
foo(&x);
x = 10;
let y = 5;
foo(&y);

It would be possible and actually fairly easy for a Rust compiler to see that x is dead when y is created; the x = 10; invalidates any other pointers to it. However, foo may compute the addresses of x and y, and so they can't overlap (and the pointers are valid to be dereferenced, so we can't lie to foo either). Interestingly though, this optimization does not require the assignment to x. It would also be legal to use a single stack slot in this case:

let x = 5;
let y = 5;
foo(&x, &y);

That is simultaneously more powerful but also potentially more surprising. I'm not sure how much benefit the above two optimizations give. However, I could see the following optimization being potentially more useful:

let mut x = input();
foo(&x);
x += 10;
let y = x;
bar(&y);

here, the optimization would not be to re-use a stack slot (which has relatively small benefits), but to be able to merge x and y at the MIR level entirely. This avoids a copy and also enables future optimizations in a very significant way.

We do guarantee that simultaneously in scope locals have distinct addresses

The main benefit of this is to disable the potentially surprising optimizations above. I had asked about use cases for such a guarantee on discord (besides not having to go "wtf is the compiler doing"), and something like HashMap<*const i32, T> came up. I guess I could see that being a useful type, but it's fairly limited - keep in mind that the pointers have to be in different allocations for the question in this issue to matter.
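The HashMap<*const i32, T> use case mentioned above might look like the following sketch (the variable names are illustrative). It keys a map by pointer identity, which is only reliable if simultaneously live allocations are guaranteed distinct addresses:

```rust
// Sketch: keying a map by pointer identity. Raw pointers implement Hash + Eq,
// so this compiles; whether it is *correct* depends on the guarantee debated
// in this issue. With current rustc, live locals do get distinct addresses.
use std::collections::HashMap;

fn main() {
    let a: i32 = 1;
    let b: i32 = 1; // same value as `a`, different variable
    let mut labels: HashMap<*const i32, &str> = HashMap::new();
    labels.insert(&a as *const i32, "a");
    labels.insert(&b as *const i32, "b");
    // If &a and &b could share an address, the second insert would have
    // overwritten the first and the map would hold one entry, not two.
    assert_eq!(labels.len(), 2);
    println!("ok");
}
```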

This additionally has the downside of making MIR storage markers be statements that have significant semantics. In other words,

StorageLive(_3);
StorageDead(_4);

could not be freely re-ordered. This is related to and discussed in rust-lang/rust#68622 .

Some alternative?

We could try and define the guarantees around here in terms of some analysis or other conditions. I've talked to Ralf enough that my instinctive reaction to that is now also "that's not an operational semantics," but I actually think the need for a real operational semantics might be reduced here - the values of the addresses are implementation defined anyway. @moulins had suggested the following on discord:

For two pointers p and q, if:

  • p and q have provenance for one byte, and
  • p and q have a valid tag in their borrow stack,

then either:

  • p as usize == q as usize and p and q alias, or
  • p as usize != q as usize and p and q don't alias.

This would maybe allow some of the optimizations, but I have some concerns about this definition; at least as I understand it, there's an implicit requirement here of "two pointers that exist at the same time." But it's not clear to me how we should define this concept without typed memory - pointers only exist as values temporarily, most of the time there are just pointer bytes in memory that do not necessarily correspond to an actual value.

In any case, there might be some idea here that I haven't thought of.

@scottmcm (Member)

The entire purpose of static is for them to be items with a fixed stable address in memory, so I think that's pretty much a guarantee.

Is there anything that would keep us from merging impl Freeze statics? Certainly if they have unsafe cells then we shouldn't merge them, but for something like [u32; 4] it seems like we could reasonably merge things -- even have a different static be in the middle of another, so long as it's the right value.

(But it also seems fine to say "well just use const if you want that".)

@Lokathor (Contributor)

That does sound fine for any read-only static.

@thomcc (Member) commented Mar 29, 2022

It seems better to require const for this. I can think of cases in C++ where a static is used just to generate a value so that its pointer can be used as an identity. I think it would be confusing to have this require UnsafeCell even in the case where it's never written.
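The "address as identity" idiom described here, transposed to Rust, might look like the sketch below (the `TAG_*` names are hypothetical). The statics' values are irrelevant; only their addresses matter, so merging identical read-only statics would break the pattern:

```rust
// Sketch of the pointer-identity idiom: if identical read-only statics were
// merged, TAG_A and TAG_B would collapse into one token and `which` could no
// longer tell them apart. Current rustc keeps them distinct.
static TAG_A: u8 = 0;
static TAG_B: u8 = 0;

fn which(tag: *const u8) -> &'static str {
    if std::ptr::eq(tag, &TAG_A) {
        "A"
    } else if std::ptr::eq(tag, &TAG_B) {
        "B"
    } else {
        "unknown"
    }
}

fn main() {
    assert_eq!(which(&TAG_A), "A");
    assert_eq!(which(&TAG_B), "B");
    println!("ok");
}
```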

That said, I don't feel that strongly here... but I suspect in practice this would be pretty low value of an optimization TBH.

@comex (Author) commented Mar 29, 2022

The defmt crate currently relies on identical statics not being combined, though its statics are unusual in having both link_section and export_name attributes.

@RalfJung (Member) commented Apr 1, 2022

We do guarantee that simultaneously in scope locals have distinct addresses

What I expected could be phrased as "simultaneously live locals have distinct addresses". That would still allow your first and third optimizations, but not the second.

@JakobDegen (Contributor)

Well, maybe, but that would require a definition of liveness on the AM, which seems non-trivial

@digama0 commented Apr 1, 2022

It's pretty trivial to define liveness if the input MIR has StorageLive and StorageDead annotations. But it does mean that liveness analysis needs to be done in an early pass while generating MIR, because shrinking live ranges by moving these annotations around is not an allowable optimization.

@JakobDegen (Contributor)

Well, storage statements early on are no different from defining it based on scope

@RalfJung (Member) commented Apr 1, 2022

Well, maybe, but that would require a definition of liveness on the AM, which seems non-trivial

Ah yes, I should have specified -- as @digama0 said, I imagine on the MIR level we have explicit liveness annotations. This basically moves the liveness analysis to Rust → MIR lowering and out of the operational semantics. Maybe that's cheating, maybe that's elegant -- I have not decided yet. ;)

Well, storage statements early on are no different from defining it based on scope

Yes it is different; your first example is easy to support with this approach. The third one is tricky since assigning from x to y means they have to be briefly live at the same time, not sure if we could make that work.

@JakobDegen (Contributor)

Oh, I see, you're suggesting doing a MaybeLiveLocals (or similar) at the time of MIR building (or equivalent) and defining our constraints based on that. That seems... arbitrary to me. The fact that x is not live until the end of the scope in the first example is imo an optimization. It feels weird to permit this specific optimization but not others

@RalfJung (Member) commented Apr 1, 2022

It's not quite arbitrary IMO, it is sort of the minimal guarantee we can make if we want to ensure that different variables that might get used at the same time will never be on the same address.

@JakobDegen (Contributor)

Maybe, but even then. What would the actual spec say?

@RalfJung (Member) commented Apr 1, 2022

It would describe the Rust → MIR lowering, including the algorithm that adds the liveness statements.

@JakobDegen (Contributor)

So, while this was not the original intent of the question, one thing that we probably have to do is figure out how this:

fn nop(_a: &mut i32) {}

fn bad(limit: usize) {
    if limit == 0 {
        return;
    }
    let mut x: i32 = 0;
    let r = &mut x;
    bad(limit - 1);
    nop(r);
    return;
}

fn main() {
    bad(usize::MAX);
    println!("unreachable");
}

printing "unreachable" when compiled at -Copt-level=3 can be sound. This is very clearly allocating more than usize::MAX bytes simultaneously, so what is going on in the AM when this runs?

@RalfJung (Member)

I think that's a separate subject -- removing allocations (whether on the stack or on the heap) is, eh, tricky to justify in the best case if you have finite memory. So, this is the same as LLVM optimizing the following to print "unreachable":

// (needs <stdio.h>, <stdlib.h>, <stdint.h>)
int *x = malloc(SIZE_MAX);
int *y = malloc(SIZE_MAX);
if (x && y) {
  printf("This is unreachable. I could deref a NULL pointer and this program would still be fine.");
}

So, while this was not the original intent of the question

That should have been a sign to create a new issue instead. :) See #328.

@RalfJung RalfJung added the S-pending-design Status: Resolving this issue requires addressing some open design questions label Aug 29, 2023