Area proposal: Representation and validity invariants #5
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,152 @@ | ||
# Data structure representation | ||
# Data structure representation and validity requirements | ||
|
||
In general, Rust makes few guarantees about memory layout, unless you | ||
define your structs as `#[repr(rust)]`. But there are some things that | ||
we do guarantee. Let's write about them. | ||
## Introduction | ||
|
||
TODO: | ||
This discussion is meant to focus on three things: | ||
|
||
- Find and link to the various RFCs | ||
- Enumerate things that we *might* in fact guarantee, even for non-C types: | ||
- e.g., `&T` and `Option<&T>` are both pointer sized | ||
- size of `extern fn` etc (at least on some platforms)? | ||
- For which `T` is `None` represented as a "null pointer" etc? | ||
- (Which "niche" optimizations can we rely on) | ||
- What guarantees does Rust make regarding the layout of data structures? | ||
- What guarantees does Rust make regarding ABI compatibility? | ||
- What invariants does the compiler require from the various Rust types? | ||
- the "validity invariant", as defined in [Ralf's blog post][bp] | ||
|
||
NB. The discussion is **not** meant to discuss the "safety invariant" | ||
from [Ralf's blog post][bp], as that can be handled later. | ||
|
||
[bp]: https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html | ||
|
||
### Layout of data structures | ||
|
||
In general, Rust makes few guarantees about the memory layout of your | ||
structures. For example, by default, the compiler has the freedom to | ||
rearrange the field order of your structures for more efficiency (as | ||
of this writing, we try to minimize the overall size of your | ||
structure, but this is the sort of detail that can easily change). For | ||
safe code, of course, any rearrangements "just work" transparently. | ||
|
||
If, however, you need to write unsafe code, you may wish to have a | ||
fixed data structure layout. In that case, there are ways to specify | ||
and control how an individual struct will be laid out -- notably with | ||
`#[repr]` annotations. One purpose of this section, then, is to lay out | ||
what sorts of guarantees we offer when it comes to layout, and also | ||
what effect the various `#[repr]` annotations have. | ||
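As a rough illustration of the difference between the default representation and `#[repr(C)]`, the sketch below declares the same three fields twice and inspects the resulting layout. The struct names `DefaultLayout` and `CLayout` and the helper `offset_of_b` are hypothetical, chosen only for this example. The sizes printed for the default representation are whatever the current compiler happens to pick and should not be relied on.

```rust
use std::mem::{align_of, size_of};

/// Default representation: the compiler is free to reorder these fields.
#[allow(dead_code)]
struct DefaultLayout {
    a: u8,
    b: u32,
    c: u16,
}

/// `#[repr(C)]`: fields stay in declaration order, padded as a C compiler would.
#[allow(dead_code)]
#[repr(C)]
struct CLayout {
    a: u8,
    b: u32,
    c: u16,
}

/// Byte offset of field `b` inside a `CLayout`, computed from addresses.
fn offset_of_b() -> usize {
    let v = CLayout { a: 0, b: 0, c: 0 };
    let base = &v as *const CLayout as usize;
    let field = &v.b as *const u32 as usize;
    field - base
}

fn main() {
    println!(
        "default: size {}, align {}",
        size_of::<DefaultLayout>(),
        align_of::<DefaultLayout>()
    );
    println!(
        "repr(C): size {}, align {}",
        size_of::<CLayout>(),
        align_of::<CLayout>()
    );
    // With `repr(C)`, `b` follows `a` at the first offset that satisfies
    // the alignment of `u32`.
    println!("repr(C) offset of b: {}", offset_of_b());
}
```

With `#[repr(C)]` the offset of `b` is determined by the declaration order and the alignment of `u32`; with the default representation no such statement can be made from the source alone.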
|
||
### ABI compatibility | ||
|
||
When one either calls a foreign function or is called by one, extra | ||
care is needed to ensure that all the ABI details line up. ABI compatibility | ||
Review comment: Clarification question: "ABI" is always about function calls? The term appears in

Review comment: ABI includes data structure layout and other things beyond function calling conventions, yeah. One needs to take care of all these aspects in FFI, but it seems clear that this section is about calling conventions specifically.

Review comment: I specifically meant the details of function calling here, but I guess I would presume that "ABI" in general refers to how structures in the language are mapped to the underlying architecture.
||
is related to data structure layout but -- in some cases -- can add another | ||
layer of complexity. For example, consider a struct with one field, like this one: | ||
|
||
```rust | ||
#[repr(C)] | ||
struct Foo { field: u32 } | ||
``` | ||
|
||
The memory layout of `Foo` is identical to a `u32`. But in many ABIs, | ||
the struct type `Foo` is treated differently at the point of a | ||
function call than a `u32` would be. Eliminating these gaps is the | ||
goal of the `#[repr(transparent)]` annotation introduced in [RFC | ||
1758]. For built-in types, such as `&T` and so forth, it is important | ||
for us to specify how they are treated at the point of a function | ||
call. | ||
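A minimal sketch of what `#[repr(transparent)]` is meant to provide, assuming a hypothetical newtype `Meters` around `u32` and hypothetical functions `double` and `double_via_u32`. It checks that the wrapper has the size and alignment of its field, and then calls an `extern "C"` function that takes the wrapper through a function pointer typed in terms of `u32`. The call relies on the wrapper being passed exactly like a `u32`, which is the property the annotation is intended to give.

```rust
use std::mem::{align_of, size_of, transmute};

/// Hypothetical newtype; `repr(transparent)` asks for the layout and
/// calling-convention treatment of its single field.
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Meters(u32);

/// An `extern "C"` function that takes and returns the wrapper type.
extern "C" fn double(m: Meters) -> Meters {
    Meters(m.0 * 2)
}

/// Calls `double` through a function pointer typed in terms of `u32`.
/// This is only sensible because `Meters` is `repr(transparent)`.
fn double_via_u32(x: u32) -> u32 {
    let f = double as extern "C" fn(Meters) -> Meters;
    let g = unsafe {
        transmute::<extern "C" fn(Meters) -> Meters, extern "C" fn(u32) -> u32>(f)
    };
    g(x)
}

fn main() {
    assert_eq!(size_of::<Meters>(), size_of::<u32>());
    assert_eq!(align_of::<Meters>(), align_of::<u32>());
    println!("double_via_u32(21) = {}", double_via_u32(21));
}
```

Without the annotation (for example with only `#[repr(C)]`), the size and alignment checks would still pass, but the call through the reinterpreted function pointer would not be justified on ABIs that pass structs differently from scalars.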
||
|
||
### Validity invariant | ||
|
||
The "validity invariant" for each type defines what must hold whenever | ||
a value of this type is considered to be initialized. The compiler expects | ||
the validity invariant to hold **at all times** and is thus allowed to use | ||
these invariants to (e.g.) affect the layout of data structures or do other | ||
optimizations. | ||
Review comment: I find this paragraph too confusing. It states that the validity invariant must hold at all times, but that it only defines what must hold for initialized values, so I am left wondering what happens with uninitialized values. Do they exist? Is there a distinction between the storage of a value, and the value itself (e.g. uninitialized memory of type

It might help to add one sentence stating something about uninitialized values / memory that clarifies things. But I don't know what that might look like.

Review comment: Perhaps, "at all times when the compiler considers a value to be initialized"? This is probably the most precise statement, though perhaps not the most intuitive. It is also the phrasing that @RalfJung used, if I recall. I do want to improve the language, but at the same time, I am not sure how much detail I want to go into in this document. I guess it's good to invest some effort in defining our terms carefully, however.

One of the subtle bits -- and I'm not sure how best to phrase this -- is that one of the questions we want to discuss is how to think about loads from uninitialized memory (e.g., accesses to union fields etc). Most of the time, this is UB, but there are definitely use cases for being able to load from uninitialized memory if you treat the result as an integer (or perhaps any other scalar type where all bit patterns are valid). I think the key point there is that the point where the compiler considers the memory initialized is exactly the point of load.

Review comment: That sounds fine. I think one thing I've been missing from the discussion is the difference between typed memory, and whether objects of the type actually live in that memory. Validity could then be about the layout of the objects in memory. This layout depends on which values these objects can represent. Whether we require all typed memory accessible from safe Rust to always contain objects of its type, and whether we allow unsafe Rust to access memory that contains no objects (e.g. uninitialized), would relate to safety, but validity wouldn't talk about that.

Review comment: One of the harder questions that I don't think has been sufficiently addressed is what form invalidity should take. Almost all of the discussion has been talking about UB, but there are lots of other notions of "incorrect":

The wording of saying that the validity invariant should hold at all times or even at all times when the compiler considers a value to be initialized implies UB, but I generally think this is the wrong choice and we should avoid it if we can get away with it. In my (admittedly limited) experience, actual UB makes things harder to optimize, not easier, since it makes intuitively pure operations have side effects, which makes operations very hard to reorder.

Review comment: Oh, and it also occurs to me that there are multiple notions of UB, corresponding to the question of whether the

`println!("Before UB!"); invoke_ub();`

That is, does UB invalidate the entire program since the compiler is allowed to assume it never happens, or does it just invalidate the program from the point it happens onward?

Review comment: That's certainly true of C or LLVM's undefined behaviour, but it isn't the only way incorrect behaviour could be specified. I suppose I shouldn't call it UB, though, since that is confusing (although I think this new notion actually fits the description of undefined behaviour better). Since we are in the process of specifying what Rust is, we get to choose whatever semantics we want (within the bounds of what we can reasonably compile to LLVM). Concretely, the two notions are:

Review comment: What would be the value of doing that and what would that cost? Thinking of what would that cost, you showed this snippet:

Imagine

However, if the compiler cannot assume that UB does not happen, then suddenly it cannot do that optimization, and potentially it cannot really reorder any code, since that could alter when undefined behavior happens. That's a pretty big cost of optimization potential.

Thinking of what does this buy us, I still don't see that it buys us much. If we don't reorder the load, after the print undefined behavior still happens. For all we know that could clear the stdout buffer before it's flushed (if the print doesn't flush it). That is, even if we guarantee the print to execute correctly before UB, then the UB afterwards could still allow the print not to show.

As I understand, the point of the guidelines or a spec is to define the behavior of Rust programs, and to state which programs are Rust and which programs aren't Rust. Programs with undefined behavior aren't Rust (they are illegal Rust programs). So trying to define the behavior of programs with undefined behavior doesn't make much sense to me, since it appears equivalent to trying to define the behavior of all programs that aren't really Rust. Even if that was interesting, that wouldn't belong in the Rust spec - but in the spec for non-Rust programs. Does that make sense?

Review comment: Sorry if I wasn't clear: I didn't mean to imply that I think that we should make this specification the only notion of UB in all of Rust, nor even that we necessarily should use this at all. As you mention, there are very real costs, and in many cases (including dereferences of invalid pointers) I don't even see a way to correctly lower to LLVM, since LLVM uses C's stricter notion of UB. I simply meant to say that

That said, obviously part of the reason I bring it up as an option is because I think it does have merit in some cases (though out of the many options, my general favourite is some form of

I disagree. Anything written in Rust is a Rust program, regardless of how legal it is. Now, if a Rust program unconditionally executes undefined behaviour, then it is a completely and utterly meaningless Rust program, but it is still Rust. Throwing up our hands and saying "doing this destroys your program" is a completely reasonable response to certain things, but there is nothing fundamentally different about that choice from saying "doing this crashes your program". They are both specifications, just one very very loose and the other not. In particular, I don't understand your point of view when it comes to programs that are conditionally UB. For example:

`if user_input() == 5 { invoke_ub(); }`

This is certainly a broken program, but that doesn't mean it isn't Rust, and it doesn't mean we don't give it semantics. In particular, if no user ever inputs

Review comment: This is true, but UB is in many ways a "safe" choice because it's the strongest prohibition and can be weakened later. So if we declare everything either "UB" or "totally allowed" in the UC we can revisit it later and tweak that decision, and until we do that we can be sure we haven't painted ourselves into a corner that's not implementable or has negative repercussions for optimizations we want to enable. Also consider that weaker prohibitions are

We can usefully talk about the many possible well-defined executions of this program, but there are possible executions which are undefined and so, without wading into the philosophical discussions, it at least isn't a correct program.

Review comment: Optimizing away the

For example, if your program is run with a seccomp filter on stdout that makes
||
|
||
Therefore, the validity invariant must **at minimum** justify all the | ||
layout optimizations that the compiler does. We may want a stronger | ||
invariant, however, so as to leave room for future optimization. | ||
|
||
As an example, a value of `&T` type can never be null -- therefore, | ||
`Option<&T>` can use null to represent `None`. | ||
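The niche described above can be observed directly. The following is a small sketch, not a specification: the helper name `option_ref_bits` is hypothetical, and the `transmute` assumes the null-pointer representation of `None` that this section describes.

```rust
use std::mem::{size_of, transmute};

/// Reinterprets an `Option<&u8>` as a raw pointer without changing its bits.
/// Relies on `None` being represented as the null pointer.
fn option_ref_bits(opt: Option<&u8>) -> *const u8 {
    unsafe { transmute::<Option<&u8>, *const u8>(opt) }
}

fn main() {
    // The `Option` adds no space: the null bit pattern is used for `None`.
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());

    let x = 7u8;
    // `Some(&x)` has the bits of the reference itself.
    assert_eq!(option_ref_bits(Some(&x)), &x as *const u8);
    // `None` has the bits of a null pointer.
    assert!(option_ref_bits(None).is_null());
    println!("Option<&u8> uses the null niche");
}
```

This optimization is sound only because the validity invariant of `&T` excludes null; if a null reference could ever exist, it would be indistinguishable from `None`.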
|
||
## Goals | ||
|
||
- Define what we guarantee about the layout of various types | ||
and the effect of `#[repr]` annotations. | ||
- Define the **validity requirements** of various types. These are the | ||
requirements that must hold at all times when the compiler considers | ||
a value to be initialized. | ||
- Also examine when/how we could dynamically check these requirements. | ||
- Uncover the sorts of constraints that we may wish to satisfy in the | ||
future. | ||
|
||
## Some interesting examples and questions | ||
|
||
- `&T` where `T: Sized` | ||
- This is **guaranteed** to be a non-null pointer | ||
- `Option<&T>` where `T: Sized` | ||
- This is **guaranteed** to be a nullable pointer | ||
- `Option<extern "C" fn()>` | ||
- `usize` | ||
- Platform dependent size, but guaranteed to be able to store a pointer? | ||
- Also an array length? | ||
Review comment: @gnzlbg AFAIK while the constant expression

Review comment: @rkruppe indeed, the zero is null is only so for constant expressions.

Review comment: That's C++, though, right? For Rust we can just say that a pointer is NULL iff its bits are all 0...

Review comment: Yes, we can say that, and we would exclude some (possibly hypothetical) targets/implementations by doing that. That's what needs to be discussed.
||
- Uninitialized bits -- for which types are uninitialized bits valid? | ||
- If you have `struct A { .. }` and `struct B { .. }` with no | ||
`#[repr]` annotations, and they have the same field types, can we | ||
say that they will have the same layout? | ||
- or do we have the freedom to rearrange the fields of `A` but not | ||
`B`, e.g. based on PGO results | ||
||
- Rust currently says that no single value may be larger than `isize::MAX` bytes | ||
- is this good? can it be changed? does it matter *here* anyway? | ||
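Some of the questions in this list can be probed on a given target with a few size checks. The sketch below uses hypothetical helper names and only reports what holds on the platform it is compiled for; it does not settle what Rust should guarantee. The `isize::MAX` bound is checked through `std::alloc::Layout`, which refuses to describe an allocation larger than that.

```rust
use std::alloc::Layout;
use std::mem::size_of;

/// True if `usize` and a thin raw pointer have the same size on this target.
fn usize_is_pointer_sized() -> bool {
    size_of::<usize>() == size_of::<*const u8>()
}

/// True if wrapping an `extern "C"` function pointer in `Option` costs no space.
fn option_fn_ptr_is_free() -> bool {
    size_of::<Option<extern "C" fn()>>() == size_of::<extern "C" fn()>()
}

/// True if the allocator API accepts a layout of `size` bytes with alignment 1.
fn layout_accepts(size: usize) -> bool {
    Layout::from_size_align(size, 1).is_ok()
}

fn main() {
    println!("usize is pointer sized: {}", usize_is_pointer_sized());
    println!("Option<extern \"C\" fn()> is free: {}", option_fn_ptr_is_free());
    println!("isize::MAX bytes accepted: {}", layout_accepts(isize::MAX as usize));
    println!(
        "isize::MAX + 1 bytes accepted: {}",
        layout_accepts(isize::MAX as usize + 1)
    );
}
```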
|
||
## Active threads | ||
|
||
To start, we will create threads for each major category of types | ||
(with a few suggested focus points): | ||
|
||
- Integers and floating points | ||
||
- What about uninitialized values? | ||
- Booleans | ||
- Prior discussions ([#46156][], [#46176][]) documented bool as a single | ||
byte that is either 0 or 1. | ||
- Enums | ||
- See dedicated thread about "niches" and `Option`-style layout optimization | ||
below. | ||
- Define: C-like enum | ||
- Can a C-like enum ever have an invalid discriminant? (Presumably not) | ||
- Empty enums and the `!` type | ||
- [RFC 2195][] defined the layout of `#[repr(C)]` enums with payloads. | ||
- [RFC 2363][] offers a proposal to permit specifying discriminants. | ||
- Structs | ||
- Do we ever say *anything* about how a `#[repr(Rust)]` struct is laid out | ||
(and/or treated by the ABI)? | ||
- e.g., what about different structs with same definition | ||
- across executions of the same program? | ||
- Tuples | ||
Review comment: A somewhat common request is layout compatibility between homogeneous tuples (i.e.,

Review comment: And vector types. A couple of them are already stable (e.g. `__m128` and friends).
||
- Are these effectively anonymous structs? | ||
- Unions | ||
- Can we ever say anything about the initialized contents of a union? | ||
- Is `#[repr(C)]` meaningful on a union? | ||
||
- Fn pointers (`fn()`, `extern "C" fn()`) | ||
Review comment: When is transmuting from one

Review comment: added
||
- References `&T` and `&mut T` | ||
- Out of scope: aliasing rules | ||
- We currently tell LLVM they are aligned and dereferenceable, have to justify that | ||
- Safe code may use them also | ||
- When using the C ABI, these map to the C pointer types, presumably | ||
- Raw pointers | ||
- Effectively same as integers? | ||
- Representation knobs: | ||
- Custom alignment ([RFC 1358]) | ||
- Packed ([RFC 1240] talks about some safety issues) | ||
- ... what else? | ||
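To make the boolean and C-like enum points concrete: since a `bool` is documented as a single byte that is either 0 or 1, and a C-like enum presumably may never hold an invalid discriminant, unsafe code that receives raw bytes has to check them before producing such a value. The sketch below shows one way to do that without any `unsafe`; the enum `Color` and the functions `bool_from_byte` and `color_from_byte` are hypothetical names used only for illustration.

```rust
/// Hypothetical C-like enum with an explicit one-byte representation.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Color {
    Red = 0,
    Green = 1,
    Blue = 2,
}

/// Converts a raw byte to a `bool`, rejecting anything other than 0 or 1.
fn bool_from_byte(raw: u8) -> Option<bool> {
    match raw {
        0 => Some(false),
        1 => Some(true),
        _ => None,
    }
}

/// Converts a raw byte to a `Color`, rejecting unknown discriminants.
fn color_from_byte(raw: u8) -> Option<Color> {
    match raw {
        0 => Some(Color::Red),
        1 => Some(Color::Green),
        2 => Some(Color::Blue),
        _ => None,
    }
}

fn main() {
    // Going from the typed value to its byte is always fine.
    assert_eq!(true as u8, 1);
    assert_eq!(Color::Green as u8, 1);

    // Going the other way needs a check, because most bytes are not valid.
    println!("{:?}", bool_from_byte(2));
    println!("{:?}", color_from_byte(2));
    println!("{:?}", color_from_byte(3));
}
```

A `transmute` from an unchecked byte would skip these checks and could create a value that violates the validity requirements under discussion.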
|
||
We will also create categories for the following specific areas: | ||
|
||
- Niches: Optimizing `Option`-like enums | ||
- Uninitialized memory: when/where are uninitialized values permitted, if ever? | ||
- ... what else? | ||
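These two areas interact. As a sketch using the standard library's `MaybeUninit` wrapper (an existing tool for handling uninitialized memory, not something proposed by this document), the example below initializes an array element by element and only then asserts that it is initialized. The function name `init_in_place` is hypothetical. The size checks show that `MaybeUninit<bool>` offers no niche, since any byte may be stored in it.

```rust
use std::mem::{size_of, MaybeUninit};

/// Fills an uninitialized array in place and returns it once every
/// element has been written.
fn init_in_place() -> [u32; 4] {
    let mut buf: MaybeUninit<[u32; 4]> = MaybeUninit::uninit();
    let first = buf.as_mut_ptr() as *mut u32;
    for i in 0..4 {
        // Writing through a raw pointer never reads the old, uninitialized value.
        unsafe { first.add(i).write(i as u32 * 10) };
    }
    // All four elements are initialized, so the validity requirement holds.
    unsafe { buf.assume_init() }
}

fn main() {
    println!("{:?}", init_in_place());

    // `bool` has spare bit patterns, so `Option<bool>` fits in one byte.
    assert_eq!(size_of::<Option<bool>>(), 1);
    // `MaybeUninit<bool>` may hold any byte, so `Option` needs a separate tag.
    assert_eq!(size_of::<Option<MaybeUninit<bool>>>(), 2);
}
```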
|
||
|
||
[#46156]: https://github.com/rust-lang/rust/pull/46156 | ||
[#46176]: https://github.com/rust-lang/rust/pull/46176 | ||
[RFC 2363]: https://github.com/rust-lang/rfcs/pull/2363 | ||
[RFC 2195]: https://rust-lang.github.io/rfcs/2195-really-tagged-unions.html | ||
[RFC 1358]: https://rust-lang.github.io/rfcs/1358-repr-align.html | ||
[RFC 1240]: https://rust-lang.github.io/rfcs/1240-repr-packed-unsafe-ref.html | ||
[RFC 1758]: https://rust-lang.github.io/rfcs/1758-repr-transparent.html |
Review comment: The list below has three items.
Review comment: fixed :)