Validity of integers and floating point #71
Comments
Arguments for allowing uninitialized bits:
Arguments for disallowing uninitialized bits:
|
Here's an example of a piece of code that relies on uninitialized integers (and raw pointers, and |
An interesting use of uninitialized bits in integers is in crossbeam's But this means that with a type like Cc @stjepang |
If we were to allow uninitialized bits, it might be reasonable to say, at least initially, that any operation that's not a read or a write is UB. That would allow us to define those operations later on. For example, that would mean that adding an initialized integer to an uninitialized one is UB, but later on, we could define that to be something else, like the result of that operation being uninitialized. That is, we wouldn't need to answer these hard questions right now. I wonder whether it is possible to allow uninitialized bits later on in a backwards-compatible way, or whether we have to make this decision right now? I think we have to make this decision right now because, e.g., if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption. In the same way, if we allow uninitialized bits, we can't later disallow them without breaking code that uses them.
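To make the two options in this comment concrete, here is a minimal sketch that models an abstract-machine byte in ordinary safe Rust. The names `AbstractU8`, `add_strict` and `add_propagating` are hypothetical and exist only for illustration; they are not part of Rust, Miri, or any proposal in this thread, and UB is modelled as an `Err` value.

```rust
/// Hypothetical model of a byte as the abstract machine might see it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AbstractU8 {
    Uninit,
    Init(u8),
}

/// Option 1: any arithmetic on an uninitialized operand is UB (modelled as Err).
fn add_strict(a: AbstractU8, b: AbstractU8) -> Result<AbstractU8, &'static str> {
    match (a, b) {
        (AbstractU8::Init(x), AbstractU8::Init(y)) => Ok(AbstractU8::Init(x.wrapping_add(y))),
        _ => Err("UB: arithmetic on an uninitialized operand"),
    }
}

/// Option 2: a possible later relaxation where the result is simply uninitialized.
fn add_propagating(a: AbstractU8, b: AbstractU8) -> AbstractU8 {
    match (a, b) {
        (AbstractU8::Init(x), AbstractU8::Init(y)) => AbstractU8::Init(x.wrapping_add(y)),
        _ => AbstractU8::Uninit,
    }
}
```

Every program that is defined under `add_strict` behaves the same under `add_propagating`, which is the sense in which starting with "everything but read and write is UB" leaves room to define more later.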
Agreed.
I don't think it can flag every occurrence, for example, |
I don't see how. If a type's validity invariant rules out certain initialized bit patterns, unsafe code can use those bit patterns for its own purposes, which will clash with later allowing those bit patterns in the validity invariant. However, uninitialized bits -- even if they are valid -- cannot be detected by the program: reading them is either UB or perhaps produces some string of zeros or ones (possibly a non-deterministic one). So that kind of counter-example is right out. Moreover, because uninitialized bits will never be safe, unsafe code can't be fed them from unknown/external sources. IOW the only way a library/module/etc. will ever see uninitialized bits is if it produces them itself or obtains them from its "trusted base" (e.g., a function whose functional correctness is relevant to the soundness of the library/module/etc.), in which case it's UB today and its problem is an internal bug, not the interface with the rest of the world. |
I would be inclined to permit uninitialized data in integers. My reasoning is as follows: I think that backwards compatibility around Second, I think that the crossbeam example which @RalfJung raised earlier feels like the kind of tricky thing that people shouldn't have to fret about. In particular, if you have uninitialized padding bits in structs and things like that which are known to be less than word size, I think you should be able to treat them as integers for convenience, and it seems like that would be Insta-UB under this proposal. Basically, I think people are likely to do things (like crossbeam) that wind up using uninitialized bits in integers but which don't necessarily "feel" like cases where That said, I find the most compelling argument against permitting uninitialized bits to be that it would allow us to declare such uses an error. But it seems like we could still have a sort of "lint", where we say "ah, you are using uninitialized data outside of a One meta-question: Suppose that I do have a clever algorithm that makes use of uninitialized integers. For example, a trick I have used in a past life was to have an integer set that had O(1) initialization cost regardless of its capacity. This worked by having two vectors of integers. One of them started out with size N but had uninitialized data. The other started out as size 0 but had initialized data. To add to the set, you checked one against the other (I can give details if desired). The key point is that you had to read and use an uninitialized integer in the process and compare it against initialized data -- is reading such an integer UB? |
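For readers who have not seen the integer-set trick, here is a minimal sketch of the data structure in safe Rust. It assumes the usual two-array "sparse set" layout; the names `SparseSet`, `dense` and `sparse` are chosen for illustration. Unlike the version described above, this sketch zero-initializes `sparse`, so it gives up the O(1) construction cost in exchange for staying within defined behavior. The membership check is written so that any garbage value in `sparse` would still be rejected, which is exactly why the original trick wants to read uninitialized integers.

```rust
/// Set of integers in `0..capacity`.
struct SparseSet {
    /// Members of the set, in insertion order.
    dense: Vec<usize>,
    /// For a member `v`, `sparse[v]` is the index of `v` in `dense`.
    /// For non-members the entry is arbitrary and is never trusted on its own.
    sparse: Vec<usize>,
}

impl SparseSet {
    fn new(capacity: usize) -> Self {
        SparseSet {
            dense: Vec::with_capacity(capacity),
            // The original trick would leave this allocation uninitialized.
            sparse: vec![0; capacity],
        }
    }

    /// Panics if `v >= capacity`.
    fn contains(&self, v: usize) -> bool {
        let i = self.sparse[v];
        // Cross-check the two arrays: a stale or arbitrary `i` fails one of these tests.
        i < self.dense.len() && self.dense[i] == v
    }

    /// Returns `true` if `v` was newly inserted.
    fn insert(&mut self, v: usize) -> bool {
        if self.contains(v) {
            return false;
        }
        self.sparse[v] = self.dense.len();
        self.dense.push(v);
        true
    }

    /// O(1) reset: stale entries in `sparse` are harmless.
    fn clear(&mut self) {
        self.dense.clear();
    }
}
```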
The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitialized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it) require operating on uninitialized bits. So if we want to allow them, we need to not only allow reading uninitialized memory but also make arithmetic and comparisons on them defined (and reasonably deterministic! So even though I agree that it would be best to support these things, I do not see a reasonable way to achieve that. |
Notice that this only applies to the So we have a two-level discussion here:
I'd prefer the answer to the first question to be "no" so that the second question doesn't even need answering, but unfortunately @nikomatsakis has some pretty good arguments. ;) Coming to the second question, as @rkruppe said these patterns (I think you are talking about https://research.swtch.com/sparse, right?) are still not legal. Comparing an uninitialized integer with anything is either UB or produces an uninitialized boolean, branching on which is UB. But once LLVM has |
Hm, if getting people to use such an intrinsic is acceptable (I think the biggest source of worry is people reasoning naively about uninitialized as "initialized to an arbitrary bit string" and not even checking), then we can already build such an intrinsic today, it just has to do something that all LLVM optimizations have to assume could initialize the memory (e.g., some inline asm). This will have some unfortunate impact on optimizations unrelated to uninitialized memory, but it will still be localized to uses of that intrinsic. |
Fair, but I see no model that makes this work. This is actually also a reason why I'd prefer to not allow uninitialized data in integers -- that may be more surprising, but it is easier to explain and very concise: "No uninitialized data outside If we allow uninitialized bits but then almost all operations are UB, it becomes something more like "No uninitialized data outside Basically, we are breaking expectations anyway, and maybe it is better to break them more but in simpler ways, than to figure out how to break them as little as possible while still breaking them in ways that are much more complicated to explain. That seems plausible to me. Not sure if it makes any sense.^^ |
If uninitialized bits in an integer are made instantly invalid, is it possible to do the It seems like it should be possible to use an atomic I'm personally on the side of making uninitialized integers invalid so long as we don't lose anything (but |
The case of I've been toying with the idea of a
/// Writes 0 to all padding bytes in `val` and returns a mutable slice of the bytes in `val`.
fn clear_padding_bytes<T>(val: &mut T) -> &mut [u8]; |
No. Notice though that the state of this trick (even the load/store part) is dubious in LLVM as well -- poison is infecting the entire value when just one of the bytes loaded from memory is poisoned, and there are proposals to replace undef by poison.
Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.
I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes. But a version of this which just picks arbitrary bit patterns for all uninitialized and padding bytes would be implementable, that's exactly what |
I don't see what the problem is? The compiler knows the layout of type
In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation. This will cause padding bytes to be "ignored" by the compare_exchange hardware instruction since they will always be 0. |
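A library can approximate this effect today without any compiler support by never letting padding reach the atomic in the first place. The sketch below packs a small struct into a `u64` by hand, so the bits that would have been padding are always zero. `Pair`, `pack`, `unpack` and `AtomicPair` are hypothetical names for illustration only; this is not how crossbeam or any proposed intrinsic is implemented, and it only works for types whose fields fit in an integer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Clone, Copy, Debug, PartialEq)]
struct Pair {
    tag: u8,
    value: u32,
}

/// Packs `p` into a `u64`; bits 40..64 are always zero, so there is no padding to compare.
fn pack(p: Pair) -> u64 {
    ((p.tag as u64) << 32) | (p.value as u64)
}

/// Inverse of `pack`.
fn unpack(bits: u64) -> Pair {
    Pair {
        tag: (bits >> 32) as u8,
        value: bits as u32,
    }
}

struct AtomicPair(AtomicU64);

impl AtomicPair {
    fn new(p: Pair) -> Self {
        AtomicPair(AtomicU64::new(pack(p)))
    }

    fn load(&self) -> Pair {
        unpack(self.0.load(Ordering::SeqCst))
    }

    /// Returns `Ok(previous)` on success and `Err(actual)` on failure.
    fn compare_exchange(&self, current: Pair, new: Pair) -> Result<Pair, Pair> {
        self.0
            .compare_exchange(pack(current), pack(new), Ordering::SeqCst, Ordering::SeqCst)
            .map(unpack)
            .map_err(unpack)
    }
}
```

Because every stored value goes through `pack`, byte equality of the `u64` coincides with semantic equality of `Pair`, so the hardware compare-exchange cannot fail spuriously on padding.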
Oh, you mean statically -- well, for enums that'll require dynamic checks as well. And of course this is hopeless for unions.
That would also guarantee that the value you read has 0 for all padding bytes, which I am fairly sure they do not want to guarantee. And anyway I see no way to implement this behavior even remotely efficiently. I think a more reasonable operational version amounts to saying that you freeze both values before comparing -- that at least avoids comparing uninitialized bytes, and it makes compiling to a simple CAS correct. But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop. Also this assumes the atomic operation knows the correct type, whereas AFAIK for LLVM atomic operations only work on integer types -- at which point you cannot know which bytes are padding. |
I believe spurious comparison failures can be fixed like so:
    // Assume `T` can be transmuted into `usize`.
    fn compare_and_swap(&self, current: T, new: T) -> T {
        // Freeze `current` and `new` and transmute them into `usize`s.
        let mut current: usize = freeze_and_transmute(current);
        let new: usize = freeze_and_transmute(new);
        loop {
            unsafe {
                // `previous` is already frozen because we only store frozen values into `inner`.
                let previous = self.inner.compare_and_swap(current, new, SeqCst);
                // If `previous` and `current` are byte-equal, then CAS succeeded.
                if previous == current {
                    return transmute(previous);
                }
                // If `previous` and `current` are semantically equal, but differ in uninitialized bits...
                let previous_t: T = transmute(previous);
                let current_t: T = transmute(current);
                if previous_t == current_t {
                    // Then try again, but substitute `current` for `previous`.
                    current = previous;
                    continue;
                }
                // Otherwise, CAS failed and we return `previous`.
                return transmute(previous);
            }
        }
    }
Now it's still possible to have a spurious failure in the first iteration of the loop, but the second one will definitely succeed (unless the atomic was concurrently modified). In fact, this is exactly how CAS in |
I think this is the key invariant here -- if you can make that work, then yes I can think there is a consistent semantics here. However, notice that you have |
We seem to be trying to answer two different questions here that are entangled. One is whether integers and such can be uninitialized or not. The other one, which is most fundamental, is whether we want to support using uninitialized memory outside unions or not, in general. I think we should answer this question first, and use the answer to drive the rest of the design. We can't think of allowing uninitialized memory on integers without also considering that the validity of integers and raw pointers is going to be alike, so we are also allowing uninitialized memory on raw pointers. Raw pointers can point to DSTs, so we also need to be thinking whether we want to allow uninitialized memory in pointers to DSTs (for the whole pointer, some part of it, etc.). I agree with @nikomatsakis that a lot of code is using The issue that some algorithms require uninitialized memory has shown up. The question that hasn't been answered yet AFAICT is whether |
Just to clarify for everyone else following this thread: @RalfJung and I discussed this and the conclusion was that we'll simply remove |
About whether uninitialized data is okay in integers at all: we discussed this at the all-hands. The general consensus seems to be that we should permit uninitialized data in integers and raw pointers. There is just too much existing code doing stuff like
    let x: [u8; 256] = mem::uninitialized();
    // go on
or
    let x: SomeFfiStruct = mem::uninitialized();
    // go on
Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri. Incidentally, that matches what Miri already currently implements, mostly for pragmatic reasons (libstd already violated the rules about uninitialized integers when I wrote the checks -- I think I have since moved it to |
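For comparison, the first pattern can be written so that no uninitialized integer is ever produced, regardless of how this question is decided. The following is a minimal sketch using `std::mem::MaybeUninit`; the function name `fill_buffer` and the fill pattern are made up for illustration.

```rust
use std::mem::MaybeUninit;

/// Builds a `[u8; 256]` without ever materializing an uninitialized `u8`.
fn fill_buffer() -> [u8; 256] {
    // The uninitialized state lives inside `MaybeUninit`, which is allowed to hold it.
    let mut buf: MaybeUninit<[u8; 256]> = MaybeUninit::uninit();
    let ptr = buf.as_mut_ptr() as *mut u8;
    for i in 0..256 {
        // SAFETY: `ptr.add(i)` stays inside the 256-byte allocation, and
        // writing through a raw pointer does not read the old contents.
        unsafe { ptr.add(i).write(i as u8) };
    }
    // SAFETY: every byte was written in the loop above.
    unsafe { buf.assume_init() }
}
```

The array only acquires the type `[u8; 256]` at `assume_init`, after every element has been written.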
Was it also discussed whether this was worth the benefit of a simpler and more teachable correctness model for unsafe code ? (independently of whether this model can be better checked with miri or not?).
I think so too. |
Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"? |
No, but I do think that "uninitialized memory is only allowed inside unions" is much much simpler and teachable than all other alternatives that are currently being discussed. |
Another (drastic?) option to consider that allows Disallow uninitialized ( This may degrade some performance around uses of Also it could degrade talking about "uninitialized memory", because it then introduces both the "frozen uninitialized" behind
|
@CAD97 Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement @Amanieu suggested we should allow uninitialized values in integers only for "backwards compatibility reasons". This would have a similar effect. It might, however, incur a performance cost on such existing code. Porting code to However, I suspect people will still want to keep memory unfrozen when e.g. calling a known |
We should definitely do this.
+1.
The So I find the argument that "integers and pointers shall support uninitialized bit-patterns because there is too much code using |
For those who might be a bit confused because the conversation bounced around to lots of different places, a lot of the interesting cases are spelled out very explicitly in MaybeUninit's docs. For instance, the "I want to Read into an uninitialized buffer" case is explicitly called out as incorrect here: https://doc.rust-lang.org/std/mem/union.MaybeUninit.html#incorrect-usages-of-this-method-1 |
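For the `Read` case specifically, the conservative alternative is to hand the reader an initialized buffer and accept the cost of zeroing it. A minimal sketch, assuming a small fixed scratch size; the function name `read_some` and the 16-byte size are made up for illustration:

```rust
use std::io::Read;

/// Reads up to 16 bytes from `src` into a zero-initialized buffer.
fn read_some<R: Read>(mut src: R) -> std::io::Result<Vec<u8>> {
    // Zeroing up front means `read` only ever sees initialized `u8`s.
    let mut buf = vec![0u8; 16];
    let n = src.read(&mut buf)?;
    // Keep only the bytes the reader reported as written.
    buf.truncate(n);
    Ok(buf)
}
```

For example, `read_some(&b"hello"[..])` reads from a byte slice, which implements `Read`.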
In that issue when someone asked about the exact situation I referenced you told them it was off topic and to discuss it in this issue. |
You are quoting me incorrectly. :) |
So, no, this is not "the exact situation" you referenced. It is a very different situation. |
rust-lang/rust#98919 asks the lang team to decide that uninit integers (and floats and raw pointers) are UB. |
98919 was closed with "yes, uninit ints/floats/raw pointers are instant UB to move" so the answer to
is "no", so we don't need to answer how it behaves. Should this get closed, then? Not sure when UCG issues get closed, but this looks like it's been decided by the lang team. |
They get closed when we have done the writeup. ... which we haven't done in years... |
From reading through this thread, there seems to have been a pretty dramatic reversal from the original loose consensus of "it isn't worth effort/complexity to make uninitialized integers UB". As far as I can tell, this was partially due to the fact that LLVM is making its optimization passes more conservative, and will require an explicit I'm not trying to re-litigate the already-completed FCP. However, I'm concerned that this is going to be difficult to teach - I suspect people will fall back to their (incorrect) mental model of "a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine". Is there a concrete example of an optimization we can point to that explains why this behavior is useful? Ideally, it would:
My overall concern is that people might (implicitly) decide that this is too "weird" or "theoretical" without a concrete justification for this "weirdness", and write incorrect code that never gets run under Miri. Of course, undefined behavior is defined by the behavior of the Abstract Machine, but I think being able to say "this 'weirdness' is required for the following useful snippet to be optimized well" will really help to reinforce that in practice. |
The obvious example is the Itanium Not a Thing, where using uninitialized data will actually cause a CPU-level trap. The usual concrete example of uninitialized memory is reading from MADV_FREEd pages returning nondeterministic results based on what the state of the page allocation is. AIUI, the main benefit of This probably wouldn't be that big of a deal for just There's also the fact that we will replace by-value passing with by-ref passing when the value is large. This can be optimized into a continued move — but if one of the steps is And it's not a super strong motivator, but it'd be an annoying footgun if |
I don't really find your examples that convincing, TBH:
|
One example is running the code under Valgrind. Since |
Note that this is wrong even if we ignore uninit memory -- |
The model of But I assume that people who make the assumption that |
Not quite -- clang adds |
At this point I've mostly given up on this topic and accepted that it's going to be UB, but I do feel like it is very close to a breaking change -- it feels like there's almost no way that someone could predict in, say, 2016 that uninitialized Even after that, for a long time (for most of the period this thread has been open, for example), the consensus was on the side of "this is probably fine" too... It does feel like soon after LLVM adds optimizations based on (Separately, I do (strongly) think there is a need to support an unsafe |
If someone had designed a proper op.sem in 2016, I think they would have quickly come to the conclusion that at least this is an unanswered question and we should clearly tell people to avoid doing this until there is an answer. But sadly unsafe Rust started with a pre-rigorous phase and it'll take a while for that to shake out. This was definitely predictable when this paper got published, but it took a while for that realization to spread and even longer until suggesting 'uninit integers are UB' seemed like a proposal that has realistic chances of being accepted in Rust. I thought for a long time that maybe we can allow 'fully uninit integers' without too much cost, but then it turns out that even that already blocks a bunch of optimizations in safe code that we really want. I know it's painful but I also don't know what else we could have done, other than stabilizing Regarding optimizations, note that this is less "LLVM becoming more conservative" and more "LLVM fixing optimizations that caused real-world miscompilations". I don't know why you framed this in such a strange way, @Aaron1011. |
@RalfJung: I didn't mean to imply that the reasons for |
I don't think we can, what does this have to do with references?
So should we say that uninit integers are fine only on the stack but not on the heap? That sounds quite terrible. We could declare MADV_FREE memory entirely unsupported in Rust. But this is going down the train of not having poison/undef in the language, which I think is the wrong direction -- there's a reason basically every compiler has such a concept, they are crucial for getting good codegen. I also think a magic
I think it's a bit unfair to say that we are doing this based on what's "most convenient". If we had declared that integer types can hold uninit memory, we would have had bugreports saying "I wrote this safe code and codegen didn't produce the results I wanted", and we'd have to explain each time that it's because other people write unsafe code that we wanted to allow and we can't have our cake and eat it, too. There's no easy answer here, and "make safe code pay in performance for things unsafe code sometimes wants to do" is not obviously a good outcome, either. I fully agree though that we need to have firmer semantics that people can rely on, that's the entire point of the vast majority of my work on Rust. |
FWIW, I think it's kind of the contrary: I find the Anyway, regarding freeze, I think something that people don't realize is that Doing a There is, at present, no way to support "uninitialized memory" in LLVM that is both efficient (i.e. does not end up just initializing memory anyway) and does not leak things like undef/poison semantics into Rust's semantics (with all the weirdness that comes with it, like So just to spell this out clearly, I believe the options you get (from an LLVM perspective) are:
|
Closing as answered |
Discussing the validity invariant of integer and floating point types.
Clearly, every possible bit pattern is allowed. For integers they all have a distinct and meaningful interpretation, and we have a safe NOP-conversion between f32 and u32, and f64 and u64, through to_bits and from_bits.
The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value? We could reasonably decide either way. Also, when an integer is partially uninitialized, does that "infect" the entire integer or do we exactly preserve which byte is initialized?
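As a small illustration of the safe conversion mentioned above, the sketch below round-trips an `f32` through its `u32` bit pattern using the standard `to_bits` and `from_bits` methods. The helper name `float_to_int_and_back` is made up for illustration.

```rust
/// Returns the raw bit pattern of `x` and the float rebuilt from that pattern.
fn float_to_int_and_back(x: f32) -> (u32, f32) {
    let bits = x.to_bits();
    (bits, f32::from_bits(bits))
}
```

Both directions are safe functions, which is why every initialized bit pattern has to be valid for both types.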
2022-09-07: This has now pretty much been answered.