-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify str::from_utf8_unchecked's invariants #95895
Conversation
Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8." This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do. If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.
r? @kennytm (rust-highfive has picked a reviewer for you, use r? to override) |
@bors r+ rollup |
📌 Commit b92cd1a has been approved by |
Clarify str::from_utf8_unchecked's invariants Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8." This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do. If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't. Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement
Clarify str::from_utf8_unchecked's invariants Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8." This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do. If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't. Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement
Rollup of 7 pull requests Successful merges: - rust-lang#95008 ([`let_chains`] Forbid `let` inside parentheses) - rust-lang#95801 (Replace RwLock by a futex based one on Linux) - rust-lang#95864 (Fix miscompilation of inline assembly with outputs in cases where we emit an invoke instead of call instruction.) - rust-lang#95894 (Fix formatting error in pin.rs docs) - rust-lang#95895 (Clarify str::from_utf8_unchecked's invariants) - rust-lang#95901 (Remove duplicate aliases for `check codegen_{cranelift,gcc}` and fix `build codegen_gcc`) - rust-lang#95927 (CI: do not compile libcore twice when performing LLVM PGO) Failed merges: r? `@ghost` `@rustbot` modify labels: rollup
/// This function is unsafe because it does not check that the bytes passed to | ||
/// it are valid UTF-8. If this constraint is violated, undefined behavior | ||
/// results, as the rest of Rust assumes that [`&str`]s are valid UTF-8. | ||
/// | ||
/// [`&str`]: str | ||
/// The bytes passed in must be valid UTF-8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't exactly right: A &str
is allowed to contain invalid utf-8, but other functions might assume that a &str
is valid utf-8, making a non-utf-8 &str
very hard to safely use. But just calling this function with invalid utf-8 is, by itself, not unsafe.
See also e.g. https://doc.rust-lang.org/stable/std/string/struct.String.html#method.from_utf8_unchecked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true, but this is perhaps even handled just by if the validity of &T
is independent from whether the pointee bytes are valid at type T
.
In general, from_unchecked
functions do just say that it is UB to provide an argument which does not satisfy the safety invariant. This is useful, because it allows a sanitizing implementation of the function which checks the precondition.
A postcondition of "is not used in any way which causes UB" is much more difficult to reason about.
Another point is that str
has the option of using &*(v as *const [u8] as *const str)
to construct a &str
to invalid-UTF-8. String
doesn't have any such API, relying on conversion to/from Vec<u8>
.
If any str
/String
methods actually documented that they were safe to call with a reference to invalid UTF-8, then this weaker documentation requirement makes sense. As is, the only possible thing to do with a str
/String
to invalid UTF-8 is to forget
it. With that the case, the clearer precondition seems better to use.
Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that
&str
s are valid UTF-8."This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.
If user code wants to create an unsafe
&str
pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement