Clarify str::from_utf8_unchecked's invariants #95895

CAD97 · 2022-04-10T20:05:29Z

Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that &strs are valid UTF-8."

This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.

If user code wants to create an unsafe &str pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.

Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement
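As a rough illustration of the precondition described above (this sketch is not part of the PR), a caller would normally rely on `str::from_utf8_unchecked` only when the bytes are valid UTF-8 by construction. The helper `hex_pair` below is hypothetical and exists only for this example; it fills a buffer with ASCII hex digits, so the precondition holds before the unchecked call is made.

```rust
use std::str;

/// Hypothetical helper for illustration: writes the two lowercase hex digits
/// of `byte` into `buf` and returns them as a `&str`.
fn hex_pair(byte: u8, buf: &mut [u8; 2]) -> &str {
    const DIGITS: &[u8; 16] = b"0123456789abcdef";
    buf[0] = DIGITS[(byte >> 4) as usize];
    buf[1] = DIGITS[(byte & 0x0f) as usize];
    // SAFETY: both bytes were just copied from `DIGITS`, which is pure ASCII,
    // and every ASCII sequence is valid UTF-8. The precondition of
    // `from_utf8_unchecked` is therefore met at the point of the call.
    unsafe { str::from_utf8_unchecked(&buf[..]) }
}
```

The point of the sketch is that the `SAFETY` comment argues about the bytes at the call site, not about how the resulting `&str` is used later.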

rust-highfive · 2022-04-10T20:05:32Z

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

Dylan-DPC · 2022-04-11T13:46:58Z

@bors r+ rollup

bors · 2022-04-11T13:47:01Z

📌 Commit b92cd1a has been approved by Dylan-DPC

Rollup of 7 pull requests

Successful merges:

- rust-lang#95008 ([`let_chains`] Forbid `let` inside parentheses)
- rust-lang#95801 (Replace RwLock by a futex based one on Linux)
- rust-lang#95864 (Fix miscompilation of inline assembly with outputs in cases where we emit an invoke instead of call instruction.)
- rust-lang#95894 (Fix formatting error in pin.rs docs)
- rust-lang#95895 (Clarify str::from_utf8_unchecked's invariants)
- rust-lang#95901 (Remove duplicate aliases for `check codegen_{cranelift,gcc}` and fix `build codegen_gcc`)
- rust-lang#95927 (CI: do not compile libcore twice when performing LLVM PGO)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup

m-ou-se · 2022-06-27T16:14:16Z

library/core/src/str/converts.rs

-/// results, as the rest of Rust assumes that [`&str`]s are valid UTF-8.
-///
-/// [`&str`]: str
+/// The bytes passed in must be valid UTF-8.


This isn't exactly right: A &str is allowed to contain invalid utf-8, but other functions might assume that a &str is valid utf-8, making a non-utf-8 &str very hard to safely use. But just calling this function with invalid utf-8 is, by itself, not unsafe.

See also e.g. https://doc.rust-lang.org/stable/std/string/struct.String.html#method.from_utf8_unchecked

CAD97 · 2022-06-27

This is true, though it may be settled simply by whether the validity of &T is independent of whether the pointee bytes are valid at type T.

In general, from_unchecked functions do just say that it is UB to provide an argument which does not satisfy the safety invariant. This is useful, because it allows a sanitizing implementation of the function which checks the precondition.

A postcondition of "is not used in any way which causes UB" is much more difficult to reason about.
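The sanitizing implementation mentioned above could look roughly like the following. This is a minimal sketch, not the standard library's implementation; the name `from_utf8_unchecked_sanitized` is hypothetical. Because invalid input is already a precondition violation on the caller's side, a build with debug assertions enabled is free to check the bytes and panic.

```rust
use std::str;

/// Hypothetical wrapper for illustration only.
///
/// # Safety
///
/// The bytes passed in must be valid UTF-8.
unsafe fn from_utf8_unchecked_sanitized(bytes: &[u8]) -> &str {
    // With debug assertions enabled, verify the precondition and panic on a
    // violation. With them disabled, this check compiles away entirely.
    debug_assert!(
        str::from_utf8(bytes).is_ok(),
        "precondition violated: bytes are not valid UTF-8"
    );
    // SAFETY: the caller guarantees that `bytes` is valid UTF-8.
    unsafe { str::from_utf8_unchecked(bytes) }
}
```

A check like this is only permissible when the documented requirement is a precondition on the argument. Under a requirement phrased as "the result must not be used in a way that causes UB", the function itself could not reject invalid input.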

Another point is that str has the option of using &*(v as *const [u8] as *const str) to construct a &str to invalid-UTF-8. String doesn't have any such API, relying on conversion to/from Vec<u8>.
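For reference, the cast expression mentioned above can be wrapped as shown below. This is only a sketch: the function name `bytes_as_str_via_cast` is hypothetical, and it is kept `unsafe` and documented with the same UTF-8 requirement, since nothing in this thread establishes that any `str` method is safe to call on invalid UTF-8.

```rust
/// Hypothetical helper for illustration: reinterprets a byte slice as `str`
/// through a raw-pointer cast, without any validation.
///
/// # Safety
///
/// For this sketch, callers are expected to pass valid UTF-8 only.
unsafe fn bytes_as_str_via_cast(v: &[u8]) -> &str {
    // `*const [u8]` and `*const str` are both wide pointers carrying a length,
    // so the cast keeps the address and length and changes only the type.
    // SAFETY: `v` is a valid reference, so the pointer is non-null, aligned,
    // and points to `v.len()` readable bytes for the returned lifetime.
    unsafe { &*(v as *const [u8] as *const str) }
}
```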

If any str/String methods actually documented that they were safe to call with a reference to invalid UTF-8, then the weaker documentation requirement would make sense. As it is, the only possible thing to do with a str/String holding invalid UTF-8 is to forget it. That being the case, the clearer precondition seems better to use.

rust-highfive assigned kennytm Apr 10, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 10, 2022

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 11, 2022

Dylan-DPC mentioned this pull request Apr 11, 2022

Rollup of 7 pull requests #95938

Closed

This was referenced Apr 11, 2022

Rollup of 7 pull requests #95941

Closed

Rollup of 7 pull requests #95944

Merged

bors merged commit ae6f75a into rust-lang:master Apr 12, 2022

rustbot added this to the 1.62.0 milestone Apr 12, 2022

m-ou-se reviewed Jun 27, 2022

CAD97 deleted the patch-2 branch June 27, 2022 22:37

CAD97 mentioned this pull request Jun 27, 2022

Clarify String::from_utf8_unchecked's invariants #98596

Closed
