remove language-level UB for non-UTF-8 str #71033

RalfJung · 2020-04-11T16:11:21Z

This is the Rust-side issue for rust-lang/reference#792 just so that we can use fcpbot. The change description follows.

Ever since Rust 1.0, the reference said that a non-UTF-8 str causes immediate UB. In terms of today's terminology, that means that str has a validity invariant of being valid UTF-8.

However, that seems unnecessary: the compiler does not actually exploit this, nor is there any clear way it could exploit this. Making UTF-8 a library-level safety invariant is more than enough for everything str does. Most likely, it was made a validity invariant because we had not yet properly teased apart those two concepts when the document was initially written.

This is also the conclusion that the UCG WG arrived at in rust-lang/unsafe-code-guidelines#78.

I therefore propose we remove the UTF-8 clause from the language spec, so that str will have the same validity invariant as [u8].
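
A minimal sketch of what "library-level" means here, assuming only stable std APIs: safe code can only turn bytes into a `&str` through a checked constructor, and `&str` and `&[u8]` are both (pointer, length) pairs. The helper names `is_valid_str` and `same_fat_pointer_size` are hypothetical and exist only for this illustration.

```rust
/// Safe code obtains a `&str` from bytes through a checked constructor
/// such as `std::str::from_utf8`, which upholds the library-level
/// (safety) invariant that the contents are valid UTF-8.
pub fn is_valid_str(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// `&str` and `&[u8]` are both fat pointers (address plus length),
/// so this comparison is expected to hold.
pub fn same_fat_pointer_size() -> bool {
    std::mem::size_of::<&str>() == std::mem::size_of::<&[u8]>()
}
```

With these helpers, `is_valid_str(b"\xff\xff")` is false while `is_valid_str(b"hello")` is true: the UTF-8 check lives in the library's safe API rather than in the type's validity invariant.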


lcnr · 2020-04-11T16:48:24Z

Would this mean that the following stops being UB?

use std::mem;

fn main() {
    // Being valid utf8 is still a safety invariant of `str`.
    // As any method using `str` may depend on this invariant,
    // it would still be UB to use `str::from_utf8_unchecked`
    // or `str::as_bytes` here.
    let s: &str = unsafe { mem::transmute(b"\xff\xff" as &[u8]) };
    let bytes: &[u8] = unsafe { mem::transmute(s) };
    assert_eq!(bytes, &[0xff, 0xff][..]);
}

RalfJung · 2020-04-11T17:07:25Z

@lcnr yes exactly.

> it would still be UB to use str::from_utf8_unchecked
> or str::as_bytes here.

To be more precise, it would be library UB -- the way the methods are implemented right now, there is actually no language-level UB and nothing that Miri could possibly find, but the library is permitted to change in the future in ways that would make this language UB.
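
One way unsafe code can stay clear of library UB is to validate first and only then call the unchecked constructor. The following is a sketch under that assumption; `valid_prefix` is a hypothetical helper, not something proposed in this thread.

```rust
/// Hypothetical helper: returns the longest valid UTF-8 prefix of `bytes`.
pub fn valid_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => {
            let good = &bytes[..e.valid_up_to()];
            // SAFETY: `from_utf8` reported that everything before
            // `valid_up_to()` is well-formed UTF-8, so the library
            // (safety) invariant of `str` holds for this prefix.
            unsafe { std::str::from_utf8_unchecked(good) }
        }
    }
}
```

Because the unchecked call is only reached with bytes that were already validated, the function is sound regardless of how the library's `str` methods are implemented.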

Centril · 2020-04-11T18:17:19Z

Dear community and language team.

As Ralf notes, there are no compelling reasons to keep this Undefined Behavior (UB) at the level of the abstract machine in terms of the validity invariant of str. Keeping it UB at this level therefore only complicates the language definition, with no notable benefits.

Instead, we can make this a library invariant, and leave it as "library UB" or "unspecified behavior". Indeed, this is probably what we always meant by the note in the reference. I hereby propose that we accept that new definition:

@rfcbot merge

rfcbot · 2020-04-11T18:17:20Z

Team member @Centril has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!


eddyb · 2020-04-12T02:22:51Z

> there is actually no language-level UB and nothing that Miri could possibly find

This leads to an interesting question: what can miri find?
My first guess would be some unsafe { unreachable_unchecked() } call in UTF-8 decoding.

It would be great to figure out if there is something like this that miri does detect, even if miri stops checking str values for UTF-8 validity altogether, and use it as a test case.

RalfJung · 2020-04-12T07:06:20Z

> It would be great to figure out if there is something like this that miri does detect, even if miri stops checking str values for UTF-8 validity altogether, and use it as a test case.

Note that Miri does not check behind references, so while str would be checked, that type is basically unused, and &str is not checked.

RalfJung · 2020-04-12T08:01:28Z

> This leads to an interesting question: what can miri find?
> My first guess would be some unsafe { unreachable_unchecked() } call in UTF-8 decoding.

I scrolled over str/mod.rs to see how the invariant gets used. I certainly missed some things, but here is what stood out:

Calling char::from_u32_unchecked, whose argument must be a valid Unicode scalar value. Miri checks this.
The searching/splitting talks a lot about indices being at unicode boundaries. I do not know what happens if they are not.
I suspect somewhere it might also lead to out-of-bounds accesses when it thinks there is a 4-byte character following, but only 2 bytes are left in the buffer.
I did not find any unreachable/unreachable_unchecked.
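
The first and third observations can be illustrated with the checked counterparts in std. This is only a sketch: `checked_char` and `is_truncated` are hypothetical names, and the code shows what the checked APIs reject rather than what `str/mod.rs` does internally.

```rust
/// `char::from_u32` is the checked counterpart of
/// `char::from_u32_unchecked`: it returns `None` for surrogates and
/// for values above U+10FFFF.
pub fn checked_char(v: u32) -> Option<char> {
    char::from_u32(v)
}

/// Returns true if `bytes` stops in the middle of a multi-byte sequence,
/// e.g. a lead byte announcing 4 bytes with only 2 bytes present.
pub fn is_truncated(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(_) => false,
        // `error_len()` is `None` when the input ended unexpectedly.
        Err(e) => e.error_len().is_none(),
    }
}
```

For example, `is_truncated(b"\xf0\x9f")` is true: the lead byte 0xF0 announces a 4-byte sequence, but the buffer ends after 2 bytes. Code that skips this check and trusts the lead byte is where an out-of-bounds read could come from.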

joshtriplett · 2020-04-12T20:31:01Z

This seems entirely reasonable to me. If you never call any of str's functions, just storing non-UTF-8 in it shouldn't cause any issue.

I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.

cc @BurntSushi: Would this change potentially simplify bstr?

RalfJung · 2020-04-12T22:48:26Z

> I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.

Yes I think that is definitely possible. It is somewhat similar to, for example, how we promise that Vec::push will keep existing pointers working if it does not reallocate (though that is AFAIK not very clearly documented): library methods can make extra promises beyond the ways they could be used in safe code.
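
A small sketch of the Vec::push promise mentioned above, assuming the documented guarantee that `push` does not reallocate while the reported capacity is sufficient. The function name `push_keeps_buffer` is made up for this example.

```rust
/// Pushes onto a vector that still has spare capacity and reports
/// whether the buffer stayed at the same address.
pub fn push_keeps_buffer() -> bool {
    let mut v: Vec<u32> = Vec::with_capacity(4);
    v.push(1);
    let before = v.as_ptr();
    // len (1) is below the capacity (at least 4), so this push
    // must not reallocate and the buffer address stays the same.
    v.push(2);
    before == v.as_ptr()
}
```

Unsafe code holding a raw pointer into the buffer relies on exactly this kind of extra library-level promise, which safe code alone could never observe being broken.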

steveklabnik · 2020-04-13T02:45:07Z

Shouldn't this be an RFC, rather than just an issue on the Rust tracker?

Centril · 2020-04-13T02:50:47Z

The RFC format feels excessive as a matter of "filling in the forms" (I don't know that it would add anything). The design choices here aren't too complicated, and there doesn't seem to be any opposition to the change so far, given that LLVM et al. cannot exploit this. I feel the FCP should provide the same chance for community input, given the usual coverage in TWiR etc.

nikomatsakis · 2020-04-13T14:34:43Z

I think the main point is that all the library's routines are available as public routines to be invoked on any &str. So a str that is exposed to safe code must meet those invariants. This doesn't necessarily have to come from a validity invariant, but it is still a core property of str.

BurntSushi · 2020-04-13T16:56:36Z

I think this change makes a lot of sense. I've always kind of wondered why str's UTF-8 invariant was defined at the language level. Although, since this is moving a language invariant to a library invariant, it kind of feels like the libs team should be consulted? cc @rust-lang/libs And in particular, it would be great to hear from @SimonSapin

Also, are there any parts of the str API docs in std that need to be updated as a result of this change? (It looks like the current docs don't really mention UB much here.)

> cc @BurntSushi: Would this change potentially simplify bstr?

Possibly, but it's tricky business. For example, one might reasonably exclude substring searching from UB if neither the needle nor the haystack were valid UTF-8. Then bstr could use std's substring searching instead of rolling its own. But bstr already has a substring searcher, so that cost has somewhat already been paid.

bstr would still need to roll its own UTF-8 decoding to implement its own version of chars, for example, since chars is probably a method that you want to keep UB when the str isn't valid UTF-8.

SimonSapin · 2020-04-13T18:42:22Z

This change sounds reasonable to me from the point of view of how the language is defined. I don’t expect it will have much or any practical impact for users of the language.

scottmcm · 2020-04-16T19:11:44Z

I assume this would also be a stepping-stone to allowing str to become just struct str([u8]);?

~~@rfcbot reviewed~~
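
A sketch of what a library-defined `struct str([u8]);` could look like, using a hypothetical stand-in type `MyStr`. It assumes that `#[repr(transparent)]` over `[u8]` makes the pointer cast layout-compatible, which is the same pattern other wrappers around unsized slices use.

```rust
/// Hypothetical stand-in for `struct str([u8]);`.
#[repr(transparent)]
pub struct MyStr([u8]);

impl MyStr {
    /// Checked constructor: the only safe way to obtain a `&MyStr`.
    pub fn from_bytes(bytes: &[u8]) -> Option<&MyStr> {
        std::str::from_utf8(bytes).ok()?;
        // SAFETY: `MyStr` is `repr(transparent)` over `[u8]`, so the
        // cast preserves layout and length metadata. The bytes were
        // just validated, so the library invariant holds.
        Some(unsafe { &*(bytes as *const [u8] as *const MyStr) })
    }

    /// Always fine: the validity invariant is just that of `[u8]`.
    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    pub fn as_str(&self) -> &str {
        // SAFETY: every `MyStr` reachable from safe code went through
        // `from_bytes`, which validated the contents.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}
```

In this design UTF-8 well-formedness is purely a safety invariant maintained by the constructor; nothing about the type itself tells the compiler that the bytes are UTF-8.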

scottmcm · 2020-04-16T19:47:44Z

Hmm, actually, we talked about this in the lang meeting today, and Josh checked: UTF-8 encoding units do have a niche, since bytes of the form 11111xxx are never used. So one possibility would be to add a new Utf8EncodingUnit type with that as a validity invariant (like we have a validity invariant on NonZeroU8), and say that str is a newtype over [Utf8EncodingUnit].

However, this proposed change would prevent us from doing that in the future, as I'd be allowed to make a slice full of 255_u8 in unsafe code and call it a str, which would violate that hypothetical validity invariant, so I wanted to bring it up. [Edit: Oh, in fact such an example is in the first reply to this thread. If that were made legal, it could not reasonably become UB again in the future.]

I do strongly agree that the cross-byte UTF-8 well-formedness should not be in the validity invariant.

eddyb · 2020-04-16T20:12:19Z

@scottmcm Hmm, so 0xf8..=0xff are invalid UTF-8 bytes?

I haven't tried it yet but this should work today:

#[rustc_layout_scalar_valid_range_end(0xf7)]
struct Utf8EncodingUnit(u8);

I'm not sure if we can take advantage of this in any way, but it might be possible to force miri into validating it without any special-casing (just based on the validity range).
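
Since `rustc_layout_scalar_valid_range_end` is an internal, unstable attribute, here is a stable-Rust sketch of the same idea in which the range is enforced only by a constructor (a safety invariant, not a validity invariant). The type name `Utf8Byte` and the helper `all_bytes_fit` are hypothetical.

```rust
/// Hypothetical byte type restricted to 0x00..=0xF7. On stable Rust
/// the range is enforced by `new`, not by the compiler.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Utf8Byte(u8);

impl Utf8Byte {
    /// Largest byte value that is not of the form 11111xxx.
    pub const MAX: u8 = 0xF7;

    pub fn new(b: u8) -> Option<Utf8Byte> {
        if b <= Self::MAX { Some(Utf8Byte(b)) } else { None }
    }

    pub fn get(self) -> u8 {
        self.0
    }
}

/// Every byte of a valid `str` is expected to fit into `Utf8Byte`.
pub fn all_bytes_fit(s: &str) -> bool {
    s.bytes().all(|b| Utf8Byte::new(b).is_some())
}
```

`Utf8Byte::new(0xF8)` returns `None`, while every byte of a well-formed string passes, including the encoding of U+10FFFF.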

Nemo157 · 2020-04-16T20:16:30Z

Making even that guarantee would presumably leave any use of str::as_bytes_mut UB, as it probably is today (I believe that after the proposed changes it would become non-UB to use, as long as you fix the bytes before the &mut [u8] is dropped, which is something I would like to have code relying on).

EDIT: I guess it would fix the second example in the docs to be non-UB, but not being able to pass the &mut [u8] into other safe code would be very limiting still.
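
A sketch of the usage pattern described here, assuming the rule that the contents only need to be valid UTF-8 again before the borrow ends and the str is used. The function `replace_e_acute` is a hypothetical example, not an API from std.

```rust
/// Overwrites a leading two-byte "é" (0xC3 0xA9) with the ASCII bytes "e!".
pub fn replace_e_acute(s: &mut str) {
    if !s.starts_with('\u{e9}') {
        return;
    }
    // SAFETY: both bytes of the leading character are overwritten with
    // ASCII before the borrow ends, so the str is valid UTF-8 again by
    // the time anyone else can observe it.
    let bytes = unsafe { s.as_bytes_mut() };
    bytes[0] = b'e'; // buffer now starts with 0x65 0xA9: not valid UTF-8
    bytes[1] = b'!'; // buffer starts with "e!": valid again
}
```

Between the two writes the buffer is not well-formed UTF-8. With a per-byte validity invariant stricter than that of `u8`, intermediate states like this would need much more care, and passing the `&mut [u8]` to other safe code would remain off limits.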

RalfJung · 2020-04-16T20:17:23Z

> I'm not sure if we can take advantage of this in any way, but it might be possible to force miri into validating it without any special-casing (just based on the validity range).

Miri validates rustc_layout_scalar_valid_range_end without any extra work. But only by-value, not behind references.

SimonSapin · 2020-04-16T20:18:04Z

Can that niche ever be useful? Option<str> cannot exist since str: !Sized

SimonSapin · 2020-04-16T20:32:01Z

> Hmm, so 0xf8..=0xff are invalid UTF-8 bytes?

Correct. Those are 0b11111000..=0b11111111, and per https://tools.ietf.org/html/rfc3629#section-3 all well-formed byte sequences are one of:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(As a historical anecdote: Unicode is limited to U+0000..=U+10FFFF because of UTF-16, but the original design of UTF-8 supported up to U+7FFFFFFF (31 bits) with leading bytes 0xf8..=0xfd: https://tools.ietf.org/html/rfc2279#section-2. I think Unicode needing to grow beyond a million assigned code points is not something we need to worry about, given https://www.unicode.org/versions/stats/chart_charbyyear.html. And anyway Rust also limits char to U+10FFFF max.)
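
The table above can be read as a classifier on the leading byte. The following sketch (with a hypothetical function name `seq_len`) follows the bit patterns only. Full validation is stricter: RFC 3629 also notes that the octets C0, C1 and F5 to FF never appear in well-formed UTF-8.

```rust
/// Length of the sequence announced by `lead`, following the bit
/// patterns in the RFC 3629 table. Returns `None` for continuation
/// bytes (10xxxxxx) and for the unused pattern 11111xxx.
pub fn seq_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // 0xxxxxxx
        0xC0..=0xDF => Some(2), // 110xxxxx
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF7 => Some(4), // 11110xxx
        _ => None,              // 10xxxxxx or 11111xxx
    }
}
```

The final arm covers 0x80..=0xBF and 0xF8..=0xFF; the latter range is the niche discussed in this thread.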

eddyb · 2020-04-16T22:40:54Z

@SimonSapin That's not the only thing valid ranges do, we also expose them to LLVM.
So depending on what the str methods look like, they might be able to take advantage of this to consider the values outside of the range impossible, without using unreachable_unchecked.
Specifically, when reading UTF-8 bytes from behind a &str.

SimonSapin · 2020-04-16T23:10:39Z

I suppose LLVM could then eliminate a branch like str.as_bytes()[0] == 0xff but I have a hard time imagining a situation where it would come up in practice outside of test cases.

nikomatsakis · 2020-04-20T14:13:40Z

Actually, @RalfJung, I realize I'm a bit uncertain about how validity invariants interact with borrows -- if I have a local variable x: T that is mutably borrowed, must the validity invariant for type T hold at every moment, or only when the borrow is ended (by x being used again)? This seems relevant to race conditions and the like.

nikomatsakis · 2020-04-20T14:13:46Z

Perhaps better to move that off line =)

RalfJung · 2020-04-20T16:39:27Z

@nikomatsakis I think your question is tracked at rust-lang/unsafe-code-guidelines#84 :)

rfcbot · 2020-04-22T16:38:08Z

🔔 This is now entering its final comment period, as per the review above. 🔔

Kixunil · 2020-04-23T19:13:40Z

It would be great to make extra sure people don't confuse those two definitions of UB and start recklessly working with non-UTF-8 data in strings, thinking it's not UB anymore.

jdahlstrom · 2020-04-24T22:39:16Z

A minor point regarding terminology: Aren't these "UTF8 encoding units" called (UTF-8) code units in established Unicode parlance? Which, incidentally, is probably an argument in favor of adding a named type as it would simply reify an existing concept rather than introducing an ad-hoc one!

scottmcm · 2020-04-25T00:01:22Z

EDIT: Skip my post and read the next one

~~@jdahlstrom Good point: http://www.unicode.org/glossary/#code_unit~~

~~Utf8CodeUnit would be a much better name.~~

SimonSapin · 2020-04-25T04:04:08Z

Yes, “code unit” is established in Unicode but the code unit for UTF-8 is u8, not a specialization of u8 that has validity invariants of its own. It’s only for a UTF-8 sequence of bytes (a.k.a. 8-bit code units) that Unicode defines being “well-formed”.

The formal definitions for this start at https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404

RalfJung · 2020-04-25T10:57:09Z

FCP was originally requested for changing the validity invariant of str to that of [u8], but during discussion consensus seems to have shifted towards rather using that of [Utf8Byte] with

#[rustc_layout_scalar_valid_range_end(0xf7)]
struct Utf8Byte(u8);

(the name of that type is still up for bikeshedding)

I am a bit confused now about what this FCP is actually deciding, if/when it completes.

crlf0710 · 2020-04-26T02:07:15Z

Mmm, then maybe it makes sense to also remove language-level UB for non Unicode scalar chars too?

programmerjake · 2020-04-26T04:49:19Z

> Mmm, then maybe it makes sense to also remove language-level UB for non Unicode scalar chars too?

Rust does exploit the limited values that can be stored in a char:

Here, returning None for Option<char> is the same as returning 0x110000u32 which is one past the largest unicode scalar value.
https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=ad1776519737e67fbe35bf76bb8451c4
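
The effect can also be observed through type sizes. This is a sketch of what current rustc is expected to do, not a layout guarantee being relied on; the function names are hypothetical.

```rust
/// `Option<char>` can store `None` in a bit pattern that is not a
/// valid `char`, so it needs no extra space for a discriminant.
pub fn option_char_is_free() -> bool {
    std::mem::size_of::<Option<char>>() == std::mem::size_of::<char>()
}

/// Every bit pattern of `u32` is valid, so `Option<u32>` needs
/// additional space for the discriminant.
pub fn option_u32_needs_tag() -> bool {
    std::mem::size_of::<Option<u32>>() > std::mem::size_of::<u32>()
}
```

This is the kind of concrete benefit a validity invariant buys for `char`, and it is what the discussion above finds missing for `str`.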

SimonSapin · 2020-04-26T12:25:52Z

@RalfJung I think there’s consensus that such a definition for Utf8Byte would not be wrong or harmful. It’s much less clear (at least to me) that it’s actually useful. What actual optimization would this rustc_layout_scalar_valid_range_end (potentially) enable, given that str only ever exists behind a pointer indirection such as &str?

Process-wise, my understanding is that an FCP finishing means accepting a proposal as it was when that FCP was proposed. If new information or consensus emerges later (especially if it’s after some votes) and a team member feels the original proposal should not be accepted, they should file a concern or cancel the FCP. (And potentially propose a new FCP for a different proposal.)

Nemo157 · 2020-04-28T11:07:28Z

If there are no potential optimizations to apply, it seems better to completely remove the validity invariant as proposed, to allow temporarily using the backing buffer from str::as_bytes_mut with less manually checked safety. A non-compiler-checked validity invariant on those bytes makes working with an &mut str much more error-prone.

Alternatively, exposing Utf8Byte, an unsafe fn as_utf8_bytes_mut(&mut self) -> &mut [Utf8Byte], some checked operations for Utf8Byte, and some way to wrap an arbitrary &mut [u8] into an &mut [Utf8Byte] (I'm not sure if pre-validating then transmuting this would be valid even with repr(transparent) with the additional layout constraints); all together would allow building useful APIs on &mut str with less unsafe code.

nikomatsakis · 2020-04-28T16:55:49Z

I agree with @SimonSapin. I would not bother with the "encoding units" myself, which seems like an extra layer of complexity that doesn't add much of practical value.

rfcbot · 2020-05-02T16:44:44Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

RalfJung · 2020-05-02T18:12:50Z

So the FCP that passed means the decision was that str is like [u8]?

nikomatsakis · 2020-05-11T15:59:58Z

I believe so.

RalfJung mentioned this issue Apr 11, 2020

remove language-level UB for non-UTF-8 str rust-lang/reference#792

Merged

jonas-schievink added the A-Unicode, T-lang, and C-enhancement labels Apr 11, 2020

rfcbot added the proposed-final-comment-period and disposition-merge labels Apr 11, 2020

Centril added the I-nominated label Apr 11, 2020

rfcbot added the final-comment-period label and removed the proposed-final-comment-period label Apr 22, 2020

joshtriplett removed the I-nominated label Apr 23, 2020

rfcbot added the finished-final-comment-period label and removed the final-comment-period label May 2, 2020

joshtriplett closed this as completed in rust-lang/reference#792 May 11, 2020

Mark-Simulacrum mentioned this issue Jul 15, 2020

Update docs for str::as_bytes_mut. #74381

Merged

RalfJung mentioned this issue Oct 10, 2020

Validity of references: Memory-related properties rust-lang/unsafe-code-guidelines#77

Closed

eddyb mentioned this issue Jul 30, 2021

Use ValTree in all type level constants #83234

Closed

paulocsanz mentioned this issue Dec 13, 2022

Safety + Miri paulocsanz/arraystring#14

Merged

dtolnay mentioned this issue Feb 15, 2023

Treat str as containing [u8] for auto trait purposes #107941

Merged

Maxdamantus mentioned this issue Oct 21, 2023

Clarity on units of string WebAssembly/stringref#65

Open

zachs18 mentioned this issue Dec 21, 2024

Update/relax str/String utf8 safety docs #134598

Open
