Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove language-level UB for non-UTF-8 str #71033

Closed
RalfJung opened this issue Apr 11, 2020 · 45 comments · Fixed by rust-lang/reference#792
Closed

remove language-level UB for non-UTF-8 str #71033

RalfJung opened this issue Apr 11, 2020 · 45 comments · Fixed by rust-lang/reference#792
Labels
A-Unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@RalfJung
Copy link
Member

This is the Rust-side issue for rust-lang/reference#792 just so that we can use fcpbot. The change description follows.

Ever since Rust 1.0, the reference said that a non-UTF-8 str causes immediate UB. In terms of today's terminology, that means that str has a validity invariant of being valid UTF-8.

However, that seems unnecessary: the compiler does not actually exploit this, nor is there any clear way it could exploit this. Making UTF-8 a library-level safety invariant is more than enough for everything str does. Most likely, it was made a validity invariant because we had not yet properly teased apart those two concepts when the document was initially written.

This is also the conclusion that the UCG WG arrived at in rust-lang/unsafe-code-guidelines#78.

I therefore propose we remove the UTF-8 clause from the language spec, so that str will have the same validity invariant as [u8].

@lcnr
Copy link
Contributor

lcnr commented Apr 11, 2020

Would this mean that the following stops being UB?

use std::mem;

fn main() {
    // Being valid utf8 is still a safety invariant of `str`.
    // As any method using `str` may depend on this invariant,
    // it would still be UB to use `str::from_utf8_unchecked`
    // or `str::as_bytes` here. 
    let s: &str = unsafe { std::mem::transmute(b"\xff\xff" as &[u8]) };
    let bytes: &[u8] = unsafe { std::mem::transmute(s) };
    assert_eq!(bytes, &[0xff, 0xff][..]);
}

@jonas-schievink jonas-schievink added A-Unicode Area: Unicode T-lang Relevant to the language team, which will review and decide on the PR/issue. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Apr 11, 2020
@RalfJung
Copy link
Member Author

@lcnr yes exactly.

it would still be UB to use str::from_utf8_unchecked
or str::as_bytes here.

To be more precise, it would be library UB -- the way the methods are implemented right now, there is actually no language-level UB and nothing that Miri could possibly find, but the library is permitted to change in the future in ways that would make this language UB.

@Centril
Copy link
Contributor

Centril commented Apr 11, 2020

Dear community and language team.

As Ralf notes, there are no compelling reasons to keep this Undefined Behavior (UB) at the level of the abstract machine in terms of the validity invariant of str. Therefore, keeping it UB at this level only complicates the language definition instead with no notable benefits.

Instead, we can make this a library invariant, and leave it as "library UB" or "unspecified behavior". Indeed, this is probably what we always meant by the note in the reference. I hereby propose that we accept that new definition:

@rfcbot merge

@rfcbot
Copy link

rfcbot commented Apr 11, 2020

Team member @Centril has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Apr 11, 2020
@eddyb
Copy link
Member

eddyb commented Apr 12, 2020

there is actually no language-level UB and nothing that Miri could possibly find

This leads to an interesting question: what can miri find?
My first guess would be some unsafe { unreachable_unchecked() } call in UTF-8 decoding.

It would be great to figure out if there is something like this that miri does detect, even if miri stops checking str values for UTF-8 validity altogether, and use it as a test case.

@RalfJung
Copy link
Member Author

It would be great to figure out if there is something like this that miri does detect, even if miri stops checking str values for UTF-8 validity altogether, and use it as a test case.

Note that Miri does not check behind references, so while str would be checked, that type is basically unused, and &str is not checked.

@RalfJung
Copy link
Member Author

This leads to an interesting question: what can miri find?
My first guess would be some unsafe { unreachable_unchecked() } call in UTF-8 decoding.

I scrolled over str/mod.rs to see how the invariant gets used. I certainly missed some things, but here is what stood out:

  • Calling char::from_u32_unchecked, which must be a valid unicode codepoint. Miri checks this.
  • The searching/splitting talks a lot about indices being at unicode boundaries. I do not know what happens if they are not.
  • I suspect somewhere it might also lead to out-of-bounds accesses when it thinks there is a 4-byte character following, but only 2 bytes are left in the buffer.
  • I did not find any unreachable/unreachable_unchecked.

@joshtriplett
Copy link
Member

This seems entirely reasonable to me. If you never call any of str's functions, just storing non-UTF-8 in it shouldn't cause any issue.

I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.

cc @BurntSushi: Would this change potentially simplify bstr?

@RalfJung
Copy link
Member Author

I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.

Yes I think that is definitely possible. It is somewhat similar to, for example, how we promise that Vec::push will keep existing pointers working if it does not reallocate (though that is AFAIK not very clearly documented): library methods can make extra promises beyond the ways they could be used in safe code.

@steveklabnik
Copy link
Member

Shouldn't this be an RFC, rather than just an issue on the Rust tracker?

@Centril
Copy link
Contributor

Centril commented Apr 13, 2020

The RFC format feels excessive as a matter of "filling in the forms" (I don't know that it would add anything). The design choices here aren't too complicated and it doesn't feel like there's any opposition thus far to the change here, given that LLVM et. al cannot exploit this. The FCP should provide the same chance of community input here I feel given the usual coverage in TWiR etc.

@nikomatsakis
Copy link
Contributor

I think the main point is that all the library's routines are available as public routines to be invoked on any &str. So this means that having a str which is exposed to safe code must meet those invariants. This doesn't necessarily have to come from a validity invariant, but it is still a core propery of str.

@BurntSushi
Copy link
Member

I think this change makes a lot of sense. I've always kind of wondered why str's UTF-8 invariant was defined at the language level. Although, since this is moving a language invariant to a library invariant, it kind of feels like the libs team should be consulted? cc @rust-lang/libs And in particular, it would be great to hear from @SimonSapin

Also, are there any changes the str API docs in std that need to be updated as a result of this change? (It looks like the current docs don't really mention UB much here.)

cc @BurntSushi: Would this change potentially simplify bstr?

Possibly, but it's tricky business. For example, one might reasonably exclude substring searching from UB if neither the needle nor the haystack were valid UTF-8. Then bstr could use std's substring searching instead of rolling its own. But, bstr already has a substring searcher, so it's somewhat already been paid.

bstr would still need to roll UTF-8 decoding to implement its own version of chars, for example, since I chars is probably a method that you want to keep UB when str isn't valid UTF-8.

@SimonSapin
Copy link
Contributor

This change sounds reasonable to me from the point of view of how the language is defined. I don’t expect it will have much or any practical impact for users of the language.

@scottmcm
Copy link
Member

scottmcm commented Apr 16, 2020

I assume this would also be a stepping-stone to allowing str to become just struct str([u8]);?

@rfcbot reviewed

@scottmcm
Copy link
Member

scottmcm commented Apr 16, 2020

Hmm, actually, we talked about this in the lang meeting today and Josh checked and UTF-8 encoding units do have a niche that 11111xxx is never used. So one possibility would be to add a new Utf8EncodingUnit type with that as a validity invariant (like we have a validity invariant on NonZeroU8), and say that str is a newtype over [Utf8EncodingUnit].

However, this proposed change would prevent us from doing that in future, as I'd be allowed to make a slice full of 255_u8 in unsafe code and call it a str, which would violate that hypothetical validity invariant, so I wanted to bring it up. [Edit: Oh, in fact such an example is in the first reply to this thread. If that were made legal it could not be reasonably become UB again in future.]

I do strongly agree that the cross-byte UTF-8 well-formedness should not be in the validity invariant.

@eddyb
Copy link
Member

eddyb commented Apr 16, 2020

@scottmcm Hmm, so 0xf8..=0xff are invalid UTF-8 bytes?

I haven't tried it yet but this should work today:

#[rustc_layout_scalar_valid_range_end(0xf7)]
struct Utf8EncodingUnit(u8);

I'm not sure if we can take advantage of this in any way, but it might be possible to force miri into validating it without any special-casing (just based on the validity range).

@Nemo157
Copy link
Member

Nemo157 commented Apr 16, 2020

Making even that guarantee would presumably make any use of str::as_bytes_mut as UB as it probably is today (I believe that after the proposed changes it would become non-UB to use as long as you fix the bytes before the &mut [u8] is dropped, which is something that I would like to have code relying on).

EDIT: I guess it would fix the second example in the docs to be non-UB, but not being able to pass the &mut [u8] into other safe code would be very limiting still.

@RalfJung
Copy link
Member Author

I'm not sure if we can take advantage of this in any way, but it might be possible to force miri into validating it without any special-casing (just based on the validity range).

Miri validates rustc_layout_scalar_valid_range_end without any extra work. But only by-value, not behind references.

@SimonSapin
Copy link
Contributor

Can that niche ever be useful? Option<str> cannot exist since str: !Sized

@SimonSapin
Copy link
Contributor

Hmm, so 0xf8..=0xff are invalid UTF-8 bytes?

Correct. Those are 0b11111000..=0b11111111, and per https://tools.ietf.org/html/rfc3629#section-3 all well-formed byte sequences are one of:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(For historical anecdote: Unicode is limited to U+0000..=U+10FFFF because of UTF-16, but the original design of UTF-8 supported up to U+7FFFFFFF (31 bits) with leading bytes 0xf8..=0xfd: https://tools.ietf.org/html/rfc2279#section-2. I think Unicode needing to grow beyond a million assigned code point is not something we need to worry about, given https://www.unicode.org/versions/stats/chart_charbyyear.html. And anyway Rust also limits char to U+10FFFF max.)

@eddyb
Copy link
Member

eddyb commented Apr 16, 2020

@SimonSapin That's not the only thing valid ranges do, we also expose them to LLVM.
So depending on what the str methods look like, they might be able to take advantage of this to consider the values outside of the range impossible, without using unreachable_unchecked.
Specifically, when reading UTF-8 bytes from behind a &str.

@SimonSapin
Copy link
Contributor

I suppose LLVM could then eliminate a branch like str.as_bytes()[0] == 0xff but I have a hard time imagining a situation where it would come up in practice outside of test cases.

@RalfJung
Copy link
Member Author

Sure, as_bytes_mut should mention whatever validity invariant we end up with.

@nikomatsakis
Copy link
Contributor

Actually, @RalfJung, I realize I'm a bit uncertain about how validity invariants interact with borrows -- if I have a local variable that x: T that is mutably borrowed, must the validity invariant for type T hold at every moment, or only when the borrow is ended (by x being used again)? This seems relevant to race conditions and the like.

@nikomatsakis
Copy link
Contributor

Perhaps better to move that off line =)

@RalfJung
Copy link
Member Author

@nikomatsakis I think your question is tracked at rust-lang/unsafe-code-guidelines#84 :)

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 22, 2020
@rfcbot
Copy link

rfcbot commented Apr 22, 2020

🔔 This is now entering its final comment period, as per the review above. 🔔

@Kixunil
Copy link
Contributor

Kixunil commented Apr 23, 2020

Would be great to make extra sure people don't mistake those two definitions of UB and don't start recklessly working with non-utf-8 data in string thinking it's not UB anymore.

@jdahlstrom
Copy link

jdahlstrom commented Apr 24, 2020

A minor point regarding terminology: Aren't these "UTF8 encoding units" called (UTF-8) code units in established Unicode parlance? Which, incidentally, is probably an argument in favor of adding a named type as it would simply reify an existing concept rather than introducing an ad-hoc one!

@scottmcm
Copy link
Member

scottmcm commented Apr 25, 2020

EDIT: Skip my post and read the next one

@jdahlstrom Good point: http://www.unicode.org/glossary/#code_unit

Utf8CodeUnit would be a much better name.

@SimonSapin
Copy link
Contributor

Yes, “code unit” is established in Unicode but the code unit for UTF-8 is u8, not a specialization of u8 that has validity invariants of its own. It’s only for a UTF-8 sequence of bytes (a.k.a. 8-bit code units) that Unicode defines being “well-formed”.

The formal definitions for this start at https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404

@RalfJung
Copy link
Member Author

RalfJung commented Apr 25, 2020

FCP was originally requested for changing the validity invariant of str to that of [u8], but during discussion consensus seems to have shifted towards rather using that of [Utf8Byte] with

#[rustc_layout_scalar_valid_range_end(0xf7)]
struct Utf8Byte(u8);

(the name of that type is still up for bikeshedding)

I am a bit confused now about what this FCP is actually deciding, if/when it completes.

@crlf0710
Copy link
Member

Mmm, then maybe it makes sense to also remove language-level UB for non Unicode scalar chars too?

@programmerjake
Copy link
Member

Mmm, then maybe it makes sense to also remove language-level UB for non Unicode scalar chars too?

Rust does exploit the limited values that can be stored in a char:

Here, returning None for Option<char> is the same as returning 0x110000u32 which is one past the largest unicode scalar value.
https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=ad1776519737e67fbe35bf76bb8451c4

@SimonSapin
Copy link
Contributor

@RalfJung I think there’s consensus that such a definition for Utf8Byte would not be wrong or harmful. It’s much less clear (at least to me) that it’s actually useful. What actual optimization would this rustc_layout_scalar_valid_range_end (potentially) enable, given that str only ever exists behind a pointer indirection such as &str?

Process-wise, my understanding is that an FCP finishing means accepting a proposal as it was when that FCP was proposed. If new information or consensus emerges later (especially if it’s after some votes) and a team member feels the original proposal should not be accepted, they should file a concern or cancel the FCP. (And potentially propose a new FCP for a different proposal.)

@Nemo157
Copy link
Member

Nemo157 commented Apr 28, 2020

If there's no potential optimizations to apply it seems better to completely remove the validity invariant as proposed, to allow temporarily using the backing buffer from str::as_bytes_mut with less manually checked safety. With a non-compiler-checked validity invariant on those bytes it makes working with an &mut str much more error-prone.

Alternatively, exposing Utf8Byte, an unsafe fn as_utf8_bytes_mut(&mut self) -> &mut [Utf8Byte], some checked operations for Utf8Byte, and some way to wrap an arbitrary &mut [u8] into an &mut [Utf8Byte] (I'm not sure if pre-validating then transmuting this would be valid even with repr(transparent) with the additional layout constraints); all together would allow building useful APIs on &mut str with less unsafe code.

@nikomatsakis
Copy link
Contributor

I agree with @SimonSapin. I would not bother with the "encoding units" myself, which seems like an extra layer of complexity that doesn't add much of practical value.

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels May 2, 2020
@rfcbot
Copy link

rfcbot commented May 2, 2020

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@RalfJung
Copy link
Member Author

RalfJung commented May 2, 2020

So the FCP that passed means the decision was that str is like [u8]?

@nikomatsakis
Copy link
Contributor

I believe so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.