Move OsStr::slice_encoded_bytes validation to platform modules #118569
@@ -885,15 +885,43 @@ fn decode_surrogate_pair(lead: u16, trail: u16) -> char {
     unsafe { char::from_u32_unchecked(code_point) }
 }
 
-/// Copied from core::str::StrPrelude::is_char_boundary
+/// Copied from str::is_char_boundary
 #[inline]
 pub fn is_code_point_boundary(slice: &Wtf8, index: usize) -> bool {
-    if index == slice.len() {
+    if index == 0 {
         return true;
     }
     match slice.bytes.get(index) {
-        None => false,
-        Some(&b) => b < 128 || b >= 192,
+        None => index == slice.len(),
+        Some(&b) => (b as i8) >= -0x40,

Why did this implementation change? (In particular it seems like the behavior is no longer an unconditional true for index = 0, and also doesn't correspond with the str::is_char_boundary impl?) The [...]

This is identical to the [...] I can't use [...]

Ah, so it is. We should think about exposing that as a public API; it seems consistent with the is_ascii_ functions we already expose that have similar bit-twiddling internally.

     }
 }
 
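The review question above turns on whether `(b as i8) >= -0x40` is the same test as `b < 128 || b >= 192`. The following is a minimal sketch, not the standard library code: it works on a plain `&[u8]` instead of `Wtf8`, and the helper names `is_boundary_byte_old` and `is_boundary_byte_new` are made up for illustration. Continuation bytes `0x80..=0xBF` reinterpret as `-128..=-65` when cast to `i8`, so a single signed comparison rejects exactly those bytes.

```rust
/// Old form of the byte test from the hunk: ASCII byte or multi-byte lead byte.
fn is_boundary_byte_old(b: u8) -> bool {
    b < 128 || b >= 192
}

/// New form of the byte test from the hunk. Continuation bytes 0x80..=0xBF
/// become -128..=-65 as i8, so they are the only bytes below -0x40 (-64).
fn is_boundary_byte_new(b: u8) -> bool {
    (b as i8) >= -0x40
}

/// Hypothetical stand-in for the new `is_code_point_boundary`, taking a
/// byte slice rather than the internal `Wtf8` type.
fn is_code_point_boundary(bytes: &[u8], index: usize) -> bool {
    if index == 0 {
        return true;
    }
    match bytes.get(index) {
        // Past the last byte: a boundary only at exactly the end.
        None => index == bytes.len(),
        Some(&b) => is_boundary_byte_new(b),
    }
}

/// Exhaustively compare the two byte tests over all 256 byte values.
fn byte_tests_agree() -> bool {
    (0..=255u8).all(|b| is_boundary_byte_old(b) == is_boundary_byte_new(b))
}
```

On valid UTF-8 input this sketch should agree with `str::is_char_boundary` at every index, including `0`, `len`, and indices past the end, which is the correspondence the reply in the thread refers to.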
+/// Verify that `index` is at the edge of either a valid UTF-8 codepoint
+/// (i.e. a codepoint that's not a surrogate) or of the whole string.
+///
+/// These are the cases currently permitted by `OsStr::slice_encoded_bytes`.
+/// Splitting between surrogates is valid as far as WTF-8 is concerned, but
+/// we do not permit it in the public API because WTF-8 is considered an
+/// implementation detail.
+#[track_caller]
+#[inline]
+pub fn check_utf8_boundary(slice: &Wtf8, index: usize) {

I'm confused how this is different from the is_code_point_boundary (from just the method name/comments)?
I'm worried that it'll be easy to use the wrong function so I think some detail on when we should use each in comments would be good. It's also a bit worrying to me that we want a new function since that feels like it implies we're changing behavior rather than just optimizing here?

I'm posting a longer explanation below, but the gist of it is that there are some WTF-8 codepoints that are not UTF-8 codepoints. If the string is pure UTF-8 then the boundaries are the same. I've tweaked the comment a little.
It's not a behavioral change; the old function wasn't used for this functionality to begin with. The cases in which this implementation panics should be the same as those in which the old one does.

+    if index == 0 {
+        return;
+    }
+    match slice.bytes.get(index) {
+        Some(0xED) => (), // Might be a surrogate
+        Some(&b) if (b as i8) >= -0x40 => return,
+        Some(_) => panic!("byte index {index} is not a codepoint boundary"),
+        None if index == slice.len() => return,
+        None => panic!("byte index {index} is out of bounds"),
+    }
+    if slice.bytes[index + 1] >= 0xA0 {
+        // There's a surrogate after index. Now check before index.
+        if index >= 3 && slice.bytes[index - 3] == 0xED && slice.bytes[index - 2] >= 0xA0 {
+            panic!("byte index {index} lies between surrogate codepoints");
+        }
+    }
+}
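The function above detects a surrogate by a lead byte of `0xED` followed by a second byte `>= 0xA0`. As a rough illustration of why that pair of checks is enough, the sketch below encodes every code point in `U+0800..=U+FFFF` with the three-byte generalized UTF-8 layout that WTF-8 uses, and compares the byte test against the surrogate range `U+D800..=U+DFFF`. The helper names (`encode_three_bytes`, `looks_like_surrogate`, `surrogate_test_is_exact`) are invented for this example and are not part of the PR.

```rust
/// Encode a code point in U+0800..=U+FFFF as three bytes using the
/// generalized UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx). Unlike
/// `char::encode_utf8`, this also accepts surrogates, as WTF-8 does.
fn encode_three_bytes(cp: u32) -> [u8; 3] {
    assert!((0x0800..=0xFFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

/// The byte-level test from the hunk: lead byte 0xED, second byte >= 0xA0.
fn looks_like_surrogate(bytes: [u8; 3]) -> bool {
    bytes[0] == 0xED && bytes[1] >= 0xA0
}

/// Check that the byte test matches the surrogate range for every
/// code point that has a three-byte encoding.
fn surrogate_test_is_exact() -> bool {
    (0x0800..=0xFFFFu32).all(|cp| {
        let is_surrogate = (0xD800..=0xDFFF).contains(&cp);
        looks_like_surrogate(encode_three_bytes(cp)) == is_surrogate
    })
}
```

The lead byte `0xED` covers `U+D000..=U+DFFF`, and the second byte splits that block at `U+D800`: `U+D7FF` encodes as `ED 9F BF`, while `U+D800` encodes as `ED A0 80`.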
What makes this a public boundary? Is there a private boundary?
I think it would be good to add some docs here if we're adding an extension point -- perhaps a couple lines of comments describing what the function should (roughly) do?

The public boundaries are where we let users split without panicking. The private boundaries would depend on the safety invariants for the internal encoding. For example, if you split in the middle of a WTF-8 codepoint you can cause out-of-bounds reads, so that's neither a public nor a private boundary. But if you split between surrogate codepoints then that's fine as far as the implementation is concerned; we just don't allow users to do that, so that's a private boundary but not a public boundary.
I've added a comment, good call.
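
The three cases described in this reply can be sketched as a classifier over raw bytes. This is only an illustration under stated assumptions: the input is taken to be well-formed WTF-8, the `Boundary` enum and `classify` function are hypothetical names that do not appear in the PR, and the real code panics instead of returning a value.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Boundary {
    /// Users may split here: the edge of the string or of a non-surrogate codepoint.
    Public,
    /// Safe for the internal encoding, but between two surrogate codepoints.
    PrivateOnly,
    /// Inside a codepoint or past the end of the string.
    Invalid,
}

/// Hypothetical classifier; assumes `bytes` is well-formed WTF-8.
fn classify(bytes: &[u8], index: usize) -> Boundary {
    if index == 0 || index == bytes.len() {
        return Boundary::Public;
    }
    let b = match bytes.get(index) {
        Some(&b) => b,
        None => return Boundary::Invalid, // past the end
    };
    if (b as i8) < -0x40 {
        return Boundary::Invalid; // continuation byte: middle of a codepoint
    }
    // A surrogate is encoded as 0xED followed by a byte >= 0xA0.
    let surrogate_after =
        b == 0xED && matches!(bytes.get(index + 1), Some(&next) if next >= 0xA0);
    let surrogate_before =
        index >= 3 && bytes[index - 3] == 0xED && bytes[index - 2] >= 0xA0;
    if surrogate_after && surrogate_before {
        Boundary::PrivateOnly
    } else {
        Boundary::Public
    }
}
```

For example, in the bytes `61 ED B8 80 ED A0 BD 62` (the letter `a`, a trail surrogate, a lead surrogate, the letter `b`), index 4 falls between the two surrogates and is classified as `PrivateOnly`, index 2 is `Invalid`, and indices 1 and 7 are `Public` because each touches a non-surrogate codepoint.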