-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add core::char::DecodeUtf8 #33907
add core::char::DecodeUtf8 #33907
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
It would be more flexible to make the iterator's item type |
type Item = char; | ||
#[inline] fn next(&mut self) -> Option<char> { | ||
let repl = '\u{FFFD}'; | ||
self.0.next().map(|b| if b & 0x80 == 0 { unsafe { from_u32_unchecked(b as u32) } } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be b as char
, no need for the unsafe from_u32_unchecked
.
|
||
#[unstable(feature = "decode_utf8", issue = "33906")] | ||
impl<I: Iterator<Item = u8>> Iterator for DecodeUtf8<I> { | ||
type Item = Option<char>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idiomatically we tend to prefer to return a Result
rather than an Option
to indicate an error. This helps us give an entry point for an opaque struct to attach error information to.
Thanks @strake! Could you also reexport this type from librustc_unicode? That way it'll make its way into |
@alexcrichton All done. Not sure whether github notified you when i pushed... |
(@strake Github doesn't notify on pushes.) |
/// `<DecodeUtf8 as Iterator>::next` returns this for an invalid input sequence. | ||
#[unstable(feature = "decode_utf8", issue = "33906")] | ||
#[derive(Debug)] | ||
pub struct InvalidSequence; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be struct InvalidSequence(())
so we can backwards-compatibly add more variants? You may want to also derive PartialEq
for it so assert_eq!
can work.
Thanks @strake! |
@alexcrichton Done. The check failed on Travis but passes on my machine; the failure seems to be in APT, not due to my modifications. |
Ah yeah travis is unfortunately having issues as of late, but otherwise looks good to me. This is tagged with T-libs to ensure it comes up during triage, and we'll discuss there! |
Thanks again for the PR @strake! The libs team got a chance to discuss this today and there were a few concerns, but the overall feeling was somewhat positive.
|
All the methods i see in
Well, i'm writing a text editor which must deal with strings (e.g. file paths) which are UTF-8-coded by default but potentially invalid, in which case it ought to show the user what valid subsequences of the path it can. (I consider crashing on allocational failure ill-behavior in a text editor, so i am not using |
Would it be possible to refactor a bit to share an implementation? It's not obvious unfortunately that this is correct as it seems some subtle logic is happening here, but it'd be nice to lean on the existing code. Also just out of curiosity, but how come |
I actually since found some helper functions in
If its input has any invalid sequences, |
☔ The latest upstream changes (presumably #34399) made this pull request unmergeable. Please resolve the merge conflicts. |
Hm yeah it's probably good to use at least some helper functions, but I believe we on the libs team were thinking more of wholesale using an implementation in It may not be the case that it's possible to share as |
Yeah, i couldn't see how to so modify |
The code does not decode UTF-8 according to the spec: https://encoding.spec.whatwg.org/#utf-8-decoder. It eats characters after an invalid byte sequence and does not check for overlong encodings. I suggest you to just implement the described algorithm, that also makes it easier to verify it's doing the right thing. |
I modified it to not eat characters and to check for too-long sequences. It now returns the reason why a sequence is invalid. I looked at the algorithm on that page, and it seems more complicated, having more internal state, and is not of the same form as Iterator::next, potentially returning multiple tokens or "continue". |
type Item = Result<(u32, usize), ()>; | ||
#[inline] | ||
fn next(&mut self) -> Option<Result<(u32, usize), ()>> { | ||
self.0.next().map(|b| if b & 0x80 == 0 { Ok((b as u32, 1)) } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically we tend to have multi-line closures always have braces around them, putting this if
on the next line.
You could make this similarly intended by doing:
foo.map(|b| {
if condition {
return Ok(...)
}
...
})
@rust-lang/libs thoughts on having another utf-8 decoding loop by hand in libcore? It seems like our previous ones don't quite apply, and if it at least clearly follows a spec it won't exactly change that often so perhaps not the worst! |
I think it shouldn't return precise errors. This makes it hard to change the underlying algorithm and is probably not useful to the user of the API anyway. |
@alexcrichton I don't really care about having yet another UTF-8 decoder. The one test looks light - @strake does the test cover all the cases? |
|
("�", &[0xFEu8] as &[u8]), | ||
("�A", &[0xFEu8, 0x41u8] as &[u8]), | ||
("�", &[0xFFu8] as &[u8]), | ||
("�A", &[0xFFu8, 0x41u8] as &[u8])].into_iter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of these test cases checking for overlong sequences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, [0xC1, 0x81] is an overlong sequence for 'A'.
@alexcrichton I don't object to having an additional hand-written loop. I expect us to grow this kind of micro-optimization over time in std. |
Ok, @strake could you also squash the commits down? After that I'll r+ |
@alexcrichton yes, done |
⌛ Testing commit 837029f with merge 6998018... |
add core::char::DecodeUtf8 See [issue](#33906)
See issue