add core::char::DecodeUtf8 #33907

strake · 2016-05-27T16:53:34Z

rust-highfive · 2016-05-27T16:53:48Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

tbu- · 2016-05-30T07:44:02Z

It would be more flexible to make the iterator's item type Option<char>, so the user of this API can decide what to do on bad characters: Erroring out, replacing it with the replacement character (.unwrap_or(replacement)), ...

tbu- · 2016-05-30T07:44:57Z

src/libcore/char.rs

+    type Item = char;
+    #[inline] fn next(&mut self) -> Option<char> {
+        let repl = '\u{FFFD}';
+        self.0.next().map(|b| if b & 0x80 == 0 { unsafe { from_u32_unchecked(b as u32) } } else {


This can just be b as char, no need for the unsafe from_u32_unchecked.

alexcrichton · 2016-05-31T03:58:59Z

src/libcore/char.rs

+
+#[unstable(feature = "decode_utf8", issue = "33906")]
+impl<I: Iterator<Item = u8>> Iterator for DecodeUtf8<I> {
+    type Item = Option<char>;


Idiomatically we tend to prefer to return a Result rather than an Option to indicate an error. This helps us give an entry point for an opaque struct to attach error information to.

alexcrichton · 2016-05-31T04:00:43Z

Thanks @strake! Could you also reexport this type from librustc_unicode? That way it'll make its way into std::char

strake · 2016-06-01T06:48:07Z

@alexcrichton All done. Not sure whether github notified you when i pushed...

tbu- · 2016-06-01T07:50:59Z

(@strake Github doesn't notify on pushes.)

alexcrichton · 2016-06-01T19:54:52Z

src/libcore/char.rs

+/// `<DecodeUtf8 as Iterator>::next` returns this for an invalid input sequence.
+#[unstable(feature = "decode_utf8", issue = "33906")]
+#[derive(Debug)]
+pub struct InvalidSequence;


Could this be struct InvalidSequence(()) so we can backwards-compatibly add more variants? You may want to also derive PartialEq for it so assert_eq! can work.

alexcrichton · 2016-06-01T19:55:17Z

Thanks @strake!

strake · 2016-06-02T03:55:20Z

@alexcrichton Done. The check failed on Travis but passes on my machine; the failure seems to be in APT, not due to my modifications.

alexcrichton · 2016-06-02T06:08:10Z

Ah yeah travis is unfortunately having issues as of late, but otherwise looks good to me. This is tagged with T-libs to ensure it comes up during triage, and we'll discuss there!

alexcrichton · 2016-06-21T04:33:44Z

Thanks again for the PR @strake! The libs team got a chance to discuss this today and there were a few concerns, but the overall feeling was somewhat positive.

Would it be possible to reuse code between std::str and here somehow to avoid open-coding decoding utf-8 twice?
Do you have a particular use case in mind for this functionality beyond simply consistency with other APIs? (which is of course still good, however)

strake · 2016-06-21T08:33:24Z

Would it be possible to reuse code between std::str and here somehow to avoid open-coding decoding utf-8 twice?

All the methods i see in str take a &[u8] or a slice::Iter rather than an arbitrary Iterator, and i'd prefer to take an arbitrary Iterator.

Do you have a particular use case in mind for this functionality beyond simply consistency with other APIs? (which is of course still good, however)

Well, i'm writing a text editor which must deal with strings (e.g. file paths) which are UTF-8-coded by default but potentially invalid, in which case it ought to show the user what valid subsequences of the path it can. (I consider crashing on allocational failure ill-behavior in a text editor, so i am not using std.)

alexcrichton · 2016-06-21T14:21:10Z

Would it be possible to refactor a bit to share an implementation? It's not obvious unfortunately that this is correct as it seems some subtle logic is happening here, but it'd be nice to lean on the existing code.

Also just out of curiosity, but how come str::from_utf8 won't suffice for this use case? Presumably the bytes are held in memory somewhere?

strake · 2016-06-21T18:29:19Z

It's not obvious unfortunately that this is correct as it seems some subtle logic is happening here

I actually since found some helper functions in str we can use here. I think it clarifies the DecodeUtf8::next code. Sorry i missed this last time.

how come str::from_utf8 won't suffice for this use case?

If its input has any invalid sequences, str::from_utf8 fails completely. I need to decode the valid parts of a byte string or byte stream which potentially includes invalid sequences.

bors · 2016-06-24T14:18:09Z

☔ The latest upstream changes (presumably #34399) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2016-06-24T16:14:08Z

Hm yeah it's probably good to use at least some helper functions, but I believe we on the libs team were thinking more of wholesale using an implementation in str or something like that. It looks like there's still plenty of opportunity to go wrong in the DecodeUtf8 iterator as a significant amount of it's still open-coded.

It may not be the case that it's possible to share as str::from_utf8 is so specialized, but it was a passing thought of ours. In general we've avoided util modules, so if it comes to that it's probably not worth sharing if it's still open coded as much as it is still.

strake · 2016-06-25T02:20:08Z

It may not be the case that it's possible to share as str::from_utf8 is so specialized, but it was a passing thought of ours.

Yeah, i couldn't see how to so modify run_utf8_validation as it is so specialized to to validation rather than decoding, and next_code_point has a comment saying "Performance is sensitive to the exact formulation here" so i dare not modify it.

tbu- · 2016-06-25T08:52:03Z

The code does not decode UTF-8 according to the spec: https://encoding.spec.whatwg.org/#utf-8-decoder. It eats characters after an invalid byte sequence and does not check for overlong encodings. I suggest you to just implement the described algorithm, that also makes it easier to verify it's doing the right thing.

strake · 2016-06-28T03:55:42Z

I modified it to not eat characters and to check for too-long sequences. It now returns the reason why a sequence is invalid.

I looked at the algorithm on that page, and it seems more complicated, having more internal state, and is not of the same form as Iterator::next, potentially returning multiple tokens or "continue".

alexcrichton · 2016-07-05T18:10:20Z

src/libcore/char.rs

+    type Item = Result<(u32, usize), ()>;
+    #[inline]
+    fn next(&mut self) -> Option<Result<(u32, usize), ()>> {
+        self.0.next().map(|b| if b & 0x80 == 0 { Ok((b as u32, 1)) } else {


Stylistically we tend to have multi-line closures always have braces around them, putting this if on the next line.

You could make this similarly intended by doing:

foo.map(|b| { if condition { return Ok(...) } ... })

alexcrichton · 2016-07-05T18:11:41Z

@rust-lang/libs thoughts on having another utf-8 decoding loop by hand in libcore? It seems like our previous ones don't quite apply, and if it at least clearly follows a spec it won't exactly change that often so perhaps not the worst!

tbu- · 2016-07-05T18:41:46Z

I think it shouldn't return precise errors. This makes it hard to change the underlying algorithm and is probably not useful to the user of the API anyway.

brson · 2016-07-05T18:45:46Z

@alexcrichton I don't really care about having yet another UTF-8 decoder. The one test looks light - @strake does the test cover all the cases?

strake · 2016-07-06T00:52:53Z

Braced closure
No more precise errors
More tests, which i think cover all the cases now

BurntSushi · 2016-07-06T10:40:11Z

src/libcoretest/char.rs

+                       ("�", &[0xFEu8] as &[u8]),
+                       ("�A", &[0xFEu8, 0x41u8] as &[u8]),
+                       ("�", &[0xFFu8] as &[u8]),
+                       ("�A", &[0xFFu8, 0x41u8] as &[u8])].into_iter() {


Are any of these test cases checking for overlong sequences?

Yes, [0xC1, 0x81] is an overlong sequence for 'A'.

aturon · 2016-07-13T03:34:36Z

@alexcrichton I don't object to having an additional hand-written loop. I expect us to grow this kind of micro-optimization over time in std.

alexcrichton · 2016-07-13T17:56:49Z

Ok, @strake could you also squash the commits down? After that I'll r+

strake · 2016-07-14T01:44:26Z

@alexcrichton yes, done

alexcrichton · 2016-07-14T06:31:53Z

@bors: r+ 837029f

Thanks!

bors · 2016-07-14T09:45:29Z

⌛ Testing commit 837029f with merge 6998018...

add core::char::DecodeUtf8 See [issue](#33906)

bors · 2016-07-14T12:42:56Z

rust-highfive assigned alexcrichton May 27, 2016

strake mentioned this pull request May 27, 2016

make core::str::next_code_point work on arbitrary iterator #33896

Merged

tbu- reviewed May 30, 2016
View reviewed changes

alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 31, 2016

alexcrichton reviewed May 31, 2016
View reviewed changes

alexcrichton reviewed Jun 1, 2016
View reviewed changes

strake force-pushed the decode_utf8 branch from 4ff8e56 to 2df2b82 Compare June 2, 2016 02:41

strake force-pushed the decode_utf8 branch from 8880e32 to 07bcac9 Compare June 22, 2016 01:35

strake force-pushed the decode_utf8 branch from 2a38a43 to e36a028 Compare June 25, 2016 02:17

alexcrichton reviewed Jul 5, 2016
View reviewed changes

BurntSushi reviewed Jul 6, 2016
View reviewed changes

add core::char::DecodeUtf8

837029f

strake force-pushed the decode_utf8 branch from c3344af to 837029f Compare July 14, 2016 01:43

bors added a commit that referenced this pull request Jul 14, 2016

Auto merge of #33907 - strake:decode_utf8, r=alexcrichton

6998018

add core::char::DecodeUtf8 See [issue](#33906)

bors merged commit 837029f into rust-lang:master Jul 14, 2016

martinhath mentioned this pull request Jan 12, 2017

std::path::Path::Display allocates internally #38879

Closed

strake deleted the decode_utf8 branch October 29, 2017 08:54

add core::char::DecodeUtf8 #33907

add core::char::DecodeUtf8 #33907

Uh oh!

Conversation

strake commented May 27, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rust-highfive commented May 27, 2016

Uh oh!

tbu- commented May 30, 2016

Uh oh!

tbu- May 30, 2016

Choose a reason for hiding this comment

Uh oh!

alexcrichton May 31, 2016

Choose a reason for hiding this comment

Uh oh!

alexcrichton commented May 31, 2016

Uh oh!

strake commented Jun 1, 2016

Uh oh!

tbu- commented Jun 1, 2016

Uh oh!

alexcrichton Jun 1, 2016

Choose a reason for hiding this comment

Uh oh!

alexcrichton commented Jun 1, 2016

Uh oh!

strake commented Jun 2, 2016

Uh oh!

alexcrichton commented Jun 2, 2016

Uh oh!

alexcrichton commented Jun 21, 2016

Uh oh!

strake commented Jun 21, 2016

Uh oh!

alexcrichton commented Jun 21, 2016

Uh oh!

strake commented Jun 21, 2016

Uh oh!

bors commented Jun 24, 2016

Uh oh!

alexcrichton commented Jun 24, 2016

Uh oh!

strake commented Jun 25, 2016

Uh oh!

tbu- commented Jun 25, 2016

Uh oh!

strake commented Jun 28, 2016

Uh oh!

alexcrichton Jul 5, 2016

Choose a reason for hiding this comment

Uh oh!

alexcrichton commented Jul 5, 2016

Uh oh!

tbu- commented Jul 5, 2016

Uh oh!

brson commented Jul 5, 2016

Uh oh!

strake commented Jul 6, 2016

Uh oh!

BurntSushi Jul 6, 2016

Choose a reason for hiding this comment

Uh oh!

strake Jul 6, 2016

Choose a reason for hiding this comment

Uh oh!

aturon commented Jul 13, 2016

Uh oh!

alexcrichton commented Jul 13, 2016

Uh oh!

strake commented Jul 14, 2016

Uh oh!

alexcrichton commented Jul 14, 2016

Uh oh!

bors commented Jul 14, 2016

Uh oh!

bors commented Jul 14, 2016

Uh oh!

Uh oh!

strake commented May 27, 2016 •

edited

Loading