Document Unicode complications when iterating "characters" #27012

kornelski · 2015-07-13T11:58:02Z

This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite, the same. I've added edge cases to some examples to demonstrate this.

However, I've kept use of the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.

rust-highfive · 2015-07-13T11:58:19Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @gankro (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

bluss · 2015-07-13T15:29:38Z

src/libcollections/str.rs

@@ -599,7 +599,7 @@ impl str {
    /// # #![feature(str_char, core)]
    /// use std::str::CharRange;
    ///
-    /// let s = "中华Việt Nam";
+    /// let s = "中华Việt Nam";


GitHub's diff rendering is broken, but this is OK in the source.

The diff is not broken, the source was changed to use "decomposed" code points (an ASCII e followed by combining code points, rather than ệ as a single code point) as the "This outputs" change below indicates.

(This change illustrates why code points may not be the unit you want.)

Might depend on the browser? Since it renders the diacritics from the e on top of the t in the after version, I think the diff rendering is broken here.

Ah, yeah, that’s a rendering bug. I thought you were saying there was no change at all there in the source.
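
For readers following along, the composed/decomposed difference discussed above can be made visible with escapes. This is only a sketch: the helper name `scalar_count` is hypothetical, and the decomposed spelling assumes the canonical decomposition of ệ (U+1EC7) into `e`, U+0323, U+0302.

```rust
/// Counts Unicode scalar values (Rust `char`s) in a string.
/// `scalar_count` is a hypothetical helper, not part of the PR.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Precomposed: "ệ" is the single scalar value U+1EC7.
    let composed = "Vi\u{1EC7}t";
    // Decomposed: ASCII 'e' followed by two combining marks.
    let decomposed = "Vie\u{323}\u{302}t";

    // The two strings usually render alike but are different sequences.
    assert_ne!(composed, decomposed);
    println!("{} chars, {} bytes", scalar_count(composed), composed.len());
    println!("{} chars, {} bytes", scalar_count(decomposed), decomposed.len());
}
```

Both strings show one visible "ệ", yet `chars()` yields a different number of items for each, which is why a `char` may not be the unit a reader expects.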

bluss · 2015-07-13T15:49:34Z

I don't really like maximal pedantry either (everything we say is subtly wrong or not the whole story anyway), but I think Unicode Scalar Value is the only correct term for what we mean. Otherwise we could invent “Acceptable UTF-8 code point”. We did in fact already invent a term, which is Rust's char, so I like that.

Unicode standard version 7.0 section 2.4 Code Points and Characters (PDF) is interesting and complicated -- some code points are representable in UTF-8 and some are not, some USVs are abstract characters, some noncharacters and some neither.
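
A small sketch of the code point versus scalar value distinction, using only `std::char::from_u32`. The function name `is_scalar_value` is an assumption made for illustration.

```rust
/// Returns true if `code_point` is a Unicode scalar value, i.e. a value
/// that a Rust `char` can hold. `is_scalar_value` is a hypothetical name.
fn is_scalar_value(code_point: u32) -> bool {
    std::char::from_u32(code_point).is_some()
}

fn main() {
    // U+0041 is an ordinary scalar value.
    println!("U+0041: {}", is_scalar_value(0x41));
    // U+D800 is a surrogate: a code point, but not a scalar value.
    println!("U+D800: {}", is_scalar_value(0xD800));
    // U+FFFF is a noncharacter, yet still a scalar value.
    println!("U+FFFF: {}", is_scalar_value(0xFFFF));
    // Values above U+10FFFF are not code points at all.
    println!("0x110000: {}", is_scalar_value(0x11_0000));
}
```

Surrogates are the only code points excluded here, which is the gap between the two terms being debated.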

kornelski · 2015-07-13T21:22:57Z

@bluss OK, I've reverted the \r example and changed code points to Unicode scalar values. It does sound a bit awkward though.

bluss · 2015-07-13T21:29:03Z

src/libcollections/str.rs

    ///
-    /// If the slice does not contain any characters, None is returned instead.
+    /// If the slice does not contain any Unicode scalar values, None is returned instead.


This could actually just say "if the slice is empty"

bluss · 2015-07-13T21:30:28Z

Yes, I agree, we don't want to read USV everywhere. So we should accept that USVs are a subset of code points(?), so everywhere we say we return a code point etc., we are totally fine?

bluss · 2015-07-13T21:32:09Z

src/libcollections/str.rs

    /// string.
    ///
    /// Due to the design of UTF-8, this operation is `O(end)`. Use slicing
-    /// syntax if you want to use byte indices rather than codepoint indices.
+    /// syntax if you want to use byte indices rather than `char` indices.


Oh this is confusing -- we have an iterator called char_indices() which returns byte offsets from the start of the string paired with char.
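
To illustrate the naming clash, here is a minimal sketch showing that `char_indices()` pairs each `char` with a byte offset, not with its position counted in `char`s. The helper `byte_offsets` is hypothetical.

```rust
/// Collects the (byte offset, char) pairs that `char_indices()` yields.
/// `byte_offsets` is a hypothetical helper used only for this example.
fn byte_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    // 'é' (U+00E9) takes two bytes in UTF-8, so 'b' starts at byte 3.
    let s = "a\u{E9}b";
    for (offset, c) in byte_offsets(s) {
        println!("byte {}: {:?}", offset, c);
    }
    // Counting in chars, 'b' is at position 2; counting in bytes, at 3.
    assert_eq!(s.chars().nth(2), Some('b'));
    assert_eq!(&s[3..], "b");
}
```

So "`char` indices" in the doc comment could be read as either unit, which is the ambiguity raised above.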

kornelski · 2015-07-13T21:46:51Z

OK, I've dialed down USV usage, and clarified the other bits.

bluss · 2015-07-13T21:54:44Z

src/libcollections/str.rs

+    /// For iteration over human-readable characters a [grapheme cluster iterator][1]
+    /// may be more appropriate.
+    ///
+    /// [1]: https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation


This link doesn't actually give the reader any indication where to find & how to use the crate -- crates.io might be more appropriate. I don't know however what we usually do in this situation, do we link out to third party crates?

Funny — I've tried to link to the exact iterator page, but the link exceeds the 100-char line limit!

I actually meant I'd prefer a link to the crate's page on crates.io. The documentation gives no information on the source, homepage, download location of the crate.

bors · 2015-07-14T01:22:56Z

☔ The latest upstream changes (presumably #26241) made this pull request unmergeable. Please resolve the merge conflicts.

nagisa · 2015-07-15T22:03:20Z

src/libcollections/str.rs

    ///
    /// # Examples
    ///
    /// ```
    /// # #![feature(str_char)]
-    /// let s = "Löwe 老虎 Léopard";
+    /// let s = "Łódź";


Is there a reason to change this example?

Yes, ASCII L was an easy case that doesn't illustrate the trickiness.

I wanted to show behavior with a mix of composed and decomposed characters (you can see first call takes whole composed Ł, but the second call only gets o "half" of ó, leaving modifier codepoint behind).

I wonder how we can make the normalization-dependent examples clear. A particular codepoint decomposition might not even survive the documentation processing (it might be renormalized), but of course that's mostly invisible to a majority of readers anyway.

I think I could add a comment with the string written using Unicode escapes (I avoided using escapes in the example code itself, because it makes code harder to read and without characters rendered you don't see what's going on).
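
As a rough sketch of what the escaped form could look like, the snippet below spells the string with escapes and takes one `char` at a time. The exact decomposition (composed Ł, decomposed ó and ź) is an assumption about the example string, and `split_first_char` is a hypothetical stand-in written against stable `chars()` rather than the feature-gated method in the diff.

```rust
/// Splits off the first `char`, returning it with the remaining slice.
/// Hypothetical helper; returns None for an empty string.
fn split_first_char(s: &str) -> Option<(char, &str)> {
    let mut chars = s.chars();
    let first = chars.next()?;
    Some((first, chars.as_str()))
}

fn main() {
    // Assumed spelling: Ł composed, ó and ź as base letter + U+0301.
    let s = "\u{141}o\u{301}dz\u{301}";

    // First call takes the whole composed Ł.
    let (c1, rest) = split_first_char(s).unwrap();
    // Second call takes only the base 'o', leaving U+0301 behind.
    let (c2, rest) = split_first_char(rest).unwrap();

    println!("{:?} then {:?}, remaining starts with {:?}", c1, c2, rest.chars().next());
}
```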

bors · 2015-07-21T14:45:59Z

☔ The latest upstream changes (presumably #27168) made this pull request unmergeable. Please resolve the merge conflicts.

Gankra · 2015-07-24T23:21:32Z

Oh whoops, apparently I've been assigned to this...

Gankra · 2015-07-24T23:24:17Z

r=me assuming travis is happy with the latest update

Gankra · 2015-07-24T23:24:38Z

Although honestly I usually defer String problems to @SimonSapin ...

SimonSapin · 2015-07-24T23:57:56Z

Looks good to me.

Gankra · 2015-07-25T04:05:03Z

@pornel just need a squash of the commits

kornelski · 2015-07-26T00:31:08Z

Squashed

Gankra · 2015-07-26T17:43:36Z

@bors r+ rollup

Thanks!

bors · 2015-07-26T17:43:37Z

📌 Commit c20e3fc has been approved by Gankro

bors · 2015-07-26T20:18:40Z

⌛ Testing commit c20e3fc with merge 6232f95...

Fixes #26689

bors · 2015-07-26T22:00:51Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

rust-highfive assigned Gankra Jul 13, 2015

bluss reviewed Jul 13, 2015
kornelski force-pushed the master branch from d64c563 to 4f8d847 Compare July 13, 2015 21:46

bluss reviewed Jul 13, 2015
kornelski force-pushed the master branch from 4f8d847 to 529c388 Compare July 14, 2015 09:40

nagisa reviewed Jul 15, 2015
Document Unicode complications in chars iterator

c20e3fc

kornelski force-pushed the master branch from 9ae483c to c20e3fc Compare July 25, 2015 15:02

bors merged commit c20e3fc into rust-lang:master Jul 26, 2015

bors mentioned this pull request Jul 26, 2015

Replace many uses of mem::transmute with more specific functions #27252

Merged
