Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document Unicode complications when iterating "characters" #27012

Merged
merged 1 commit into from
Jul 26, 2015

Conversation

kornelski
Copy link
Contributor

Fixes #26689

This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite the same. Edge cases added to some examples to demonstrate this.

However, I've kept use of the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @gankro (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@@ -599,7 +599,7 @@ impl str {
/// # #![feature(str_char, core)]
/// use std::str::CharRange;
///
/// let s = "中华Việt Nam";
/// let s = "中华Việt Nam";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github's diff rendering is broken, but this is ok in the source.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is not broken, the source was changed to use "decomposed" code points (an ASCII e followed by combining code points, rather than as a single code point) as the "This outputs" change below indicates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This change illustrates why code points may not be the unit you want.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might depend on the browser? Since it renders the diacritics from the e on top of the t in the after version, I think the diff rendering is broken here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, that’s a rendering bug. I thought you were saying there was not change at all there in the source.

@bluss
Copy link
Member

bluss commented Jul 13, 2015

I don't really like maximal pedantism either (everything we say is subtly wrong or not the whole story anyway), but I think Unicode Scalar Value is the only correct term for what we mean. Otherwise we could invent “Acceptable UTF-8 code point”.. We did in fact already invent a term, which is Rust's char, so I like that.

Unicode standard version 7.0 section 2.4 Code Points and Characters (PDF) is interesting and complicated -- some code points are representable in UTF-8 and some are not, some USVs are abstract characters, some noncharacters and some neither.

@kornelski
Copy link
Contributor Author

@bluss OK, I've reverted the \r example and changed code points to Unicode scalar values. It does sound a bit awkward though.

///
/// If the slice does not contain any characters, None is returned instead.
/// If the slice does not contain any Unicode scalar values, None is returned instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could actually just say "if the slice is empty"

@bluss
Copy link
Member

bluss commented Jul 13, 2015

Yes I agree, we don't want to read USV everywhere. So we should accept that code point is a subset of USV(?), so everywhere we say we return a code point etc, we are totally fine?

/// string.
///
/// Due to the design of UTF-8, this operation is `O(end)`. Use slicing
/// syntax if you want to use byte indices rather than codepoint indices.
/// syntax if you want to use byte indices rather than `char` indices.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is confusing -- we have an iterator called char_indices() which returns byte offsets from the start of the string paired with char.

@kornelski
Copy link
Contributor Author

OK, I've dialed down USV usage, and clarified the other bits.

/// For iteration over human-readable characters a [grapheme cluster iterator][1]
/// may be more appropriate.
///
/// [1]: https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link doesn't actually give the reader any indication where to find & how to use the crate -- crates.io might be more appropriate. I don't know however what we usually do in this situation, do we link out to third party crates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny — I've tried to link to the exact iterator page, but the link exceeds the 100-char line limit!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually meant I'd prefer a link to the crate's page on crates.io. The documentation gives no information on the source, homepage, download location of the crate.

@bors
Copy link
Contributor

bors commented Jul 14, 2015

☔ The latest upstream changes (presumably #26241) made this pull request unmergeable. Please resolve the merge conflicts.

///
/// # Examples
///
/// ```
/// # #![feature(str_char)]
/// let s = "Löwe 老虎 Léopard";
/// let s = "Łódź";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to change this example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ASCII L was an easy case that doesn't illustrate the trickyness.

I wanted to show behavior with a mix of composed and decomposed characters (you can see first call takes whole composed Ł, but the second call only gets o "half" of ó, leaving modifier codepoint behind).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how we can make the normalization-dependent examples clear. A particular codepoint decomposition might not even survive the documentation processing (it might be renormalized), but that's of course that's mostly invisible to a majority of readers anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I could add a comment with the string written using unicode escapes (I avoided using escapes in the example code itself, because it makes code harder to read and without characters rendered you don't see what's going on).

@bors
Copy link
Contributor

bors commented Jul 21, 2015

☔ The latest upstream changes (presumably #27168) made this pull request unmergeable. Please resolve the merge conflicts.

@Gankra
Copy link
Contributor

Gankra commented Jul 24, 2015

Oh whoops, apparently I've been assigned to this...

@Gankra
Copy link
Contributor

Gankra commented Jul 24, 2015

r=me assuming travis is happy with the latest update

@Gankra
Copy link
Contributor

Gankra commented Jul 24, 2015

Although honestly I usually defer String problems to @SimonSapin ...

@SimonSapin
Copy link
Contributor

Looks good to me.

@Gankra
Copy link
Contributor

Gankra commented Jul 25, 2015

@pornel just need a squash of the commits

@kornelski
Copy link
Contributor Author

Squashed

@Gankra
Copy link
Contributor

Gankra commented Jul 26, 2015

@bors r+ rollup

Thanks!

@bors
Copy link
Contributor

bors commented Jul 26, 2015

📌 Commit c20e3fc has been approved by Gankro

@bors
Copy link
Contributor

bors commented Jul 26, 2015

⌛ Testing commit c20e3fc with merge 6232f95...

bors added a commit that referenced this pull request Jul 26, 2015
Fixes #26689

This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite the same. Edge cases added to some examples to demonstrate this.

However, I've kept use of the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants