-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document Unicode complications when iterating "characters" #27012
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @gankro (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
@@ -599,7 +599,7 @@ impl str { | |||
/// # #![feature(str_char, core)] | |||
/// use std::str::CharRange; | |||
/// | |||
/// let s = "中华Việt Nam"; | |||
/// let s = "中华Việt Nam"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
github's diff rendering is broken, but this is ok in the source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff is not broken, the source was changed to use "decomposed" code points (an ASCII e
followed by combining code points, rather than ệ
as a single code point) as the "This outputs" change below indicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This change illustrates why code points may not be the unit you want.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might depend on the browser? Since it renders the diacritics from the e
on top of the t
in the after version, I think the diff rendering is broken here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yeah, that’s a rendering bug. I thought you were saying there was not change at all there in the source.
I don't really like maximal pedantism either (everything we say is subtly wrong or not the whole story anyway), but I think Unicode Scalar Value is the only correct term for what we mean. Otherwise we could invent “Acceptable UTF-8 code point”.. We did in fact already invent a term, which is Rust's Unicode standard version 7.0 section 2.4 Code Points and Characters (PDF) is interesting and complicated -- some code points are representable in UTF-8 and some are not, some USVs are abstract characters, some noncharacters and some neither. |
@bluss OK, I've reverted the |
/// | ||
/// If the slice does not contain any characters, None is returned instead. | ||
/// If the slice does not contain any Unicode scalar values, None is returned instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could actually just say "if the slice is empty"
Yes I agree, we don't want to read USV everywhere. So we should accept that code point is a subset of USV(?), so everywhere we say we return a code point etc, we are totally fine? |
/// string. | ||
/// | ||
/// Due to the design of UTF-8, this operation is `O(end)`. Use slicing | ||
/// syntax if you want to use byte indices rather than codepoint indices. | ||
/// syntax if you want to use byte indices rather than `char` indices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this is confusing -- we have an iterator called char_indices()
which returns byte offsets from the start of the string paired with char
.
OK, I've dialed down USV usage, and clarified the other bits. |
/// For iteration over human-readable characters a [grapheme cluster iterator][1] | ||
/// may be more appropriate. | ||
/// | ||
/// [1]: https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link doesn't actually give the reader any indication where to find & how to use the crate -- crates.io might be more appropriate. I don't know however what we usually do in this situation, do we link out to third party crates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny — I've tried to link to the exact iterator page, but the link exceeds the 100-char line limit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually meant I'd prefer a link to the crate's page on crates.io. The documentation gives no information on the source, homepage, download location of the crate.
☔ The latest upstream changes (presumably #26241) made this pull request unmergeable. Please resolve the merge conflicts. |
/// | ||
/// # Examples | ||
/// | ||
/// ``` | ||
/// # #![feature(str_char)] | ||
/// let s = "Löwe 老虎 Léopard"; | ||
/// let s = "Łódź"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to change this example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ASCII L
was an easy case that doesn't illustrate the trickyness.
I wanted to show behavior with a mix of composed and decomposed characters (you can see first call takes whole composed Ł
, but the second call only gets o
"half" of ó
, leaving modifier codepoint behind).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how we can make the normalization-dependent examples clear. A particular codepoint decomposition might not even survive the documentation processing (it might be renormalized), but that's of course that's mostly invisible to a majority of readers anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I could add a comment with the string written using unicode escapes (I avoided using escapes in the example code itself, because it makes code harder to read and without characters rendered you don't see what's going on).
☔ The latest upstream changes (presumably #27168) made this pull request unmergeable. Please resolve the merge conflicts. |
Oh whoops, apparently I've been assigned to this... |
r=me assuming travis is happy with the latest update |
Although honestly I usually defer String problems to @SimonSapin ... |
Looks good to me. |
@pornel just need a squash of the commits |
Squashed |
@bors r+ rollup Thanks! |
📌 Commit c20e3fc has been approved by |
Fixes #26689 This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite the same. Edge cases added to some examples to demonstrate this. However, I've kept use of the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.
Fixes #26689
This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite the same. Edge cases added to some examples to demonstrate this.
However, I've kept use of the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.