Corrected very minor documentation detail about Unicode and Japanese #40499
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @steveklabnik (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
Interesting, I've never heard of this. I'm gonna look into it tomorrow but if anyone else wants to r+ before then, feel free. |
The wiki page Unicode Equivalence under the subtitle 'Typographic Conventions' has some more details. |
FULLWIDTH LATIN {SMALL,CAPITAL} LETTER A is still a Latin letter from the Latin script. One can attribute exactly two scripts to the Japanese writing system: kanji and kana. Neither of those has case, and therefore the previous statement is just fine. Now, I'm totally fine with making a change like this, but attributing logographs used across the whole of CJK to Japanese seems... unfair, I guess? How about we just use kana (これ) instead of the current kanji for the example?
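A minimal sketch of the distinction drawn above, assuming Rust's standard `char::to_uppercase` and `char::to_lowercase`. The helper names `upper` and `lower` are hypothetical and exist only for this illustration. It shows that the fullwidth Latin letters are cased, while kanji and kana convert into themselves.

```rust
/// Hypothetical helper: collect the uppercase mapping of one `char` into a `String`.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Hypothetical helper: collect the lowercase mapping of one `char` into a `String`.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // Fullwidth Latin letters belong to the Latin script and are cased.
    assert_eq!(lower('\u{FF21}'), "\u{FF41}"); // Ａ -> ａ
    assert_eq!(upper('\u{FF41}'), "\u{FF21}"); // ａ -> Ａ

    // Kanji and kana have no case, so both conversions return the input unchanged.
    for c in "山これ".chars() {
        assert_eq!(upper(c), c.to_string());
        assert_eq!(lower(c), c.to_string());
    }
}
```

The loop covers one kanji and two hiragana, so the same check applies whichever character the documentation example ends up using.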
Sounds good to me. |
It's not that cut and dried. Unicode is hard because we are dealing with human languages in all their complexity. By changing the documentation from 'Japanese' to 'Japanese kanji' we can avoid that complexity.
I can't see the value in changing from kanji to hiragana; it doesn't change anything. Anyway, 山 is a nice character.
My little advice: how about using "CJK characters" (or "CJKV characters"?) instead of "Japanese kanji characters"? These characters are widely used in Chinese, Japanese, and Korean (and Vietnamese), not only Japanese.
How about
? |
Looks good. |
@bors: r+ rollup thanks ! |
📌 Commit 18a8494 has been approved by |
Corrected very minor documentation detail about Unicode and Japanese. Japanese half-width and full-width romaji characters do have upper and lowercase according to Unicode (but other Japanese characters do not). For example, `assert_eq!('\u{FF21}'.to_lowercase().collect::<String>(), "\u{FF41}");` r? @steveklabnik
Late to the party, but is this a valid explanation of
That is, there's no uppercase ligature FF in Unicode (to be clear, I'm concerned about the wording "do not have both uppercase and lowercase".) The same almost applies to
(Note the asymmetry here --- the uppercase eszett ẞ is non-orthographic in modern German) |
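A minimal sketch of the asymmetry under discussion, assuming Rust's `char::to_uppercase` and `char::to_lowercase` apply Unicode's unconditional multi-character case mappings. The helper names `full_upper` and `full_lower` are hypothetical. Both the ligature and the eszett are cased, yet neither has a single-code-point uppercase partner that the conversions round-trip through.

```rust
/// Hypothetical helper: uppercase mapping of one `char`, which may be several chars long.
fn full_upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Hypothetical helper: lowercase mapping of one `char`.
fn full_lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // U+FB00 LATIN SMALL LIGATURE FF uppercases to two separate letters.
    assert_eq!(full_upper('\u{FB00}'), "FF");
    // Lowercasing the result does not restore the ligature.
    assert_eq!("FF".to_lowercase(), "ff");

    // U+00DF LATIN SMALL LETTER SHARP S uppercases to "SS", not to U+1E9E.
    assert_eq!(full_upper('ß'), "SS");
    // U+1E9E LATIN CAPITAL LETTER SHARP S does lowercase to ß.
    assert_eq!(full_lower('\u{1E9E}'), "ß");
}
```

Neither character converts into itself when uppercased, which is the case the wording "do not have both uppercase and lowercase" has to account for.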
Well, this explanation is discussing the caseless characters. Both of these ligatures are caseful; it is just that Unicode has no assigned code point for the uppercase variant of the ligatures you’ve given as an example.
First, that assumption isn't evident from the text. Second, it isn't a good idea to focus on the "caseful/caseless" dichotomy because the input being caseful is only a necessary condition for any of casing conversions to be defined. E.g.
I think all we can say is
Wdyt?
So... you're actually supporting my claim, right? They "do not have both uppercase and lowercase" and yet don't "convert into themselves." |
In my comment I’ve very purposefully used “character”(1) to mean a real character used in a language out there somewhere and “code point”(2) to mean an assigned code point in Unicode. That is, what I’m really saying is that this text should be discussing real-world characters (and, I think, it currently is, due to its use of the word “character”). I’m very open to improving the wording and/or making it more obvious. As per the Unicode glossary: (1): The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.
Japanese half-width and full-width romaji characters do have upper and lowercase according to Unicode (but other Japanese characters do not). For example,
assert_eq!('\u{FF21}'.to_lowercase().collect::<String>(), "\u{FF41}");
r? @steveklabnik