to_ascii_uppercase and to_ascii_lowercase operate on non-ASCII characters #31203
Comments
This is the documented behavior: http://doc.rust-lang.org/std/ascii/trait.AsciiExt.html#tymethod.to_ascii_uppercase /cc @rust-lang/libs
In case anyone is curious why this works:
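A minimal sketch of the mechanism, assuming the input uses the decomposed form of "é" (ASCII `e` followed by U+0301 COMBINING ACUTE ACCENT). The helper names `code_points` and `before_and_after` are hypothetical and only for illustration. The sketch calls the inherent `str::to_ascii_uppercase` method available in current Rust rather than the `AsciiExt` trait linked above.

```rust
/// Format every code point of `s` as `U+XXXX`.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Hypothetical helper: apply the ASCII-only uppercase conversion and
/// return the code points before and after it.
fn before_and_after(s: &str) -> (Vec<String>, Vec<String>) {
    // Only code points in the ASCII range a-z are changed; everything
    // else, including combining marks, is copied through untouched.
    let upper = s.to_ascii_uppercase();
    (code_points(s), code_points(&upper))
}
```

For `"e\u{301}"` the code points before are `U+0065, U+0301` and after are `U+0045, U+0301`: the base letter is ASCII, so it is uppercased, and the combining accent then renders on top of the `E`. For the precomposed `"\u{e9}"` the single code point `U+00E9` is outside ASCII and is left unchanged.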
@steveklabnik It would be nice if the documentation could mention that it operates on codepoints, not grapheme clusters.
Also, worth considering adding one of the examples above to go with it.
I think it's confusing that it talks about characters when it exhibits the weird behaviour mentioned above. Maybe it could just explicitly say codepoints?
Yes, it's confusing naming unfortunately. Maybe we could add these unexpected test cases to the documentation?
For what it's worth, although I'd prefer this method do "the right thing", simply having an example to make the behaviour with respect to combining code points clear would be a reasonable solution.
@steveklabnik Note that this function may transform "é" to "É", which is the "ASCII-violating" behavior. It needs pointing out in the docs. However, we don't need to mention grapheme clusters. Which algorithms you use over your Unicode data is application dependent. Implying that grapheme clusters are always the right thing to do is misleading. Rust strings are brilliant as they are in providing minimal Unicode consistency with low overhead.
Yes, I would also think that examples of this are a good idea.
I took a stab at this: #31401
Behold!
This is obviously silly. The problem is that this is running an ASCII-only operation on Unicode strings without actually dealing with their Unicode-ness.
These functions should either correctly deal with grapheme clusters (by ignoring them since they're not in ASCII), or document that they do not correctly handle grapheme clusters, preferably with an example (like the above).
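A rough sketch of what the first option could look like using only the standard library. The function name `to_ascii_uppercase_skipping_combined` is hypothetical, and the check is an assumption made for brevity: it only looks for marks in the Combining Diacritical Marks block (U+0300 to U+036F), which is far from full grapheme cluster segmentation. A real implementation would more likely rely on a segmentation library such as the `unicode-segmentation` crate.

```rust
/// Hypothetical helper: uppercase ASCII letters, but leave a letter alone
/// when the next code point is a combining diacritical mark
/// (U+0300..=U+036F), since the pair then forms a non-ASCII cluster.
fn to_ascii_uppercase_skipping_combined(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        // Look ahead one code point without consuming it.
        let followed_by_mark = chars
            .peek()
            .map_or(false, |next| ('\u{300}'..='\u{36f}').contains(next));
        if followed_by_mark {
            out.push(c); // part of a combined cluster: keep as-is
        } else {
            out.push(c.to_ascii_uppercase()); // non-ASCII passes through unchanged
        }
    }
    out
}
```

With the input `"cafe\u{301} au lait"`, the standard `to_ascii_uppercase` yields `"CAFE\u{301} AU LAIT"`, while this sketch yields `"CAFe\u{301} AU LAIT"`, leaving the accented letter untouched.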
(Actually having a standard Ascii type would be even more muchly preferable, but I suspect that's way out of scope.)
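To make the idea of a dedicated type concrete, here is a minimal sketch of a hypothetical `Ascii` newtype. Nothing here is an existing standard library API apart from `str::is_ascii` and `str::to_ascii_uppercase`; the type, its constructor, and its methods are assumptions for illustration only.

```rust
/// Hypothetical newtype: a string that is guaranteed to be pure ASCII.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Ascii(String);

impl Ascii {
    /// Accept the input only if every byte is in the ASCII range.
    fn new(s: &str) -> Option<Ascii> {
        if s.is_ascii() {
            Some(Ascii(s.to_owned()))
        } else {
            None
        }
    }

    /// Uppercasing is unambiguous here: no combining marks can be present.
    fn to_uppercase(&self) -> Ascii {
        Ascii(self.0.to_ascii_uppercase())
    }

    fn as_str(&self) -> &str {
        &self.0
    }
}
```

Under this design, `Ascii::new("e\u{301}")` returns `None`, so the surprising case is rejected at construction time instead of being silently transformed, while `Ascii::new("hello")` succeeds and uppercases to `"HELLO"`.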