-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lowercase(String) doesn't handle case transformation contexts correctly #7848
Comments
Handling this cannot be our job. I'm inclined to only support latin-1 or ascii in Base's case-conversion functions, and leave everything else to libICU. |
Sure, but even working out a correct restriction of |
Our options seem to be
|
(Stupid gihub for changing |
That's a good idea, but there needs to be a default anyway, so we could add a locale parameter later and have it default to ASCII. However I worry about having our own locale system separate from the OS's. I'm also very suspicious about the need for uppercasing and lowercasing in general. Unicode case folding and normalization should be used for string comparison where necessary, and beyond that it seems unusual to need to uppercase a string. The uses of it in Base are few and dubious. For example the printf code uses |
There are a couple uses of lettercase related to standard syntaxes, namely hex numbers and UUIDs. ASCII-only is more than sufficient for that, but in fact inlining those conversions into the parsing code would be faster and less trouble. |
Part of preparing a corpus for a text classification model is standardizing terms. Part of standardization is very often converting all terms into a single case. For my own work, ascii-only uppercase/lowercase would be just fine. |
For term standardization it seems you'd want unicode normalization ( |
Another way to phrase my thoughts: many uses of |
For the short term, how about making it ASCII-only and perhaps noting in the docs that folks who need full-up unicode support can get it from UnicodeExtras.jl? If uppercase(String) and lowercase(String) go, presumably ucfirst(String) and lcfirst(String) should go as well. |
Restricting the input to ASCIIString can still produce wrong results for Turkish locales. (Small i is an ASCII character that gets capitalized to something beyond the ASCII or Latin-1 ranges.) |
The only (pseudo) locale supported by base should be ASCII to support the language as @JeffBezanson says. You never want the language specific parts to depend on locale or the meaning of the program can change. All actual human language/locale related functionality should be in a Unicode package where it can be explicitly used. |
@elextr agreed. I'd prefer for uppercase, lowercase, ucfirst, and lcfirst to be removed from Base. Instead people can use Char arithmetic and/or |
I call bullshit on this. It is just unreasonable to not be able to uppercase and lowercase strings. That is just nonsense. I'm well aware that this has issues, but sometimes you just want to do this and it's just embarrassing if we say "sorry, you can't do that." It would be fine with me if these operations threw an error on ambiguous or impossible cases, but we cannot not support this in the language. |
What does it mean to "just uppercase a string"? That's the whole problem. I'd argue that a lot of uses are buggy. Why does one need to uppercase strings? I'm not even sure what it's for. Again, it's not the right way to do case-insensitive comparison or canonicalization, which is a common application. What are the ambiguous or impossible cases? It's well-defined and depends on locale. |
Well, there are clearly well-defined cases since this is a well-defined operation for ASCII, so we can at least start there. I'm fine with providing better functions, but telling people they can't uppercase ASCII strings is just dumb. Changing case is a fundamental part of data cleaning. I'm fine with your option 4, but option 3 is just ridiculous. I suspect there are many other ranges of Unicode for which this is unproblematic, but I'm ok with requiring ICU to handle that. Does libmojibake not have functions for this? |
As has been pointed out many times in this discussion, case conversion of ASCII characters is also locale-dependent. |
Nope, it is locale-agnostic. |
Also, I agree the A..Z ↔ a..z function has a place, due to its common use for things like parsing hex. If that function had a good name, I think we'd all agree it could be in Base and the rest could be left to a package. |
I would argue that that is a clear mistake in the Unicode standard. ASCII stands for "American Standard Code for Information Interchange", so English capitalization rules should apply to letters in the ASCII range regardless of locale. Moreover, if the letter that looks like "i" in Turkish is capitalized differently than the letter that looks like "i" in English, then it is not the same letter and should have its own code point. If someone wants Turkish capitalization, they should ask for it. By default, we can do the sane thing, which is the normal A..Z ↔ a..z translation. This factoid about Turkish capitalization has been brought up a number of times, but I can't find any references – do you have a link, @jiahao? |
I have to agree with @JeffBezanson that all the use cases of @StefanKarpinski It seems reasonable that ASCII 'i' originally referred to the English small letter 'i', but in the Unicode standard the code point U+0069 ('i') is named LATIN SMALL LETTER I. The Turkish 'i' problem has its own special mention in the Unicode standard (Section 5.18 Case Mappings) (v6.2.0, ch 5 pdf). The special-casing necessitated for Turkish-like locales is mentioned in most i18n libraries, including Microsoft's. |
Interestingly, the original ASCII standard of 1963 did not define lowercase Latin letters (jpg) and makes extensive reference to the possibility of extensions to handle non-English letters. The most modern incarnation of ASCII appears to be ANSI X3.4-1986 == ISO 646:1991 (paywalled standard), which simply defines 'i' as "Small letter i". |
How about calling the A..Z ↔ a..z functions |
This works on all strings, not just ASCIIStrings. It only works on ASCII characters. |
I also think that @ivarne's idea of providing some default locale and letting packages provide more advanced locale-dependent behavior is the right way to go. |
The default "ASCII locale" is really "en_US". I'm fine with making it the default, but let's call it what it is. |
No, the default locale should be C/POSIX. |
@jiahao doesn't "en_US" provide case transformation for characters above 7bit ASCII? |
One might also argue that codepoints above 128 should trigger an error so that non ASCII users gets forced to provide locale? |
The POSIX locale does not, however. It's not as thoroughly defined as one might like, but it's pretty clearly the thing we should be using as a default – you can read more about it here. |
The problem with C/POSIX is that its specified for the portable character set but it is unspecified for all other code points. To unsure locale and platform independence it is necessary to separate the use limited to ASCII from the use for any Unicode. Perhaps using |
I withdraw my comment about |
+1 to @ivarne's suggestion regarding <128, i.e. the ASCII 'standard character set'. @StefanKarpinski : http://www.ascii-codes.com/cp857.html for a description of the Turkish characters. According to this page, the differences apply only in the 128-255 range, While the extended character set is locale dependent, I don't believe the standard character set is. |
This isn't really a problem – we throw an error if someone does something in a locale that isn't well-defined in that locale. In the POSIX locale, which is the default one, we only support lowercase and uppercase (and various other locale-dependent functions) on characters in the "portable character set". |
The documentation for Perl's Python 3 seems to always use Unicode rules, but the effect of the locale is not specified. |
@StefanKarpinski just to be a little clearer: The default behaviour of This is distinct from a platform "C" or "POSIX" locale which is allowed to work for non-ASCII code points (its unspecified, not specified as an error). For example a platform may decide to return non-ASCII code points unchanged, and users of Maybe the default can be called the "Julia" locale :) |
These functions are called Whatever other systems may do for undefined behavior in C/POSIX locale is their business. The general policy in Julia is if something is undefined, doing it should be an error. There are exceptions, but they had better be for a very good reason. |
On 7 August 2014 11:16, Stefan Karpinski notifications@github.com wrote:
Then again maybe different behaviours justifies different names?
I'm just noting that Julia should not prevent the user from accessing I have spent considerable swearing time trying to fix problems where
|
There is a compelling if slightly depressing argument to be made about following the platform libc however wrong it may be. It's worth noting that |
@JeffBezanson the standard definition of libc
As @StefanKarpinski pointed out, the OP was for ICU for example uses different APIs for locale dependent and locale independent transforms though of course its constrained by the capabilities of C. It is also one of the platforms that returns unchanged any character with no valid case transform, even for "C" locale. So perhaps it would be best to define separate Julia specific locale independent APIs and more general locale specific APIs including the "C" locale. Essentially there are valid use-cases for both and both need to be available. |
I meant that a given libc might be incorrectly using a locale-independent implementation anyway, not that it should. My libc does not obey the Turkish locale for uppercase "i" for example. I agree both should be available. A function in Base, when not passed an argument mentioning locale, should probably be locale-independent by default. Locale-dependent functions should follow the platform we're on or another respectable source like ICU. |
Closed in favor of #11471 |
Pretty please reopen this... I think #11471 is really Windows specific breakage, and this is about handling locale dependent vs locale independent case mappings and foldings in Julia... |
@ScottPJones, I think the conclusion was that Julia's Base behavior should remain as-is (locale-independent, now that it is using utf8proc based on the UnicodeData.txt "informative" standard), and that more sophisticated locale-dependent behavior belongs in an ICU package or similar. |
OK, I see what you're getting at... it's not that the issue has been fixed, it's just that it shouldn't be fixed in Base. Do I have that right? |
Yes. |
End-of-word contexts. As reported in unexpected lowercase(Char) result #7847, for example, capital sigma Σ should map to lowercase terminal sigma ς at the end of a word, not ordinary sigma σ:
Locale-specific contexts: let's not get anyone accidentally killed thanks to incorrect localization.
The text was updated successfully, but these errors were encountered: