lowercase(String) doesn't handle case transformation contexts correctly #7848

jiahao · 2014-08-05T16:24:37Z

End-of-word contexts. As reported in unexpected lowercase(Char) result #7847, for example, capital sigma Σ should map to lowercase terminal sigma ς at the end of a word, not ordinary sigma σ:
```
julia> lowercase("ΛΌΓΟΣ") #logos; λόγος
"λόγοσ"
```
Locale-specific contexts: let's not get anyone accidentally killed thanks to incorrect localization.
```
$LANG="tr_TR" julia
...
julia> uppercase("sikisinca") #SİKİSİNCA
"SIKISINCA"
```

JeffBezanson · 2014-08-05T16:38:47Z

Handling this cannot be our job. I'm inclined to only support latin-1 or ascii in Base's case-conversion functions, and leave everything else to libICU.

jiahao · 2014-08-05T17:47:32Z

Sure, but even working out a correct restriction of lowercase/uppercase to retain only the current behavior is tricky. At the very least, it means giving up any pretense of working in any Turkish-like locale.

JeffBezanson · 2014-08-05T18:07:21Z

Our options seem to be

Just call platform libc even though it is probably wrong.
Include libICU in base
Remove case conversions from base
Error for any character >0x7f or >0xff, and document that the functions are ASCII-only

ivarne · 2014-08-05T18:40:24Z

Add and document a locale type-parameter to the case transformation functions. The only (psudo)locale provided by Base is ASCII. Different locales can be provided by a ICU package (that might be one of the default packages, if that becomes a thing)

(Stupid gihub for changing 5. to 1.)

JeffBezanson · 2014-08-05T19:09:24Z

That's a good idea, but there needs to be a default anyway, so we could add a locale parameter later and have it default to ASCII. However I worry about having our own locale system separate from the OS's.

I'm also very suspicious about the need for uppercasing and lowercasing in general. Unicode case folding and normalization should be used for string comparison where necessary, and beyond that it seems unusual to need to uppercase a string. The uses of it in Base are few and dubious. For example the printf code uses lowercase(c)=='o', when c in ('o','O') would be shorter and more accurate.

JeffBezanson · 2014-08-05T19:15:39Z

There are a couple uses of lettercase related to standard syntaxes, namely hex numbers and UUIDs. ASCII-only is more than sufficient for that, but in fact inlining those conversions into the parsing code would be faster and less trouble.

catawbasam · 2014-08-05T19:41:27Z

Part of preparing a corpus for a text classification model is standardizing terms. Part of standardization is very often converting all terms into a single case.

For my own work, ascii-only uppercase/lowercase would be just fine.
That won't be true überall / överallt, but perhaps the need is specialized enough to justify putting the functionality into a ICU based package.

JeffBezanson · 2014-08-05T20:23:47Z

For term standardization it seems you'd want unicode normalization (normalize_string) and case folding. IIUC, you only need the terms in a standard format, not uppercase or lowercase specifically.

JeffBezanson · 2014-08-05T21:35:55Z

Another way to phrase my thoughts: many uses of lowercase are actually just puns for the operation "map characters in the range A-Z to corresponding characters in a-z". In fact very often just A-F.

catawbasam · 2014-08-06T01:26:59Z

For the short term, how about making it ASCII-only and perhaps noting in the docs that folks who need full-up unicode support can get it from UnicodeExtras.jl?

If uppercase(String) and lowercase(String) go, presumably ucfirst(String) and lcfirst(String) should go as well.

jiahao · 2014-08-06T01:37:50Z

Restricting the input to ASCIIString can still produce wrong results for Turkish locales. (Small i is an ASCII character that gets capitalized to something beyond the ASCII or Latin-1 ranges.)

elextr · 2014-08-06T01:43:34Z

The only (pseudo) locale supported by base should be ASCII to support the language as @JeffBezanson says. You never want the language specific parts to depend on locale or the meaning of the program can change. All actual human language/locale related functionality should be in a Unicode package where it can be explicitly used.

JeffBezanson · 2014-08-06T02:07:35Z

@elextr agreed. I'd prefer for uppercase, lowercase, ucfirst, and lcfirst to be removed from Base. Instead people can use Char arithmetic and/or map. In systems-programming uses of {upper,lower}case, people do not want "uppercase i according to the current locale", they just want "i" mapped to "I".

StefanKarpinski · 2014-08-06T02:09:48Z

I call bullshit on this. It is just unreasonable to not be able to uppercase and lowercase strings. That is just nonsense. I'm well aware that this has issues, but sometimes you just want to do this and it's just embarrassing if we say "sorry, you can't do that." It would be fine with me if these operations threw an error on ambiguous or impossible cases, but we cannot not support this in the language.

JeffBezanson · 2014-08-06T02:17:29Z

What does it mean to "just uppercase a string"? That's the whole problem.

I'd argue that a lot of uses are buggy. lowercase(c) == 'a' is not right when what you want is c in ('a','A').

Why does one need to uppercase strings? I'm not even sure what it's for. Again, it's not the right way to do case-insensitive comparison or canonicalization, which is a common application.

What are the ambiguous or impossible cases? It's well-defined and depends on locale.

StefanKarpinski · 2014-08-06T02:24:16Z

Well, there are clearly well-defined cases since this is a well-defined operation for ASCII, so we can at least start there. I'm fine with providing better functions, but telling people they can't uppercase ASCII strings is just dumb. Changing case is a fundamental part of data cleaning. I'm fine with your option 4, but option 3 is just ridiculous. I suspect there are many other ranges of Unicode for which this is unproblematic, but I'm ok with requiring ICU to handle that. Does libmojibake not have functions for this?

JeffBezanson · 2014-08-06T02:25:29Z

As has been pointed out many times in this discussion, case conversion of ASCII characters is also locale-dependent.

jiahao · 2014-08-06T02:25:59Z

Does libmojibake not have functions for this?

Nope, it is locale-agnostic.

JeffBezanson · 2014-08-06T02:30:33Z

Also, I agree the A..Z ↔ a..z function has a place, due to its common use for things like parsing hex. If that function had a good name, I think we'd all agree it could be in Base and the rest could be left to a package.

StefanKarpinski · 2014-08-06T02:33:23Z

I would argue that that is a clear mistake in the Unicode standard. ASCII stands for "American Standard Code for Information Interchange", so English capitalization rules should apply to letters in the ASCII range regardless of locale. Moreover, if the letter that looks like "i" in Turkish is capitalized differently than the letter that looks like "i" in English, then it is not the same letter and should have its own code point. If someone wants Turkish capitalization, they should ask for it. By default, we can do the sane thing, which is the normal A..Z ↔ a..z translation. This factoid about Turkish capitalization has been brought up a number of times, but I can't find any references – do you have a link, @jiahao?

jiahao · 2014-08-06T02:48:32Z

I have to agree with @JeffBezanson that all the use cases of lowercase and uppercase in Base are convenient puns for mapping between 'a':'z' and 'A':'Z', with the sole exception of lowercase being used as a sorting transformation in Pkg. Arguably the last set of use cases should be replaced with casefolding from libmojibake.

@StefanKarpinski It seems reasonable that ASCII 'i' originally referred to the English small letter 'i', but in the Unicode standard the code point U+0069 ('i') is named LATIN SMALL LETTER I. The Turkish 'i' problem has its own special mention in the Unicode standard (Section 5.18 Case Mappings) (v6.2.0, ch 5 pdf). The special-casing necessitated for Turkish-like locales is mentioned in most i18n libraries, including Microsoft's.

jiahao · 2014-08-06T03:03:04Z

Interestingly, the original ASCII standard of 1963 did not define lowercase Latin letters (jpg) and makes extensive reference to the possibility of extensions to handle non-English letters. The most modern incarnation of ASCII appears to be ANSI X3.4-1986 == ISO 646:1991 (paywalled standard), which simply defines 'i' as "Small letter i".

jiahao · 2014-08-06T03:22:49Z

How about calling the A..Z ↔ a..z functions asciiupper and asciilower and defining them only on ASCIIStrings? Hopefully it conveys the idea that the result is always ASCII-representable.

StefanKarpinski · 2014-08-06T03:30:17Z

This works on all strings, not just ASCIIStrings. It only works on ASCII characters.

StefanKarpinski · 2014-08-06T03:32:26Z

I also think that @ivarne's idea of providing some default locale and letting packages provide more advanced locale-dependent behavior is the right way to go.

jiahao · 2014-08-06T04:12:37Z

The default "ASCII locale" is really "en_US". I'm fine with making it the default, but let's call it what it is.

StefanKarpinski · 2014-08-06T04:48:03Z

No, the default locale should be C/POSIX.

ivarne · 2014-08-06T04:48:21Z

@jiahao doesn't "en_US" provide case transformation for characters above 7bit ASCII?

ivarne · 2014-08-06T04:51:17Z

One might also argue that codepoints above 128 should trigger an error so that non ASCII users gets forced to provide locale?

StefanKarpinski · 2014-08-06T04:53:12Z

The POSIX locale does not, however. It's not as thoroughly defined as one might like, but it's pretty clearly the thing we should be using as a default – you can read more about it here.

elextr · 2014-08-06T05:12:31Z

The problem with C/POSIX is that its specified for the portable character set but it is unspecified for all other code points. To unsure locale and platform independence it is necessary to separate the use limited to ASCII from the use for any Unicode. Perhaps using toupper or tolower without any locale specified would use a built-in version that threw errors on non-ASCII characters, but passed to the implementation in package Unicode if "C" or "POSIX" is specified explicitly.

jiahao · 2014-08-06T05:27:52Z

I withdraw my comment about en_US; apparently its definition and underlying character set is distro-specific. (en_US.ASCII has a closer definition to what I had in mind, but that's almost equivalent to C/POSIX and the latter is at least spec-ed.)

catawbasam · 2014-08-06T12:35:00Z

+1 to @ivarne's suggestion regarding <128, i.e. the ASCII 'standard character set'.

@StefanKarpinski : http://www.ascii-codes.com/cp857.html for a description of the Turkish characters. According to this page, the differences apply only in the 128-255 range,
i.e. the extended character set.

While the extended character set is locale dependent, I don't believe the standard character set is.

StefanKarpinski · 2014-08-06T14:25:29Z

The problem with C/POSIX is that its specified for the portable character set but it is unspecified for all other code points. To unsure locale and platform independence it is necessary to separate the use limited to ASCII from the use for any Unicode. Perhaps using toupper or tolower without any locale specified would use a built-in version that threw errors on non-ASCII characters, but passed to the implementation in package Unicode if "C" or "POSIX" is specified explicitly.

This isn't really a problem – we throw an error if someone does something in a locale that isn't well-defined in that locale. In the POSIX locale, which is the default one, we only support lowercase and uppercase (and various other locale-dependent functions) on characters in the "portable character set".

nalimilan · 2014-08-06T18:56:43Z

The documentation for Perl's lc (lowercase) is interesting: http://perldoc.perl.org/functions/lc.html They appear to change only ASCII characters by default, leaving other characters as they are. For Julia, I agree that raising an error is more reasonable, at least by default. Calling use locale in Perl enables locale-dependent Unicode rules. This could be implemented in Julia as an argument to toupper in a Unicode package.

Python 3 seems to always use Unicode rules, but the effect of the locale is not specified.

elextr · 2014-08-07T00:34:04Z

@StefanKarpinski just to be a little clearer:

The default behaviour of toupper and tolower should be locale and platform independent, defined for the POSIX portable character set only and throws errors for other code points.

This is distinct from a platform "C" or "POSIX" locale which is allowed to work for non-ASCII code points (its unspecified, not specified as an error). For example a platform may decide to return non-ASCII code points unchanged, and users of toupper and tolower may wish to use that behaviour so their program matches other software on the platform, thats why I distinguished between the default behaviour above (which is platform independent) and the explicit specification of "C" and "POSIX" using the platform definition.

Maybe the default can be called the "Julia" locale :)

StefanKarpinski · 2014-08-07T01:16:30Z

These functions are called lowercase and uppercase, not tolower and toupper, everyone :-)

Whatever other systems may do for undefined behavior in C/POSIX locale is their business. The general policy in Julia is if something is undefined, doing it should be an error. There are exceptions, but they had better be for a very good reason.

elextr · 2014-08-07T01:57:44Z

On 7 August 2014 11:16, Stefan Karpinski notifications@github.com wrote:

These functions are called lowercase and uppercase, not tolower and
toupper, everyone :-)

Baaa, just following the flock :-)

Then again maybe different behaviours justifies different names?

Whatever other systems may do for undefined behavior in C/POSIX locale is
their business. The general policy in Julia is if something is undefined,
doing it should be an error. There are exceptions, but they had better be
for a very good reason.

Agree, the default should follow Julia norms.

I'm just noting that Julia should not prevent the user from accessing
their platform "C/POSIX" behaviour by hiding it with its default. I
suggest that the most natural implementation is having explicit requests
for those locales use the platform implementation not the Julia defined
one, but YMMV.

I have spent considerable swearing time trying to fix problems where
software has (with the best of intentions) only allowed its definition of
locale specific features and so given differing results from other software
on the same platform, so maybe I'm a bit sensitive on this point.

—
Reply to this email directly or view it on GitHub
#7848 (comment).

JeffBezanson · 2014-08-07T02:08:09Z

There is a compelling if slightly depressing argument to be made about following the platform libc however wrong it may be.

It's worth noting that UnicodeData.txt specifies default case mappings for all upper- and lowercase letters. Technically these fields are classified as informative, but they still form a standard of a kind. I'd wager at least one of our C libraries is using that data in towupper.

elextr · 2014-08-07T02:50:37Z

@JeffBezanson the standard definition of libc toupper and tolower (where the names came from) are locale dependent, so libc is not suitable for the locale independent implementation.

UnicodeData.txt provides case mappings for characters so it can only return a single character as an answer like toupper and tolower.

As @StefanKarpinski pointed out, the OP was for lowercase and uppercase, and they are strings, so allowing different length results, as defined in Unicode SpecialCasing.txt. Some of this is locale independent (ß to SS) and some of which (the notorious Turkish "i" for example) is locale dependent.

ICU for example uses different APIs for locale dependent and locale independent transforms though of course its constrained by the capabilities of C. It is also one of the platforms that returns unchanged any character with no valid case transform, even for "C" locale.

So perhaps it would be best to define separate Julia specific locale independent APIs and more general locale specific APIs including the "C" locale.

Essentially there are valid use-cases for both and both need to be available.

JeffBezanson · 2014-08-07T03:22:46Z

I meant that a given libc might be incorrectly using a locale-independent implementation anyway, not that it should. My libc does not obey the Turkish locale for uppercase "i" for example.

I agree both should be available. A function in Base, when not passed an argument mentioning locale, should probably be locale-independent by default. Locale-dependent functions should follow the platform we're on or another respectable source like ICU.

jiahao · 2015-05-28T15:20:23Z

Closed in favor of #11471

ScottPJones · 2015-05-28T21:57:55Z

Pretty please reopen this... I think #11471 is really Windows specific breakage, and this is about handling locale dependent vs locale independent case mappings and foldings in Julia...

stevengj · 2015-06-01T17:13:29Z

@ScottPJones, I think the conclusion was that Julia's Base behavior should remain as-is (locale-independent, now that it is using utf8proc based on the UnicodeData.txt "informative" standard), and that more sophisticated locale-dependent behavior belongs in an ICU package or similar.

ScottPJones · 2015-06-01T17:30:50Z

OK, I see what you're getting at... it's not that the issue has been fixed, it's just that it shouldn't be fixed in Base. Do I have that right?

stevengj · 2015-06-01T21:42:28Z

Yes.

jiahao added the unicode label Aug 5, 2014

catawbasam mentioned this issue Aug 13, 2014

islower(Char) incorrect on Windows #7988

Closed

jiahao mentioned this issue Mar 1, 2015

RFC: Added test/i18n.jl: simple but full coverage of base/i18.jl #10364

Merged

tkelman mentioned this issue Mar 20, 2015

Performance improvement to eliminate some tolower calls libgit2/libgit2#3001

Closed

jiahao mentioned this issue May 28, 2015

uppercase/lowercase functions are not portable? #11471

Closed

jiahao closed this as completed May 28, 2015

stevengj mentioned this issue May 30, 2015

switch to utf8proc's upper/lowercase functions #11493

Merged

noe mentioned this issue Jan 23, 2020

Add lowercase script? hplt-project/sacremoses#82

Open

lowercase(String) doesn't handle case transformation contexts correctly #7848

lowercase(String) doesn't handle case transformation contexts correctly #7848

Comments

jiahao commented Aug 5, 2014

JeffBezanson commented Aug 5, 2014

jiahao commented Aug 5, 2014

JeffBezanson commented Aug 5, 2014

ivarne commented Aug 5, 2014

JeffBezanson commented Aug 5, 2014

JeffBezanson commented Aug 5, 2014

catawbasam commented Aug 5, 2014

JeffBezanson commented Aug 5, 2014

JeffBezanson commented Aug 5, 2014

catawbasam commented Aug 6, 2014

jiahao commented Aug 6, 2014

elextr commented Aug 6, 2014

JeffBezanson commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

JeffBezanson commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

JeffBezanson commented Aug 6, 2014

jiahao commented Aug 6, 2014

JeffBezanson commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

jiahao commented Aug 6, 2014

jiahao commented Aug 6, 2014

jiahao commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

jiahao commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

ivarne commented Aug 6, 2014

ivarne commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

elextr commented Aug 6, 2014

jiahao commented Aug 6, 2014

catawbasam commented Aug 6, 2014

StefanKarpinski commented Aug 6, 2014

nalimilan commented Aug 6, 2014

elextr commented Aug 7, 2014

StefanKarpinski commented Aug 7, 2014

elextr commented Aug 7, 2014

JeffBezanson commented Aug 7, 2014

elextr commented Aug 7, 2014

JeffBezanson commented Aug 7, 2014

jiahao commented May 28, 2015

ScottPJones commented May 28, 2015

stevengj commented Jun 1, 2015

ScottPJones commented Jun 1, 2015

stevengj commented Jun 1, 2015