Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lowercase(String) doesn't handle case transformation contexts correctly #7848

Closed
jiahao opened this issue Aug 5, 2014 · 46 comments
Closed
Labels
unicode Related to unicode characters and encodings

Comments

@jiahao
Copy link
Member

jiahao commented Aug 5, 2014

  1. End-of-word contexts. As reported in unexpected lowercase(Char) result #7847, for example, capital sigma Σ should map to lowercase terminal sigma ς at the end of a word, not ordinary sigma σ:

    julia> lowercase("ΛΌΓΟΣ") #logos; λόγος
    "λόγοσ"
  2. Locale-specific contexts: let's not get anyone accidentally killed thanks to incorrect localization.

    $LANG="tr_TR" julia
    ...
    julia> uppercase("sikisinca") #SİKİSİNCA
    "SIKISINCA"
@jiahao jiahao added the unicode label Aug 5, 2014
@JeffBezanson
Copy link
Member

Handling this cannot be our job. I'm inclined to only support latin-1 or ascii in Base's case-conversion functions, and leave everything else to libICU.

@jiahao
Copy link
Member Author

jiahao commented Aug 5, 2014

Sure, but even working out a correct restriction of lowercase/uppercase to retain only the current behavior is tricky. At the very least, it means giving up any pretense of working in any Turkish-like locale.

@JeffBezanson
Copy link
Member

Our options seem to be

  1. Just call platform libc even though it is probably wrong.
  2. Include libICU in base
  3. Remove case conversions from base
  4. Error for any character >0x7f or >0xff, and document that the functions are ASCII-only

@ivarne
Copy link
Member

ivarne commented Aug 5, 2014

  1. Add and document a locale type-parameter to the case transformation functions. The only (psudo)locale provided by Base is ASCII. Different locales can be provided by a ICU package (that might be one of the default packages, if that becomes a thing)

(Stupid gihub for changing 5. to 1.)

@JeffBezanson
Copy link
Member

That's a good idea, but there needs to be a default anyway, so we could add a locale parameter later and have it default to ASCII. However I worry about having our own locale system separate from the OS's.

I'm also very suspicious about the need for uppercasing and lowercasing in general. Unicode case folding and normalization should be used for string comparison where necessary, and beyond that it seems unusual to need to uppercase a string. The uses of it in Base are few and dubious. For example the printf code uses lowercase(c)=='o', when c in ('o','O') would be shorter and more accurate.

@JeffBezanson
Copy link
Member

There are a couple uses of lettercase related to standard syntaxes, namely hex numbers and UUIDs. ASCII-only is more than sufficient for that, but in fact inlining those conversions into the parsing code would be faster and less trouble.

@catawbasam
Copy link
Contributor

Part of preparing a corpus for a text classification model is standardizing terms. Part of standardization is very often converting all terms into a single case.

For my own work, ascii-only uppercase/lowercase would be just fine.
That won't be true überall / överallt, but perhaps the need is specialized enough to justify putting the functionality into a ICU based package.

@JeffBezanson
Copy link
Member

For term standardization it seems you'd want unicode normalization (normalize_string) and case folding. IIUC, you only need the terms in a standard format, not uppercase or lowercase specifically.

@JeffBezanson
Copy link
Member

Another way to phrase my thoughts: many uses of lowercase are actually just puns for the operation "map characters in the range A-Z to corresponding characters in a-z". In fact very often just A-F.

@catawbasam
Copy link
Contributor

For the short term, how about making it ASCII-only and perhaps noting in the docs that folks who need full-up unicode support can get it from UnicodeExtras.jl?

If uppercase(String) and lowercase(String) go, presumably ucfirst(String) and lcfirst(String) should go as well.

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

Restricting the input to ASCIIString can still produce wrong results for Turkish locales. (Small i is an ASCII character that gets capitalized to something beyond the ASCII or Latin-1 ranges.)

@elextr
Copy link

elextr commented Aug 6, 2014

The only (pseudo) locale supported by base should be ASCII to support the language as @JeffBezanson says. You never want the language specific parts to depend on locale or the meaning of the program can change. All actual human language/locale related functionality should be in a Unicode package where it can be explicitly used.

@JeffBezanson
Copy link
Member

@elextr agreed. I'd prefer for uppercase, lowercase, ucfirst, and lcfirst to be removed from Base. Instead people can use Char arithmetic and/or map. In systems-programming uses of {upper,lower}case, people do not want "uppercase i according to the current locale", they just want "i" mapped to "I".

@StefanKarpinski
Copy link
Member

I call bullshit on this. It is just unreasonable to not be able to uppercase and lowercase strings. That is just nonsense. I'm well aware that this has issues, but sometimes you just want to do this and it's just embarrassing if we say "sorry, you can't do that." It would be fine with me if these operations threw an error on ambiguous or impossible cases, but we cannot not support this in the language.

@JeffBezanson
Copy link
Member

What does it mean to "just uppercase a string"? That's the whole problem.

I'd argue that a lot of uses are buggy. lowercase(c) == 'a' is not right when what you want is c in ('a','A').

Why does one need to uppercase strings? I'm not even sure what it's for. Again, it's not the right way to do case-insensitive comparison or canonicalization, which is a common application.

What are the ambiguous or impossible cases? It's well-defined and depends on locale.

@StefanKarpinski
Copy link
Member

Well, there are clearly well-defined cases since this is a well-defined operation for ASCII, so we can at least start there. I'm fine with providing better functions, but telling people they can't uppercase ASCII strings is just dumb. Changing case is a fundamental part of data cleaning. I'm fine with your option 4, but option 3 is just ridiculous. I suspect there are many other ranges of Unicode for which this is unproblematic, but I'm ok with requiring ICU to handle that. Does libmojibake not have functions for this?

@JeffBezanson
Copy link
Member

As has been pointed out many times in this discussion, case conversion of ASCII characters is also locale-dependent.

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

Does libmojibake not have functions for this?

Nope, it is locale-agnostic.

@JeffBezanson
Copy link
Member

Also, I agree the A..Z ↔ a..z function has a place, due to its common use for things like parsing hex. If that function had a good name, I think we'd all agree it could be in Base and the rest could be left to a package.

@StefanKarpinski
Copy link
Member

I would argue that that is a clear mistake in the Unicode standard. ASCII stands for "American Standard Code for Information Interchange", so English capitalization rules should apply to letters in the ASCII range regardless of locale. Moreover, if the letter that looks like "i" in Turkish is capitalized differently than the letter that looks like "i" in English, then it is not the same letter and should have its own code point. If someone wants Turkish capitalization, they should ask for it. By default, we can do the sane thing, which is the normal A..Z ↔ a..z translation. This factoid about Turkish capitalization has been brought up a number of times, but I can't find any references – do you have a link, @jiahao?

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

I have to agree with @JeffBezanson that all the use cases of lowercase and uppercase in Base are convenient puns for mapping between 'a':'z' and 'A':'Z', with the sole exception of lowercase being used as a sorting transformation in Pkg. Arguably the last set of use cases should be replaced with casefolding from libmojibake.

@StefanKarpinski It seems reasonable that ASCII 'i' originally referred to the English small letter 'i', but in the Unicode standard the code point U+0069 ('i') is named LATIN SMALL LETTER I. The Turkish 'i' problem has its own special mention in the Unicode standard (Section 5.18 Case Mappings) (v6.2.0, ch 5 pdf). The special-casing necessitated for Turkish-like locales is mentioned in most i18n libraries, including Microsoft's.

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

Interestingly, the original ASCII standard of 1963 did not define lowercase Latin letters (jpg) and makes extensive reference to the possibility of extensions to handle non-English letters. The most modern incarnation of ASCII appears to be ANSI X3.4-1986 == ISO 646:1991 (paywalled standard), which simply defines 'i' as "Small letter i".

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

How about calling the A..Z ↔ a..z functions asciiupper and asciilower and defining them only on ASCIIStrings? Hopefully it conveys the idea that the result is always ASCII-representable.

@StefanKarpinski
Copy link
Member

This works on all strings, not just ASCIIStrings. It only works on ASCII characters.

@StefanKarpinski
Copy link
Member

I also think that @ivarne's idea of providing some default locale and letting packages provide more advanced locale-dependent behavior is the right way to go.

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

The default "ASCII locale" is really "en_US". I'm fine with making it the default, but let's call it what it is.

@StefanKarpinski
Copy link
Member

No, the default locale should be C/POSIX.

@ivarne
Copy link
Member

ivarne commented Aug 6, 2014

@jiahao doesn't "en_US" provide case transformation for characters above 7bit ASCII?

@ivarne
Copy link
Member

ivarne commented Aug 6, 2014

One might also argue that codepoints above 128 should trigger an error so that non ASCII users gets forced to provide locale?

@StefanKarpinski
Copy link
Member

The POSIX locale does not, however. It's not as thoroughly defined as one might like, but it's pretty clearly the thing we should be using as a default – you can read more about it here.

@elextr
Copy link

elextr commented Aug 6, 2014

The problem with C/POSIX is that its specified for the portable character set but it is unspecified for all other code points. To unsure locale and platform independence it is necessary to separate the use limited to ASCII from the use for any Unicode. Perhaps using toupper or tolower without any locale specified would use a built-in version that threw errors on non-ASCII characters, but passed to the implementation in package Unicode if "C" or "POSIX" is specified explicitly.

@jiahao
Copy link
Member Author

jiahao commented Aug 6, 2014

I withdraw my comment about en_US; apparently its definition and underlying character set is distro-specific. (en_US.ASCII has a closer definition to what I had in mind, but that's almost equivalent to C/POSIX and the latter is at least spec-ed.)

@catawbasam
Copy link
Contributor

+1 to @ivarne's suggestion regarding <128, i.e. the ASCII 'standard character set'.

@StefanKarpinski : http://www.ascii-codes.com/cp857.html for a description of the Turkish characters. According to this page, the differences apply only in the 128-255 range,
i.e. the extended character set.

While the extended character set is locale dependent, I don't believe the standard character set is.

@StefanKarpinski
Copy link
Member

The problem with C/POSIX is that its specified for the portable character set but it is unspecified for all other code points. To unsure locale and platform independence it is necessary to separate the use limited to ASCII from the use for any Unicode. Perhaps using toupper or tolower without any locale specified would use a built-in version that threw errors on non-ASCII characters, but passed to the implementation in package Unicode if "C" or "POSIX" is specified explicitly.

This isn't really a problem – we throw an error if someone does something in a locale that isn't well-defined in that locale. In the POSIX locale, which is the default one, we only support lowercase and uppercase (and various other locale-dependent functions) on characters in the "portable character set".

@nalimilan
Copy link
Member

The documentation for Perl's lc (lowercase) is interesting: http://perldoc.perl.org/functions/lc.html They appear to change only ASCII characters by default, leaving other characters as they are. For Julia, I agree that raising an error is more reasonable, at least by default. Calling use locale in Perl enables locale-dependent Unicode rules. This could be implemented in Julia as an argument to toupper in a Unicode package.

Python 3 seems to always use Unicode rules, but the effect of the locale is not specified.

@elextr
Copy link

elextr commented Aug 7, 2014

@StefanKarpinski just to be a little clearer:

The default behaviour of toupper and tolower should be locale and platform independent, defined for the POSIX portable character set only and throws errors for other code points.

This is distinct from a platform "C" or "POSIX" locale which is allowed to work for non-ASCII code points (its unspecified, not specified as an error). For example a platform may decide to return non-ASCII code points unchanged, and users of toupper and tolower may wish to use that behaviour so their program matches other software on the platform, thats why I distinguished between the default behaviour above (which is platform independent) and the explicit specification of "C" and "POSIX" using the platform definition.

Maybe the default can be called the "Julia" locale :)

@StefanKarpinski
Copy link
Member

These functions are called lowercase and uppercase, not tolower and toupper, everyone :-)

Whatever other systems may do for undefined behavior in C/POSIX locale is their business. The general policy in Julia is if something is undefined, doing it should be an error. There are exceptions, but they had better be for a very good reason.

@elextr
Copy link

elextr commented Aug 7, 2014

On 7 August 2014 11:16, Stefan Karpinski notifications@github.com wrote:

These functions are called lowercase and uppercase, not tolower and
toupper, everyone :-)

Baaa, just following the flock :-)

Then again maybe different behaviours justifies different names?

Whatever other systems may do for undefined behavior in C/POSIX locale is
their business. The general policy in Julia is if something is undefined,
doing it should be an error. There are exceptions, but they had better be
for a very good reason.

Agree, the default should follow Julia norms.

I'm just noting that Julia should not prevent the user from accessing
their platform "C/POSIX" behaviour by hiding it with its default. I
suggest that the most natural implementation is having explicit requests
for those locales use the platform implementation not the Julia defined
one, but YMMV.

I have spent considerable swearing time trying to fix problems where
software has (with the best of intentions) only allowed its definition of
locale specific features and so given differing results from other software
on the same platform, so maybe I'm a bit sensitive on this point.


Reply to this email directly or view it on GitHub
#7848 (comment).

@JeffBezanson
Copy link
Member

There is a compelling if slightly depressing argument to be made about following the platform libc however wrong it may be.

It's worth noting that UnicodeData.txt specifies default case mappings for all upper- and lowercase letters. Technically these fields are classified as informative, but they still form a standard of a kind. I'd wager at least one of our C libraries is using that data in towupper.

@elextr
Copy link

elextr commented Aug 7, 2014

@JeffBezanson the standard definition of libc toupper and tolower (where the names came from) are locale dependent, so libc is not suitable for the locale independent implementation.

UnicodeData.txt provides case mappings for characters so it can only return a single character as an answer like toupper and tolower.

As @StefanKarpinski pointed out, the OP was for lowercase and uppercase, and they are strings, so allowing different length results, as defined in Unicode SpecialCasing.txt. Some of this is locale independent (ß to SS) and some of which (the notorious Turkish "i" for example) is locale dependent.

ICU for example uses different APIs for locale dependent and locale independent transforms though of course its constrained by the capabilities of C. It is also one of the platforms that returns unchanged any character with no valid case transform, even for "C" locale.

So perhaps it would be best to define separate Julia specific locale independent APIs and more general locale specific APIs including the "C" locale.

Essentially there are valid use-cases for both and both need to be available.

@JeffBezanson
Copy link
Member

I meant that a given libc might be incorrectly using a locale-independent implementation anyway, not that it should. My libc does not obey the Turkish locale for uppercase "i" for example.

I agree both should be available. A function in Base, when not passed an argument mentioning locale, should probably be locale-independent by default. Locale-dependent functions should follow the platform we're on or another respectable source like ICU.

@jiahao
Copy link
Member Author

jiahao commented May 28, 2015

Closed in favor of #11471

@jiahao jiahao closed this as completed May 28, 2015
@ScottPJones
Copy link
Contributor

Pretty please reopen this... I think #11471 is really Windows specific breakage, and this is about handling locale dependent vs locale independent case mappings and foldings in Julia...

@stevengj
Copy link
Member

stevengj commented Jun 1, 2015

@ScottPJones, I think the conclusion was that Julia's Base behavior should remain as-is (locale-independent, now that it is using utf8proc based on the UnicodeData.txt "informative" standard), and that more sophisticated locale-dependent behavior belongs in an ICU package or similar.

@ScottPJones
Copy link
Contributor

OK, I see what you're getting at... it's not that the issue has been fixed, it's just that it shouldn't be fixed in Base. Do I have that right?

@stevengj
Copy link
Member

stevengj commented Jun 1, 2015

Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

9 participants