uppercase/lowercase functions are not portable? #11471
Comments
Yep... although I wouldn't solve this with utf8proc... I'd intended to tackle this shortly (it's something else we need for our project, and I'd prefer to use Julia as much as possible). |
Why not? If utf8proc does this wrong, we should just fix it instead of re-implementing. |
Because I believe it can be better implemented in pure Julia (sorry, now I'm a convert!) |
Essentially a duplicate of #7848. The Unicode standard defines upper/lower case mappings on a per-character basis, but a correct transformation must necessarily take the locale into account. (See the infamous Turkish i - İ, ı - I pairings as the most egregious examples.) |
I don't think this really is a duplicate of #7848, and should be reopened. |
Did you read the discussion in #7848? The conclusion was exactly what you said, that we should have a locale independent choice in Base and have the locale specific choices not in base. |
I missed the last couple of comments (I'm in a meeting right now!) Sorry, totally my fault... The only difference is that I think any locale-dependent mappings should be done by extending the Base methods via a package... |
Also, I think @stevengj's point was something else, and this may still need to be open... it was about characters outside the BMP not being handled on Windows (even in a locale-independent fashion). |
If it really makes you happier, I'll close #7848 in favor of this issue. |
Pure Julia code for this has its pluses and minuses. On the minus side, it is more work: this information is already in utf8proc, and adding a couple of functions to expose it to Julia is literally something like 6 lines of C code. Also on the minus side, having it in utf8proc makes it available to non-Julia users. On the plus side, writing code in Julia is more fun, and potentially allows for more optimizations (e.g. inlining). For my own part, I tend to default to the path of least work. |
About the minuses... more work, well, maybe, but it's something that I'd like to do when I want to take a break from other stuff 😀 (i.e. the Julia is fun part), plus doing it in Julia helps improve my Julia skills, and finally, I'm not talking at all about removing anything from |
Actually, it looks like the But I'm confused, because we have test coverage of |
Sorry, I didn't realize that |
By the way, shouldn't |
Yes about using titlecase |
BTW, could you reopen #7848, which really is separate, and isn't addressed by your nice fix to this issue? |
I don't see the need. We already decided that locale-specific transformations should not belong in base, which is the remaining part of that issue. |
@stevengj I'd have preferred we wait for a true utf8proc release before relying on it. Now I need to package this development version to build the nightlies... |
There is a tag https://github.com/JuliaLang/utf8proc/releases/tag/1.3-dev1 but it's got a -dev1 marker on it, should we promote that to 1.3.0? |
Well, not before we sort out JuliaStrings/utf8proc#42. :-) |
It would also be friendlier to distro packagers to put this change (#11493) in under a conditional utf8proc version number check. |
Since @ScottPJones was mentioning upper/lowercase functions recently, I took a quick look at them and I noticed that we are calling towupper and towlower, which are C99 functions that accept wchar_t arguments.

Unfortunately, this means that they are broken on Windows (where wchar_t is 16 bits) for any character outside the BMP. Even on other platforms with a 32-bit wchar_t, they are going to return different results on different systems, and many systems will have out-of-date Unicode tables. They are also locale-dependent; I'm not sure if this is desirable for us.

utf8proc has up-to-date upper/lower/titlecase mapping data already in its "database" (generated from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), so maybe we should just add a utf8proc_toupper function (etc.) to utf8proc to make this accessible. Then we could call that (probably plus a check for the common case of ASCII codepoints).
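
To make the proposal concrete, here is a minimal sketch in C of the "ASCII fast path plus table lookup" idea. It is an illustration only, not the code that went into utf8proc: the names `sketch_toupper`, `case_pair`, and `SKETCH_TABLE` are hypothetical, and the four-entry table is a hand-written stand-in for mapping data that would really be generated from UnicodeData.txt. The argument is a full 32-bit code point rather than a `wchar_t`, so characters outside the BMP are handled the same way on every platform, and nothing depends on the locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a case-mapping database. Real data would be
 * generated from UnicodeData.txt; these four entries are for illustration. */
typedef struct {
    int32_t lower;
    int32_t upper;
} case_pair;

const case_pair SKETCH_TABLE[] = {
    { 0x00E9,  0x00C9  }, /* LATIN SMALL LETTER E WITH ACUTE */
    { 0x03B1,  0x0391  }, /* GREEK SMALL LETTER ALPHA */
    { 0x0444,  0x0424  }, /* CYRILLIC SMALL LETTER EF */
    { 0x10428, 0x10400 }, /* DESERET SMALL LETTER LONG I (outside the BMP) */
};

/* Locale-independent uppercase mapping for a single code point.
 * Code points without a known mapping are returned unchanged. */
int32_t sketch_toupper(int32_t c)
{
    /* Common case: ASCII needs no table lookup at all. */
    if (c >= 0 && c < 0x80) {
        return (c >= 'a' && c <= 'z') ? c - 0x20 : c;
    }

    /* Linear scan is enough for a sketch; a real table would be indexed
     * or searched far more efficiently. */
    size_t n = sizeof SKETCH_TABLE / sizeof SKETCH_TABLE[0];
    for (size_t i = 0; i < n; i++) {
        if (SKETCH_TABLE[i].lower == c) {
            return SKETCH_TABLE[i].upper;
        }
    }
    return c;
}
```

With this sketch, `sketch_toupper(0x10428)` yields `0x10400` regardless of the platform's `wchar_t` width, which is exactly the case a 16-bit `wchar_t` cannot represent in a single unit.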