Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uppercase/lowercase functions are not portable? #11471

Closed
stevengj opened this issue May 28, 2015 · 22 comments · Fixed by JuliaStrings/utf8proc#40 or #11493
Closed

uppercase/lowercase functions are not portable? #11471

stevengj opened this issue May 28, 2015 · 22 comments · Fixed by JuliaStrings/utf8proc#40 or #11493
Labels
unicode Related to unicode characters and encodings

Comments

@stevengj
Copy link
Member

Since @ScottPJones was mentioning upper/lowercase functions recently, I took a quick look at them and I noticed that we are calling towupper and towlower, which are C99 functions that accept wchar_t arguments.

Unfortunately, this means that they are broken on Windows (where wchar_t is 16 bits) for any character outside the BMP. Even on other platforms with a 32-bit wchar_t, they are going to return different results on different systems, and many systems will have out-of-date Unicode tables. They are also locale-dependent; I'm not sure if this is desirable for us.

utf8proc has up-to-date upper/lower/titlecase mapping data already in its "database" (generated from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), so maybe we should just add a utf8proc_toupper function (etc.) to utf8proc to make this accessible. Then we could call that (probably plus a check for the common case of ASCII codepoints).

@stevengj stevengj added the unicode Related to unicode characters and encodings label May 28, 2015
@ScottPJones
Copy link
Contributor

Yep... although I wouldn't solve this with utf8proc... I'd intended to tackle this shortly (it's something else we need for our project, and I'd prefer to use Julia as much as possible).

@JeffBezanson
Copy link
Member

although I wouldn't solve this with utf8proc

Why not? If utf8proc does this wrong, we should just fix it instead of re-implementing.

@ScottPJones
Copy link
Contributor

Because, I believe it can better be implemented in pure Julia (sorry, now I'm a convert!)

@jiahao
Copy link
Member

jiahao commented May 28, 2015

Essentially a duplicate of #7848

The Unicode standard defines upper-lower case mappings on a per character basis, but a correct transformation must necessarily take into account the locale. (See the infamous Turkish i - İ, ı - I pairings as the most egregious examples.)

@jiahao jiahao closed this as completed May 28, 2015
@ScottPJones
Copy link
Contributor

I don't think this really is a duplicate of #7848, and should be reopened.
If you look at ICU (probably the most widely used library for this sort of stuff), you can do either locale dependent or locale independent case mappings...
I think that in Base, Julia should give consistent results on all platforms, using locale independent mappings.
I think fancy locale dependent mappings, as discussed in #7848, belong in a package (I'll have to look, they may already be handled by the ICU.jl package).

@jiahao
Copy link
Member

jiahao commented May 28, 2015

Did you read the discussion in #7848? The conclusion was exactly what you said, that we should have a locale independent choice in Base and have the locale specific choices not in base.

@ScottPJones
Copy link
Contributor

I missed the last couple of comments (I'm in a meeting right now!) Sorry, my fault totally... The only difference is that I think any locale dependent mappings should be done as extending the Base methods via a Package...

@ScottPJones
Copy link
Contributor

Also, I think @stevengj's point was something else, and this may need to still be open... it was about characters outside the BMP not being handled (even in a local independent fashion), on Windows,
simply due to the way the functions are implemented...

@jiahao
Copy link
Member

jiahao commented May 28, 2015

If it really makes you happier, I'll close #7848 in favor of this issue.

@ScottPJones
Copy link
Contributor

Actually, I don't think #7848 needs to be closed either 😀
IMO, this one is about non-BMP case mapping being broken on Windows (good catch, @stevengj),
and #7848 is about the need for having both locale independent (in Base) and locale dependent (hopefully in a package outside of Base).

@stevengj
Copy link
Member Author

Pure Julia code for this has its pluses and minuses. On the minus side, it is more work: this information is already in utf8proc, and adding a couple of functions to expose it to Julia is literally something like 6 lines of C code. Also on the minus side, having it in utf8proc makes it available to non-Julia users. On the plus side, writing code in Julia is more fun, and potentially allows for more optimizations (e.g. inlining).

For my own part, I tend to default to the path of least work.

@ScottPJones
Copy link
Contributor

About the minuses... more work, well, maybe, but it's something that I'd like to do when I want to take a break from other stuff 😀 (i.e. the Julia is fun part), plus doing it in Julia helps improve my Julia skills, and finally, I'm not talking at all about removing anything from utf8proc, so utf8proc is still available to non-Julia users (how many are there?)
About the pluses... I've been thinking about ways to make the size of the data structures much smaller,
and to write the code to generate the data structures (probably into a C source code file) in Julia also...
[I posted on julia-users, https://groups.google.com/forum/#!topic/julia-users/CyrLc_E0dac, to try to get some help to learn the best "julian" way of doing that, but nobody responded... :sad:]

@stevengj
Copy link
Member Author

Actually, it looks like the towupper functions etcetera take a wint_t, even on Windows, which in my understanding is at least 32 bits wide. In which case the Julia ccall is wrong on Windows.

But I'm confused, because we have test coverage of lowercase etc... if we are passing the wrong type on Windows, how could we not have noticed? Does sizeof(wint_t) == 2 on Windows?

@stevengj
Copy link
Member Author

Sorry, I didn't realize that (for #....) would close an issue on github; I thought it had to be fixed or closes or similar.

@stevengj
Copy link
Member Author

By the way, shouldn't ucfirst be using titlecase rather than uppercase for the first letter?

@ScottPJones
Copy link
Contributor

Yes about using titlecase

@ScottPJones
Copy link
Contributor

BTW, could you reopen #7848, which really is separate, and isn't addressed by your nice fix to this issue?

@jiahao
Copy link
Member

jiahao commented May 30, 2015

I don't see the need. We already decided that locale-specific transformations should not belong in base, which is the remaining part of that issue.

@nalimilan
Copy link
Member

@stevengj I'd have preferred we wait for a true utf8proc release before relying on it. Now I need to package this development version to build the nightlies...

@tkelman
Copy link
Contributor

tkelman commented Jun 3, 2015

There is a tag https://github.com/JuliaLang/utf8proc/releases/tag/1.3-dev1 but it's got a -dev1 marker on it, should we promote that to 1.3.0?

@nalimilan
Copy link
Member

Well, not before we sort out JuliaStrings/utf8proc#42. :-)

@tkelman
Copy link
Contributor

tkelman commented Jun 3, 2015

It would also be friendlier to distro packagers to put this change (#11493) in under a conditional utf8proc version number check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
6 participants