switch to utf8proc's upper/lowercase functions #11493

stevengj · 2015-05-30T03:57:17Z

This PR uses utf8proc rather than the system's towlower and towupper. (Fixes #11471) Rationale:

Correct on Windows (outside of the BMP, the Windows functions are inevitably wrong since they use a 16-bit wchar_t).
Portable — OS-independent results.
Up-to-date with Unicode 7. (On the latest MacOS X, the system functions are wrong for about 2000 codepoints.)

The results are also locale-independent, so they don't address the request in #7848 for more sophisticated, context- and locale-dependent results. (I tend to agree with Jeff: that belongs in an external package like ICU.)
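As a rough illustration of the points above, here is a minimal sketch, assuming a current Julia release in which Base's uppercase and lowercase are backed by utf8proc. The Deseret letter pair is only a convenient example of a cased letter outside the BMP; it does not come from this PR.

```julia
# Case mapping outside the BMP: DESERET SMALL LETTER LONG I (U+10428)
# pairs with DESERET CAPITAL LETTER LONG I (U+10400).
small = '\U10428'
capital = '\U10400'
println(uppercase(small) == capital)
println(lowercase(capital) == small)

# Locale-independent: 'i' maps to plain 'I' whatever the system locale is.
println(uppercase('i') == 'I')

# The same simple mappings are applied character by character to strings.
println(uppercase("αβγ") == "ΑΒΓ")
```

Each line should print true on any platform, which is the portability being claimed here.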

(@ScottPJones wants to rewrite the relevant portions of utf8proc in pure Julia. That may take a while, though, and I don't think it makes sense to wait on that to fix this when the data is already trivially available in utf8proc.)

(While I agree with @JeffBezanson that many non-ASCII uses of lower/uppercase are probably mistakes from people who should really be using normalization+casefolding, I agree with @StefanKarpinski that uppercase and lowercase are simply functions that people expect to have in any string library, and we might as well implement them correctly.)
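For readers wondering what "normalization+casefolding" looks like in practice, here is a small sketch. It assumes a current Julia release, where the utf8proc-backed normalization is exposed as Unicode.normalize in the Unicode standard library (at the time of this PR the function had a different name). The helper caseless_equal is hypothetical, not something in Base or in this PR.

```julia
using Unicode

# Hypothetical helper: caseless comparison via normalization plus case
# folding, rather than via lowercase/uppercase.
caseless_equal(a::AbstractString, b::AbstractString) =
    Unicode.normalize(a, casefold=true) == Unicode.normalize(b, casefold=true)

println(caseless_equal("JuLiA", "julia"))

# Precomposed and combining-accent spellings of "é" also compare equal,
# because normalize composes canonically by default.
println(caseless_equal("\u00e9", "e\u0301"))
```

Plain lowercase would handle the first comparison but not the second, which is the kind of mistake the comment above has in mind.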

stevengj · 2015-05-30T04:09:42Z

As an incidental benefit, the new functions seem to be considerably faster on my machine (MacOS) for non-ASCII chars. For

function foo(c, n)
    for i = 1:n
        c = uppercase(c)
    end
end

I get

julia> @time foo('a', 10^7)
  42.339 milliseconds (5 allocations: 160 bytes)

julia> @time foo('α', 10^7)
  82.697 milliseconds (5 allocations: 160 bytes)

with the new version and

julia> @time foo('a', 10^7)
  48.216 milliseconds (5 allocations: 160 bytes)

julia> @time foo('α', 10^7)
 226.551 milliseconds (5 allocations: 160 bytes)

with the old (libc) version. (Although obviously the latter is platform-dependent.)

ScottPJones · 2015-05-30T07:02:15Z

👍!

ScottPJones · 2015-05-30T07:29:20Z

When I mentioned rewriting this to not use utf8proc, I didn't mean to imply that fixing this issue with utf8proc wasn't a very good thing... I just didn't want you to be bothered if something I do later supersedes this... This also has the benefit of being easily backportable to v0.3.x, n'est-ce pas?

tkelman · 2015-05-30T08:35:37Z

base/utf8proc.jl

@@ -121,6 +121,12 @@ end

 charwidth(c::Char) = Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), c))

+# faster x+y that does no overflow checking
+p(x::Char, y::UInt32) = reinterpret(Char, reinterpret(UInt32, x) + y)


Can this please not be a 1-letter function name? I know it's not exported, but still...

ScottPJones · 2015-05-30

Yep... I agree with @tkelman, no 1-letter (or 1-letterlike number such as 𝟙 😀!) function names...
Also... this function wouldn't have been necessary if you all had agreed to my PR #11103! 😀 Maybe you could reopen #11103, and make this a lot cleaner?
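For illustration only, here is one way such a helper might look with a descriptive name. This is a sketch for current Julia, where Char no longer stores the raw code point, so it converts through UInt32 instead of using reinterpret as the diff does. The name add_offset is hypothetical and is not what the PR ended up using.

```julia
# Hypothetical, descriptively named variant of the helper under review.
# The UInt32 addition wraps on overflow instead of throwing; the Char
# constructor may still reject values that are not valid code points.
add_offset(c::Char, offset::UInt32) = Char(UInt32(c) + offset)

println(add_offset('a', UInt32(1)))
# 0x61 + 0xffffffff wraps around to 0x60, the backtick character.
println(add_offset('a', typemax(UInt32)) == '`')
```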

tkelman · 2015-05-30T09:31:54Z

this also has the benefit of being easily backportable to v0.3.x, n'est-ce pas?

Sort of, but it's not quite a direct cherry-pick, since release-0.3 is still using the last PSG release of utf8proc, 1.1.6. I'm not sure whether upgrading the dependency library would change any user-visible results, aside from fixing this particular bug.

ScottPJones · 2015-05-30T09:43:11Z

Ah... didn't know that... (I never use v0.3.x [except occasionally to see if something is a regression])
I always have lived on the bleeding edge! (like using a new language, with major changes occurring, to base my future livelihood on! 😀)

ScottPJones · 2015-05-30T09:44:54Z

Could 0.3.9 be made to use a patch (maintenance?) release of utf8proc, i.e. just 1.1.6 plus @stevengj's changes?

tkelman · 2015-05-30T09:51:17Z

No, because I'll be tagging 0.3.9 in an hour or two. Upgrading for 0.3.10 might be a possibility but would want to test thoroughly. Completely updating the unicode tables might be too drastic for a patch release.

ScottPJones · 2015-05-30T10:03:49Z

Ah... OK, didn't realize that 0.3.9 was already ready, should have said 0.3.next ... and didn't realize that it would involve new Unicode tables... never mind then (I'm not going to use 0.3.x anyway 😀)

stevengj · 2015-05-30T13:20:04Z

@ScottPJones, I suspect that the right thing to do is to just change Char - Char and Char ± Int32 so that they do not check for overflow by default, consistent with Julia's treatment of fixed-width integer arithmetic and its typical assumption that character data is valid except when conversions are performed. However, that should be a separate PR.

ScottPJones · 2015-05-30T14:50:24Z

@stevengj I think Char ± Int32 is incorrect now for another reason... it should return an Integer, not a Char, so that when/if we have Char always be validated, it cannot be used to bypass the abstraction...
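To make the two positions easier to follow, here is a short sketch of the operations being discussed, as they behave on a current Julia release (an assumption; the behavior at the time of this thread may have differed).

```julia
# The difference of two characters is an integer; an integer offset
# applied to a character gives back a character.
println('b' - 'a')
println('a' + 1)
println(typeof('b' - 'a'), " ", typeof('a' + 1))

# Fixed-width integer arithmetic wraps silently rather than throwing,
# which is the convention referred to above.
println(typemax(Int32) + Int32(1) == typemin(Int32))
```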

tkelman · 2015-05-30T15:13:05Z

This needs to update the checksums via NO_GIT. It turns out the branch-tracking behavior of @vtjnash's git-externals is a bad idea, ref #10743 (comment) - we should point to tags, or strictly to SHAs. Note the incredibly easy to miss "SHA mismatch" warning? The build-breaking change on utf8proc master breaks the Windows build here. I'm a little surprised that it doesn't break Travis too.

stevengj · 2015-05-30T22:38:45Z

@tkelman, could you elaborate? I thought the utf8proc.version file was just a branch name and a commit hash. What else do I need to do? make NO_GIT=1 or something? This is confusing.

The Makefile change on utf8proc master HEAD shouldn't be relevant in this patch, since I am not pulling from HEAD, no?

vtjnash · 2015-05-30T22:49:54Z

It was not intended for the tracked branch to be named "master" or "develop". It should be either a stable release branch or (typically) a tag.

stevengj · 2015-05-31T12:28:13Z

I see, so the SHA is not actually a commit in that branch to use, it is just a redundant checksum; you always check out the tip of that branch? Hopefully it is fixed now.

stevengj · 2015-05-31T16:07:51Z

Yay, tests passing.

ScottPJones · 2015-05-31T16:55:22Z

Will this also make it pick up my bug fixes in utf8proc?
Good change, even if not.

stevengj · 2015-05-31T17:23:22Z

@ScottPJones, yes, it's updating to the latest utf8proc version, which includes your fix for noncharacters.

ScottPJones · 2015-05-31T17:31:08Z

Thanks!

tkelman · 2015-06-01T07:33:31Z

You need to do make -C deps install-utf8proc NO_GIT=1 to generate new checksums. Should also delete the old checksums for 1.2.

stevengj · 2015-06-01T10:07:35Z

Grr, why is the process of updating the dependency so annoying?

ScottPJones · 2015-06-01T10:10:38Z

Hehe... I'm glad you are working out the kinks of getting utf8proc updated, instead of me!

tkelman · 2015-06-01T10:10:54Z

Probably because we're stretching gmake a bit further than it should really be taken and rolling our own thing here. Which we also need to document better, and fix up so the sha is the driver rather than the branch.

stevengj · 2015-06-01T11:53:27Z

Okay, updated the checksums; this is the first time I've noticed the deps/checksums directory.

tkelman · 2015-06-01T12:04:12Z

Guess we didn't advertise it much, but it's been used for quite a few months now to verify integrity of downloaded tarballs.

Looks good to merge, assuming green CI.
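The integrity check mentioned above amounts to hashing the downloaded file and comparing the digest with a stored value. Here is a minimal sketch of that idea, assuming the SHA standard library of a current Julia release. The helper checksum_matches is hypothetical, and the choice of SHA-256 is an assumption; this thread does not say which hash the deps/checksums files use.

```julia
using SHA

# Hypothetical helper: compare a file's SHA-256 digest with an expected
# hex string, ignoring the case of the hex digits.
function checksum_matches(path::AbstractString, expected_hex::AbstractString)
    actual = open(path) do io
        bytes2hex(sha256(io))
    end
    return actual == lowercase(expected_hex)
end

# Self-contained demo on a temporary file holding the bytes "abc",
# whose SHA-256 digest is a standard test vector.
path, io = mktemp()
write(io, "abc")
close(io)
println(checksum_matches(path,
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"))
rm(path)
```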

ScottPJones · 2015-06-01T13:25:12Z

Yeah!

stevengj added the unicode Related to unicode characters and encodings label May 30, 2015

tkelman reviewed May 30, 2015

stevengj force-pushed the utf8proc_case branch from 37ec3d6 to 89061b2 Compare May 30, 2015 12:38

stevengj force-pushed the utf8proc_case branch from 89061b2 to becc6eb Compare May 31, 2015 12:27

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

2945c84

…ixes JuliaLang#11471)

stevengj force-pushed the utf8proc_case branch from becc6eb to 2945c84 Compare June 1, 2015 11:52

stevengj added a commit that referenced this pull request Jun 1, 2015

Merge pull request #11493 from stevengj/utf8proc_case

0623698

switch to utf8proc's upper/lowercase functions

stevengj merged commit 0623698 into JuliaLang:master Jun 1, 2015

stevengj deleted the utf8proc_case branch June 1, 2015 13:13

tkelman mentioned this pull request Jun 3, 2015

uppercase/lowercase functions are not portable? #11471

Closed

nalimilan added a commit that referenced this pull request Jan 7, 2016

Document dependency on LAPACK >= 3.5 and utf8proc >= 1.3

fa96177

LAPACK 3.5 is required since #14389. utf8proc 1.3 since #11493. [av skip]
