Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch to utf8proc's upper/lowercase functions #11493

Merged
merged 1 commit into from
Jun 1, 2015

Conversation

stevengj
Copy link
Member

This PR uses utf8proc rather than the system's towlower and towupper. (Fixes #11471) Rationale:

  • Correct on Windows (outside of the BMP, the Windows functions are inevitably wrong since they use a 16-bit wchar_t).
  • Portable — OS-independent results.
  • Up-to-date with Unicode 7. (On the latest MacOS X, the system functions are wrong for about 2000 codepoints.)

The results are also locale-independent, so they don't address the request in #7848 for more sophisticated, context- and locale-dependent results. (I tend to agree with Jeff: that belongs in an external package like ICU.)

(@ScottPJones wants to rewrite the relevant portions of utf8proc in pure Julia. That may take a while, though, and I don't think it makes sense to wait on that to fix this when the data is already trivially available in utf8proc.)

(While I agree with @JeffBezanson that many non-ASCII uses of lower/uppercase are probably mistakes from people who should really be using normalization+casefolding, I agree with @StefanKarpinski that uppercase and lowercase are simply functions that people expect to have in any string library, and we might as well implement them correctly.)

@stevengj stevengj added the unicode Related to unicode characters and encodings label May 30, 2015
@stevengj
Copy link
Member Author

As an incidental benefit, the new functions seem to be considerably faster on my machine (MacOS) for non-ASCII chars. For

function foo(c, n)
    for i = 1:n
        c = uppercase(c)
    end
end

I get

julia> @time foo('a', 10^7)
  42.339 milliseconds (5 allocations: 160 bytes)

julia> @time foo('α', 10^7)
  82.697 milliseconds (5 allocations: 160 bytes)

with the new version and

julia> @time foo('a', 10^7)
  48.216 milliseconds (5 allocations: 160 bytes)

julia> @time foo('α', 10^7)
 226.551 milliseconds (5 allocations: 160 bytes)

with the old (libc) version. (Although obviously the latter is platform-dependent.)

@ScottPJones
Copy link
Contributor

👍!

@ScottPJones
Copy link
Contributor

I didn't mean to imply when I said that about rewriting to not use utf8proc that fixing this issue using utf8proc was not a very good thing... just didn't want you to get bothered if later on something I do supersedes this... this also has the benefit of being easily back portable to v0.3.x, n'est pas?

@@ -121,6 +121,12 @@ end

charwidth(c::Char) = Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), c))

# faster x+y that does no overflow checking
p(x::Char, y::UInt32) = reinterpret(Char, reinterpret(UInt32, x) + y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this please not be a 1-letter function name? I know it's not exported, but still...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep... I agree with @tkelman, no 1-letter (or 1 letterlike number such as 𝟙 😀!) function names...
Also... this function wouldn't have been necessary if you all had agreed to my PR #11103! 😀 Maybe you could reopen #11103, and make this a lot cleaner?

@tkelman
Copy link
Contributor

tkelman commented May 30, 2015

this also has the benefit of being easily back portable to v0.3.x, n'est pas?

Sort of but it's not quite a direct cherry-pick, since release-0.3 is still using the last PSG release, 1.1.6, of utf8proc. Not sure whether upgrading the dependency library would change any user-visible code results, aside from fixing this particular bug.

@ScottPJones
Copy link
Contributor

Ah... didn't know that... (I never use v0.3.x [except occasionally to see if something is a regression])
I always have lived on the bleeding edge! (like using a new language, with major changes occurring, to base my future livelihood on! 😀)

@ScottPJones
Copy link
Contributor

Could 0.3.9 be made to use a patch (?) (maintenance release?) of utf8proc?, i.e. just 1.1.6 plus @stevengj's changes?

@tkelman
Copy link
Contributor

tkelman commented May 30, 2015

No, because I'll be tagging 0.3.9 in an hour or two. Upgrading for 0.3.10 might be a possibility but would want to test thoroughly. Completely updating the unicode tables might be too drastic for a patch release.

@ScottPJones
Copy link
Contributor

Ah... OK, didn't realize that 0.3.9 was already ready, should have said 0.3.next ... and didn't realize that it would involve new Unicode tables... never mind then (I'm not going to use 0.3.x anyway 😀)

@stevengj
Copy link
Member Author

@ScottPJones, I suspect that the right thing to do is to just change Char - Char and Char ± Int32 so that they do not check for overflow by default, consistent with Julia's treatment of fixed-width integer arithmetic and its typical assumption that character data is valid except when conversions are performed. However, that should be a separate PR.

@ScottPJones
Copy link
Contributor

@stevengj I think Char ± Int32 is incorrect now for another reason... it should return an Integer, not a Char, so that when/if we have Char always be validated, it cannot be used to bypass the abstraction...

@tkelman
Copy link
Contributor

tkelman commented May 30, 2015

This needs to update the checksums via NO_GIT. It turns out the branch-tracking behavior of @vtjnash's git-externals is a bad idea, ref #10743 (comment) - we should point to tags, or strictly sha's. Note the incredibly easy to miss "SHA mismatch" warning? The broken build change on master of utf8proc breaks the Windows build here. I'm a little surprised that it doesn't break Travis too.

@stevengj
Copy link
Member Author

@tkelman, could you elaborate? I thought the utf8proc.version file was just a branch name and a commit hash. What else do I need to do? make NO_GIT=1 or something? This is confusing.

The Makefile change on utf8proc master HEAD shouldn't be relevant in this patch, since I am not pulling from HEAD, no?

@vtjnash
Copy link
Member

vtjnash commented May 30, 2015

It was not intended for the tracked brach to be named "master" or "develop". It should either be a stable release brach or (typically) a tag.

@stevengj
Copy link
Member Author

I see, so the SHA is not actually a commit in that branch to use, it is just a redundant checksum; you always check out the tip of that branch? Hopefully it is fixed now.

@stevengj
Copy link
Member Author

Yay, tests passing.

@ScottPJones
Copy link
Contributor

Will this also make it pick up my bug fixes in utf8proc?
Good change, even if not

@stevengj
Copy link
Member Author

@ScottPJones, yes, it's updating to the latest utf8proc version, which includes your fix for noncharacters.

@ScottPJones
Copy link
Contributor

Thanks!

@tkelman
Copy link
Contributor

tkelman commented Jun 1, 2015

You need to do make -C deps install-utf8proc NO_GIT=1 to generate new checksums. Should also delete the old checksums for 1.2.

@stevengj
Copy link
Member Author

stevengj commented Jun 1, 2015

Grr, why is the process of updating the dependency so annoying?

@ScottPJones
Copy link
Contributor

Hehe... I'm glad you are working out the kinks of getting utf8proc updated, instead of me!

@tkelman
Copy link
Contributor

tkelman commented Jun 1, 2015

Probably because we're stretching gmake a bit further than it should really be taken and rolling our own thing here. Which we also need to document better, and fix up so the sha is the driver rather than the branch.

@stevengj
Copy link
Member Author

stevengj commented Jun 1, 2015

Okay, updated the checksums; this is the first time I've noticed the deps/checksums directory.

@tkelman
Copy link
Contributor

tkelman commented Jun 1, 2015

Guess we didn't advertise it much, but it's been used for quite a few months now to verify integrity of downloaded tarballs.

Looks good to merge, assuming green CI.

stevengj added a commit that referenced this pull request Jun 1, 2015
switch to utf8proc's upper/lowercase functions
@stevengj stevengj merged commit 0623698 into JuliaLang:master Jun 1, 2015
@stevengj stevengj deleted the utf8proc_case branch June 1, 2015 13:13
@ScottPJones
Copy link
Contributor

Yeah!

nalimilan added a commit that referenced this pull request Jan 7, 2016
LAPACK 3.5 is required since #14389. utf8proc 1.3 since #11493.

[av skip]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

uppercase/lowercase functions are not portable?
4 participants