switch to utf8proc's upper/lowercase functions #11493
As an incidental benefit, the new functions seem to be considerably faster on my machine (macOS) for non-ASCII chars. For
function foo(c, n)
    for i = 1:n
        c = uppercase(c)
    end
end
I get
with the new version and
with the old (libc) version. (Although obviously the latter is platform-dependent.)
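A rough way to reproduce this kind of measurement is sketched below. It reuses `foo` from the comment above with an explicit `return` added so the result can be checked. The test character 'α', the iteration count, and the use of `@elapsed` are illustrative assumptions, not what was originally run.

```julia
# Benchmark loop from the comment above; the `return c` is an addition so
# that the final character can be inspected.
function foo(c, n)
    for i = 1:n
        c = uppercase(c)
    end
    return c
end

foo('α', 1)                  # warm-up call so compilation is not timed
t = @elapsed foo('α', 10^6)  # 'α' and 10^6 are arbitrary, illustrative choices
println("elapsed: ", t, " seconds")
```

After the first iteration the character is already uppercase, so the loop mostly measures the cost of calling `uppercase` on a non-ASCII capital letter. Absolute numbers will vary by platform and Julia version.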
👍!
I didn't mean to imply when I said that about rewriting to not use
@@ -121,6 +121,12 @@ end

charwidth(c::Char) = Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), c))

# faster x+y that does no overflow checking
p(x::Char, y::UInt32) = reinterpret(Char, reinterpret(UInt32, x) + y)
Can this please not be a 1-letter function name? I know it's not exported, but still...
Sort of, but it's not quite a direct cherry-pick, since release-0.3 is still using the last PSG release, 1.1.6, of utf8proc. Not sure whether upgrading the dependency library would change any user-visible code results, aside from fixing this particular bug.
Ah... didn't know that... (I never use v0.3.x [except occasionally to see if something is a regression])
Could 0.3.9 be made to use a patch (maintenance?) release of utf8proc, i.e. just 1.1.6 plus @stevengj's changes?
No, because I'll be tagging 0.3.9 in an hour or two. Upgrading for 0.3.10 might be a possibility, but I would want to test thoroughly. Completely updating the Unicode tables might be too drastic for a patch release.
Ah... OK, didn't realize that 0.3.9 was already ready, should have said 0.3.next ... and didn't realize that it would involve new Unicode tables... never mind then (I'm not going to use 0.3.x anyway 😀)
@ScottPJones, I suspect that the right thing to do is to just change
@stevengj I think
This needs to update the checksums via
@tkelman, could you elaborate? I thought the The
It was not intended for the tracked branch to be named "master" or "develop". It should either be a stable release branch or (typically) a tag.
I see, so the SHA is not actually a commit in that branch to use, it is just a redundant checksum; you always check out the tip of that branch? Hopefully it is fixed now.
Yay, tests passing.
Will this also make it pick up my bug fixes in
@ScottPJones, yes, it's updating to the latest utf8proc version, which includes your fix for noncharacters.
Thanks!
You need to do
Grr, why is the process of updating the dependency so annoying?
Hehe... I'm glad you are working out the kinks of getting utf8proc updated, instead of me!
Probably because we're stretching gmake a bit further than it should really be taken and rolling our own thing here. Which we also need to document better, and fix up so the sha is the driver rather than the branch.
Okay, updated the checksums; this is the first time I've noticed the
Guess we didn't advertise it much, but it's been used for quite a few months now to verify integrity of downloaded tarballs. Looks good to merge, assuming green CI.
Yeah!
This PR uses utf8proc rather than the system's towlower and towupper. (Fixes #11471)
Rationale: … wchar_t).
The results are also locale-independent, so they don't address the request in #7848 for more sophisticated, context- and locale-dependent results. (I tend to agree with Jeff: that belongs in an external package like ICU.)
(@ScottPJones wants to rewrite the relevant portions of utf8proc in pure Julia. That may take a while, though, and I don't think it makes sense to wait on that to fix this when the data is already trivially available in utf8proc.)
(While I agree with @JeffBezanson that many non-ASCII uses of lower/uppercase are probably mistakes from people who should really be using normalization+casefolding, I agree with @StefanKarpinski that uppercase and lowercase are simply functions that people expect to have in any string library, and we might as well implement them correctly.)
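To make the distinction between case mapping and normalization+casefolding concrete, here is a small sketch. It assumes a Julia 1.x installation, where the normalization routine is `Unicode.normalize` in the `Unicode` standard library (the function had a different name when this PR was written). The sample strings are made up for illustration.

```julia
using Unicode  # assumption: Julia 1.x, where normalization lives in this stdlib

s = "Straße"

# Plain case mapping: each character maps to a single character.
lower = lowercase(s)      # "straße" (the ß is unchanged)
upper = uppercase("é")    # "É"

# Case folding, intended for caseless comparison: ß folds to "ss".
folded = Unicode.normalize(s, casefold=true)

println(lower, " ", upper, " ", folded)

# Two spellings that differ under lowercase compare equal after folding.
println(lowercase("STRASSE") == lower)                            # false
println(Unicode.normalize("STRASSE", casefold=true) == folded)    # true
```

This is why comparing strings via `lowercase` can give surprising answers for non-ASCII text, while `uppercase` and `lowercase` remain useful for display-oriented conversions.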