RFC: export utf8proc Unicode transformation functionality in Julia #5576
Conversation
On second thought, I think it's better to delay implementation of something like
This all looks fantastic. Thanks for putting it all together. Let's think of what tests we should put in. On top of the very simplest of tests (checking the three new functions on some sample of Unicode strings), one potential concern is if
I can see
I suspect that functions like
Yes,
We know some things are certainly missing: #3721
Unfortunately there are quite a few code points for which `is_valid_utf8` and `is_valid_char` disagree:

```julia
julia> inconsistencies = [is_valid_utf8(string(char(a))) != is_valid_char(char(a)) for a in 0:0x10ffff];

julia> uint32(find(identity, inconsistencies))-1
2114-element Array{Uint32,1}:
 0x0000d800
 0x0000d801
 0x0000d802
 0x0000d803
 0x0000d804
          ⋮
 0x000effff
 0x000ffffe
 0x000fffff
 0x0010fffe
 0x0010ffff
```

In all cases,
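(Editorial aside: the entries at the top of that list are the UTF-16 surrogate range U+D800–U+DFFF, which are code points but not Unicode scalar values, so a strict UTF-8 encoder must reject them. A minimal sketch in Python's codec machinery, used here purely to illustrate the concept, not the Julia functions under discussion:)

```python
# Surrogate code points exist as integers but are excluded from UTF-8:
# a strict encoder must reject U+D800..U+DFFF, while U+10FFFF is fine.
for cp in (0xD800, 0xDFFF, 0x10FFFF):
    try:
        chr(cp).encode("utf-8")
        print(f"U+{cp:04X}: encodable")
    except UnicodeEncodeError:
        print(f"U+{cp:04X}: not encodable")
```

This prints `not encodable` for both surrogates and `encodable` for U+10FFFF, matching the shape of the inconsistency list above.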
The default behavior of `normalize_string` strips combining marks:

```julia
julia> a = string(char(0x0041), char(0x030a))
"Å"

julia> b = normalize_string(a)
"A"

julia> length(b)
1

julia> int(b[1])
65
```

It seems unlikely that this is what most users would want as a default; perhaps a better choice is to specify the flags which would make
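(For comparison, the standard normal forms do not discard the accent: canonical composition (NFC) folds A + COMBINING RING ABOVE into the single code point U+00C5, and canonical decomposition (NFD) keeps both code points. A sketch with Python's `unicodedata`, illustrative only:)

```python
import unicodedata

a = "\u0041\u030A"  # "A" followed by COMBINING RING ABOVE
nfc = unicodedata.normalize("NFC", a)
nfd = unicodedata.normalize("NFD", a)
# NFC composes to the single precomposed character U+00C5;
# NFD leaves the two-code-point sequence as is.
print(len(a), f"U+{ord(nfc):04X}", len(nfd))  # 2 U+00C5 2
```

A default that behaves like NFC would preserve the ring rather than dropping it.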
There is also the question of how we want to deal with future versions of Unicode standards. The current stable release of
Further inconsistencies:

```julia
julia> # Upper case identification

julia> inconsistencies = [isupper(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('A')) for a in 0:0x10ffff];

julia> uint32(find(identity, inconsistencies))-1
134-element Array{Uint32,1}:
 0x0000023a
 0x0000023b
 0x0000023d
 0x0000023e
 0x00000241
          ⋮
 0x00002ce0
 0x00002ce2
 0x00010426
 0x00010427
 0x0001d7ca

julia> # Lower case identification

julia> inconsistencies = [islower(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('a')) for a in 0:0x10ffff];

julia> uint32(find(identity, inconsistencies))-1
285-element Array{Uint32,1}:
 0x00000221
 0x00000234
 0x00000235
 0x00000236
 0x00000237
          ⋮
 0x0001044f
 0x0001d4c1
 0x0001d6a4
 0x0001d6a5
 0x0001d7cb

julia> # Digit identification

julia> inconsistencies = [isdigit(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('1')) for a in 0:0x10ffff];

julia> uint32(find(identity, inconsistencies))-1
280-element Array{Uint32,1}:
 0x00000660
 0x00000661
 0x00000662
 0x00000663
 0x00000664
          ⋮
 0x0001d7fb
 0x0001d7fc
 0x0001d7fd
 0x0001d7fe
 0x0001d7ff
```
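(Editorial aside: the digit mismatches above begin at U+0660–U+0669, the Arabic-Indic digits. These carry the Unicode general category Nd, exactly like ASCII `'1'`, but libc's `isdigit` accepts only ASCII 0–9, which is the source of the disagreement. Illustrated with Python's `unicodedata`, not the Julia code in question:)

```python
import unicodedata

# U+0661 ARABIC-INDIC DIGIT ONE is in category Nd, just like ASCII '1',
# so a category-based predicate accepts it while a libc-based one does not.
print(unicodedata.category("1"))       # Nd
print(unicodedata.category("\u0661"))  # Nd
print("\u0661".isdigit())              # True (Python uses the Unicode tables)
```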
I could keep going with this, but I'll stop with perhaps the most nefarious of the lot:

```julia
julia> # Whitespace identification

julia> inconsistencies = [isblank(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code(' ')) for a in 0:0x10ffff];

julia> uint32(find(identity, inconsistencies))-1
18-element Array{Uint32,1}:
 0x00000009
 0x000000a0
 0x00001680
 0x0000180e
 0x00002000
 0x00002001
 0x00002002
 0x00002003
 0x00002004
 0x00002005
 0x00002006
 0x00002007
 0x00002008
 0x00002009
 0x0000200a
 0x0000202f
 0x0000205f
 0x00003000
```
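(Editorial aside: the first entry, U+0009 (tab), shows why this one is nefarious in the other direction: libc's `isblank` accepts tab, but Unicode classifies tab as a control character (Cc), not a space separator (Zs) like U+0020, so a pure category comparison rejects it. The remaining entries are Zs separators that libc misses. Illustrated via Python's `unicodedata`:)

```python
import unicodedata

# Tab is category Cc (control), while NO-BREAK SPACE and IDEOGRAPHIC SPACE
# are Zs (space separator) like the ASCII space.
for cp in (0x20, 0x09, 0xA0, 0x3000):
    print(f"U+{cp:04X} -> {unicodedata.category(chr(cp))}")
```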
(for a more interesting inspection, try
I'm not sure if the question of whether a string is validly encoded UTF-8 can be entirely decoupled from knowledge of its characters. Maybe I'm just misunderstanding what
Any UTF-8 string constructed from
Chars are not bytes. I believe it is true that this routine does not respect the 0x10ffff limit. That could be added easily. Encoding is orthogonal to code points. Imagine you called
Thanks for the clarification. Would directly constructing
Yes. Just be aware that despite what they may say, the Unicode Consortium does not have the authority to ban the integer 0xd800.
OK, so I guess the question now is how to resolve the inconsistencies in utf8proc and libc's isw* functions.
glibc seems to be quite out of date in this regard. They seem to have a couple years of backlog in updating their Unicode tables, for example https://sourceware.org/bugzilla/show_bug.cgi?id=14010.
@jiahao, the default of
A no-op default sounds reasonable.
@jiahao, I did a few spot checks on the
In the case of
The good news is that
In addition to the issue referenced by @JeffBezanson above, glibc#14094 also alludes to an incomplete implementation of Unicode character typing. How much of an issue would it be to replace the
It seems fine to me to use Unicode character classes for this sort of thing, as long as it is documented (with some extensions to count certain control characters as "spaces"/"blanks"); more sensible than maintaining backward compatibility with pre-Unicode conventions from K&R. We could define
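(One way to read "Unicode character classes with extensions for certain control characters" is a predicate like the following. This is a hypothetical sketch in Python; the name and the exact extension set are my assumptions, not something the patch defines:)

```python
import unicodedata

def isspace_unicode(c: str) -> bool:
    # Hypothetical predicate: accept the Unicode separator categories
    # (Zs, Zl, Zp) plus the ASCII control whitespace (tab, LF, VT, FF, CR)
    # as an explicit, documented extension.
    return unicodedata.category(c).startswith("Z") or c in "\t\n\x0b\x0c\r"

print(isspace_unicode("\u00A0"), isspace_unicode("\t"), isspace_unicode("x"))
```

The extension list is exactly the kind of thing that would need to be documented, since it is a deliberate departure from the pure category test.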
However, I think that changing the behavior of
You could also add some of the tests I transcribed from UAX15 here
Upon reflection, it makes more sense to me to make
Regarding the libc functions, there is also the problem that
@JeffBezanson, so
Not sure why Travis is suddenly failing with
It only checks UTF-8 byte stream syntax: whether it is possible to reconstruct a sequence of 32-bit integers from the bytes, with no over-long sequences. Surrogates are only used in UTF-16. The function only deals with issues unique to the UTF-8 encoding. This kind of validation is needed before one can even talk about which code points are valid, since if the byte stream is not well-formed you don't even know which alleged code points are there.
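(The byte-stream syntax checks described here, rejecting over-long sequences and stray continuation bytes before any question of code-point validity arises, can be observed with any strict UTF-8 decoder; a Python illustration, not the routine under discussion:)

```python
# Byte-stream syntax, independent of which code points are "valid":
# an over-long encoding of NUL, a lone continuation byte, and a
# well-formed 4-byte sequence (U+1F600).
for bs in (b"\xc0\x80", b"\x80", b"\xf0\x9f\x98\x80"):
    try:
        bs.decode("utf-8")
        print(bs, "well-formed")
    except UnicodeDecodeError:
        print(bs, "malformed")
```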
I've tested this branch separately on my MacBook and on julia.mit and the tests pass. I think this is ready to merge.
I don't think we should document the
We should only provide Unicode processing functionality that is specified by an international standard?
Not sure if joking, or serious...
On my machine,
@timholy, what is your machine? That function should be linked into libjulia via libutf8proc.
@timholy This produced an error on Travis also, but I was unable to reproduce this on my machines. Would be great to track this one down.
@nolta, being more serious, I would like to see a better argument for removing (or hiding) some functionality, on a case-by-case basis, than "it's nonstandard". For example, removing diacriticals from Unicode strings (ñ → n, etc.) is a common need (google "unicode remove diacritical") and many libraries (e.g. ICU) provide this functionality as well. (Moreover, what utf8proc does can be formally defined fairly easily: perform the canonical decomposition and delete characters in classes Mn, Mc, or Me.)
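(The formal definition just given, canonical decomposition followed by deleting characters in classes Mn, Mc, and Me, is only a few lines in any language with Unicode tables. A sketch in Python's `unicodedata`; the function name is my own, for illustration:)

```python
import unicodedata

def strip_marks(s: str) -> str:
    # Canonically decompose (NFD), then drop combining marks:
    # nonspacing (Mn), spacing combining (Mc), and enclosing (Me).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) not in ("Mn", "Mc", "Me"))

print(strip_marks("\u00F1"), strip_marks("\u00C5"))  # n A
```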
@timholy, can you do
You beat me to it:
Maybe someone specified the
@stevengj Fair enough. If each option has a well-defined meaning independent of the utf8proc library, then I'm OK with exposing them. I'm pretty sure, however, that the
Following up to #5462 and #5434, this patch exposes a bunch of functionality from the bundled utf8proc library in Julia:

- `is_valid_char(s)` is replaced by the implementation in utf8proc, which detects a few more invalid codepoints.
- `is_assigned_char(c)`: new function to return whether a code point is assigned.
- `charcategory(c)`: new function to return a `Base.UnicodeCategory` describing the Unicode general category of the character. I defined a type for this to make it easy to get the general category or the subcategory parts, and for pretty printing, but I expect that most people will just compare these to strings (e.g. `"Lu"` for uppercase letter).
- `normalize_string(s::String, normalform::Symbol)`: new function to perform standard Unicode normalization, e.g. `normalize_string(s, :NFKC)` for normal form KC.
- `normalize_string(s::String; keywords...)`: new function to perform various normalization transformations, including case folding, diacritical stripping, newline conversion, etcetera.

cc: @jiahao
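(As a cross-check of what normal form KC does: compatibility composition also folds presentation forms into their plain equivalents, e.g. the "fi" ligature and circled digits. Illustrated here with Python's `unicodedata` rather than the patch's `normalize_string`:)

```python
import unicodedata

# NFKC applies compatibility mappings: U+FB01 (LATIN SMALL LIGATURE FI)
# becomes the two letters "fi", and U+2460 (CIRCLED DIGIT ONE) becomes "1".
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi
print(unicodedata.normalize("NFKC", "\u2460"))  # 1
```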