Description
Right now, `isalpha` simply checks whether the given character is in one of the L (letter) categories: `isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO`. This is almost correct, but the Unicode `Alphabetic` property covers more than these categories: it also includes the `Nl` category (letter numbers, e.g. Roman numerals) and, crucially, a set of characters designated `Other_Alphabetic`, which live in `Mc` and `Mn` (spacing and non-spacing marks). Many code points in Indic texts, e.g. most occurrences of vowels in Tamil text, are characters found in this `Other_Alphabetic` list.
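To make the failure concrete, here is a minimal sketch that applies the category check quoted above to the Tamil word used later in this issue. It relies on `Base.Unicode.category_code` and the `UTF8PROC_CATEGORY_*` constants, which are unexported internals and may change; `letter_only` is a hypothetical name for illustration, not an existing function.

```julia
# Unexported internals of Base.Unicode; used here only for illustration.
using Base.Unicode: category_code, UTF8PROC_CATEGORY_LU, UTF8PROC_CATEGORY_LO,
                    UTF8PROC_CATEGORY_MC, UTF8PROC_CATEGORY_MN

# Hypothetical stand-in for the check quoted above (L categories only).
letter_only(c::AbstractChar) =
    UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO

# True if the character is a spacing (Mc) or non-spacing (Mn) mark.
ismark(c::AbstractChar) =
    category_code(c) in (UTF8PROC_CATEGORY_MC, UTF8PROC_CATEGORY_MN)

word = "அதிகாலை"
for c in word
    println(repr(c), ": letter_only = ", letter_only(c), ", mark = ", ismark(c))
end

# The vowel signs are marks, so the word as a whole fails the check.
println(all(letter_only, word))
```

The consonants and the independent vowel pass, while the dependent vowel signs (such as U+0BBF) are classified as marks and are rejected.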
Among the few other (programming) languages I tried this check on, Ruby (`\p{Alpha}`) and Java (`Character.isAlphabetic`) get this right (the Java documentation explicitly explains the `Alphabetic` property), while Python 2 and 3 (`"அதிகாலை".isalpha()`) both seem to get it wrong. Perl also correctly identifies the `Other_Alphabetic` characters under `\p{Alpha}` (though it also seems to have additional magic on top).
According to the Unicode PropList, `Other_Alphabetic` apparently applies to 1300 code points, so there are letters from quite a few scripts that currently fail `isalpha`.
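A count like this can be reproduced by parsing the `PropList.txt` format. The sketch below is one way to do it; `property_ranges` is a hypothetical helper, and the embedded `sample` is a small hand-copied excerpt in the file's format (a few Tamil entries plus one unrelated line), not the full data.

```julia
# Collect the code point ranges tagged with `prop` from PropList.txt-style lines.
function property_ranges(lines, prop::AbstractString)
    ranges = UnitRange{UInt32}[]
    for line in lines
        body = strip(first(split(line, '#')))      # drop the trailing comment
        isempty(body) && continue                  # blank or comment-only line
        fields = split(body, ';')
        length(fields) == 2 || continue
        strip(fields[2]) == prop || continue       # keep only the wanted property
        bounds = split(strip(fields[1]), "..")     # "XXXX" or "XXXX..YYYY"
        lo = parse(UInt32, bounds[1]; base = 16)
        hi = parse(UInt32, bounds[end]; base = 16)
        push!(ranges, lo:hi)
    end
    return ranges
end

# Illustrative excerpt only; the real file has many more entries.
sample = [
    "# comment lines and other properties are skipped",
    "0020          ; White_Space # Zs       SPACE",
    "0B82          ; Other_Alphabetic # Mn       TAMIL SIGN ANUSVARA",
    "0BBE..0BBF    ; Other_Alphabetic # Mc   [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN I",
    "0BC0          ; Other_Alphabetic # Mn       TAMIL VOWEL SIGN II",
]

ranges = property_ranges(sample, "Other_Alphabetic")
total = sum(length, ranges; init = 0)
println(total, " code points in ", length(ranges), " ranges")
```

Against a locally downloaded copy of the file (the path is an assumption), the same helper could be called as `property_ranges(eachline("PropList.txt"), "Other_Alphabetic")`.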
I'm not sure whether `utf8proc` supports querying for either the `Alphabetic` or the `Other_Alphabetic` property (the `utf8proc_property_struct` doesn't seem to have either), so this might have to be implemented there first. Also possibly related to #25653 with regard to implementation.
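Until such a query exists, the check described in this issue could be approximated with a lookup table. The following is only a sketch under stated assumptions: `isalphabetic_sketch` and `OTHER_ALPHABETIC_SAMPLE` are hypothetical names, the table holds just the Tamil `Other_Alphabetic` ranges copied by hand, and the rule follows the description in this issue (L categories, `Nl`, `Other_Alphabetic`) rather than the full derivation in the Unicode data files. It also uses unexported `Base.Unicode` internals.

```julia
# Unexported internals of Base.Unicode; used here only for illustration.
using Base.Unicode: category_code, UTF8PROC_CATEGORY_LU, UTF8PROC_CATEGORY_LO,
                    UTF8PROC_CATEGORY_NL

# Hypothetical, hand-copied sample: Tamil Other_Alphabetic ranges only.
# A real implementation would generate the full table from PropList.txt.
const OTHER_ALPHABETIC_SAMPLE = [
    0x0B82:0x0B82,   # sign anusvara
    0x0BBE:0x0BC2,   # vowel signs AA..UU
    0x0BC6:0x0BC8,   # vowel signs E..AI
    0x0BCA:0x0BCC,   # vowel signs O..AU
    0x0BD7:0x0BD7,   # AU length mark
]

in_other_alphabetic(c::AbstractChar) =
    any(r -> UInt32(c) in r, OTHER_ALPHABETIC_SAMPLE)

# Letters (Lu..Lo), letter numbers (Nl), or the Other_Alphabetic table.
function isalphabetic_sketch(c::AbstractChar)
    code = category_code(c)
    return UTF8PROC_CATEGORY_LU <= code <= UTF8PROC_CATEGORY_LO ||
           code == UTF8PROC_CATEGORY_NL ||
           in_other_alphabetic(c)
end

println(all(isalphabetic_sketch, "அதிகாலை"))
```

A linear scan over ranges is fine for a sample this small; a full table would more likely use a sorted array with binary search, or live in `utf8proc` itself as suggested above.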