isalpha should use Unicode property Alphabetic; rename to isletter #26932

@digital-carver

Right now, it simply checks whether the given character is in one of the L categories (isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO). This is almost correct, except that the Unicode Alphabetic property covers these categories, the Nl category (letter numbers, e.g. Roman numerals), and crucially a set of characters defined as Other_Alphabetic that mostly live in Mc and Mn (spacing and non-spacing marks). A lot of code points in Indic texts, for example most occurrences of vowels in Tamil texts, are characters found in this Other_Alphabetic list.
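To make the gap concrete, here is a minimal sketch in Python (used only because its unicodedata module exposes general categories; it is not the Julia or utf8proc implementation). The names in_letter_category, is_alphabetic_sketch and OTHER_ALPHABETIC_SAMPLE are hypothetical, and the range table is a small hand-copied subset covering Tamil vowel signs only, which should be verified against PropList.txt before being relied on.

```python
import unicodedata

# Assumption: a tiny, hand-copied subset of Other_Alphabetic (Tamil vowel
# signs only). A real implementation would load the full list from PropList.txt.
OTHER_ALPHABETIC_SAMPLE = [
    (0x0BBE, 0x0BC2),
    (0x0BC6, 0x0BC8),
    (0x0BCA, 0x0BCC),
]

# General categories that the Alphabetic property includes wholesale.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def in_letter_category(ch: str) -> bool:
    """Mirror the current check: the general category is one of the L* categories."""
    return unicodedata.category(ch).startswith("L")


def is_alphabetic_sketch(ch: str) -> bool:
    """Approximate Alphabetic as L* + Nl + the sample Other_Alphabetic ranges."""
    if unicodedata.category(ch) in ALPHABETIC_CATEGORIES:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in OTHER_ALPHABETIC_SAMPLE)


# Print the category and both verdicts for each code point of a Tamil word.
word = "அதிகாலை"
for ch in word:
    print(
        f"U+{ord(ch):04X} {unicodedata.category(ch)} "
        f"letter={in_letter_category(ch)} alphabetic={is_alphabetic_sketch(ch)}"
    )
```

The vowel signs in the word are marks rather than L-category characters, so the category-only check rejects them while the property-based sketch accepts them.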

Among the few other (programming) languages I tried this check on, Ruby (\p{Alpha}) and Java (Character.isAlphabetic) get this right (the Java documentation explicitly explains the Alphabetic property). Python 2 and 3 both seem to get it wrong ("அதிகாலை".isalpha()). Perl also correctly identifies the Other_Alphabetic characters under \p{Alpha} (though it also seems to have additional magic on top).

Other_Alphabetic apparently covers 1300 code points according to the Unicode PropList, so there are letters from quite a few scripts that currently fail isalpha.
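A count like this can be checked by parsing PropList.txt directly. The following is a rough sketch, assuming the usual UCD line layout (code point or range, a semicolon, the property name, then a # comment). SAMPLE is a short excerpt in that format for illustration, not the full file, and parse_ranges and count_code_points are hypothetical helper names.

```python
# Short excerpt in PropList.txt format, for illustration only.
SAMPLE = """\
0345          ; Other_Alphabetic # Mn       COMBINING GREEK YPOGEGRAMMENI
05B0..05BD    ; Other_Alphabetic # Mn  [14] HEBREW POINT SHEVA..HEBREW POINT METEG
0020          ; White_Space # Zs       SPACE
"""


def parse_ranges(text: str, prop: str) -> list[tuple[int, int]]:
    """Return inclusive (start, end) code point ranges listed for `prop`."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        start, _, end = fields[0].partition("..")
        # A single code point has no "..", so it is its own range end.
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges


def count_code_points(ranges: list[tuple[int, int]]) -> int:
    """Total number of code points covered by the inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


ranges = parse_ranges(SAMPLE, "Other_Alphabetic")
print(count_code_points(ranges))
```

Running the same two functions over the full PropList.txt for a given Unicode version would give the actual total for that version.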

I'm not sure if utf8proc supports querying for either the Alphabetic or the Other_Alphabetic property (the utf8proc_property_struct doesn't seem to have either property), so this might have to be implemented there first. Also, possibly related to #25653 with regard to implementation.

Labels: unicode (Related to unicode characters and encodings)