Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace isnumber(), etc. with a single function #14347

Open
nalimilan opened this issue Dec 10, 2015 · 33 comments
Open

Replace isnumber(), etc. with a single function #14347

nalimilan opened this issue Dec 10, 2015 · 33 comments
Assignees
Labels
stdlib Julia's standard library strings "Strings!" unicode Related to unicode characters and encodings

Comments

@nalimilan
Copy link
Member

This was mentioned by @JeffBezanson and @ScottPJones at #14340 (comment). It looks like we could replace some or all of isalpha, isalnum, iscntrl, isgraph, islower, isnumber, isprint, ispunct and isupper with a single function testing for a given Unicode character general category. (isspace and isascii cross several categories and must thus be kept; isdigit is also more restrictive than isnumber.)

The simplest API would be something like charcategory(x::Char) -> Symbol. This would force writing e.g. all(c->charcategory(c) == :L, s) to check whether all characters of a string are uppercase, but it would at least have the advantage of clarity (#14156 (comment)), and would be fast as soon as Jeff's work on anonymous functions is merged.

An intermediate solution would be to keep the most commonly used functions like isnumber, islower, isprint and isupper, but deprecate isalpha, isalnum, isgraph, ispunct and iscntrl. Maybe even isupper and islower could be deprecated: isn't it more common to use lowercase or uppercase if you care about the result, or to check for a specific character? Actually, even isnumber might be deprecated, as it is easily confused with isdigit, which I suspect is the most commonly needed test (for parsing and conversion).

@ScottPJones
Copy link
Contributor

Do you not like ischartype(class, x::Union{Char, AbstractString}), which is shorter and much easier to understand for new people for the string case than using all? (it could still be implemented using all, and not need an anonymous function either). I also think ischartype(Lower, c) is clearer than charcategory(c) == :L, and can be a lot faster as well.

@nalimilan
Copy link
Member Author

A function returning the char category (AFAIK, class isn't a defined Unicode term) is more general. If the symbol comparison is slow, we could use an enum instead.

@ScottPJones
Copy link
Contributor

How is it more necessarily more general?
Using types allows you do express things much more powerfully, and extend them later.
For example:

abstract CharCategory
abstract Alphabetic <: CharCategory
abstract LowerCase <: Alphabetic
abstract UpperCase <: Alphabetic
abstract UpperCategory <: UpperCase
abstract TitleCase     <: UpperCase
abstract MathLetter   <: Alphabetic
...
const global cat_to_type = [UpperCategory, LowerCase, TileCase, MathLetter, OtherLetter, ...]

Also, it turns out that using all in that way to see if a string is considered lowercase or uppercase is not actually correct, see the Unicode documentation:

Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.

@ScottPJones
Copy link
Contributor

class there was just late night (early morning) screwup on my part, you are right, should have been category.

@ScottPJones
Copy link
Contributor

ischartype{T<:CharCategory}(::Type{T}, ch:Char) is what I really am proposing.

@nalimilan
Copy link
Member Author

What I meant by "general" is that it allows us to save the result (category), e.g. to print it for debugging purposes, or to list all categories encountered in a string. With ischartype, you would need to check for each category separately. It's like forcing you to write issqrt(x, 2) instead of sqrt(x) == 2.

Also, it turns out that using all in that way to see if a string is considered lowercase or uppercase is not actually correct, see the Unicode documentation:
Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.

Well, you can check whether the characters are in category Lu or Lm or Lo.

We could also make the function even more general, say charprop{T<:UnicodeProperty}(::Type{T}, c::Char), with possible types CharCategory, CharCasing, CharScript, CharWhitespace... It would return a boolean or a symbol/enum depending on the requested information.

@ScottPJones
Copy link
Contributor

There already is a category_code(c) function (not exported, you'd need to use Base.UTF8proc.category_code(c)), that returns an integer 0-29, along with constants such as
UTF8PROC_CATEGORY_LU. Those should probably be redone as an enum.

ischartype wouldn't force you to check each category separately at all - in fact, it makes things simpler, because you can use type inheritance as shown above.
ischartype(Alphabetic, x) for example, or ischartype(UpperCase, x) (returns true for titlecase or uppercase characters).

I also didn't say that you'd just have ischartype, and I do really like your charprop function idea.
So, charprop(CharCategory, ch) would return the abstract category type?

So, if I had a set of UnicodePropertys, as you listed, and the set of types for CharCategory as I proposed above, along with your nice charprop function and my convenience ischartype function
for categories, do you think it might possibly be accepted as a PR?

(I'm not sold on the name for ischartype, if anyone can suggest something better, but not too long,
I'd appreciate it! Let the bikeshedding begin!)

@ScottPJones
Copy link
Contributor

Turns out, ischartype(UpperCase,x) would be the same as
isa(charprop(CharCategory, c), UpperCase), just shorter.

ischartype(cat::CharCategory, c::Char) = isa(charprop(CharCategory, c), cat)`

@nalimilan
Copy link
Member Author

Yes, that was my point. I don't think we need both, especially since you wouldn't write isa(charprop(CharCategory, c), cat) but charprop(CharCategory, c) <: cat.

@ScottPJones
Copy link
Contributor

OK then, would you support a PR where I did all that I talked about above, less the addition of ischartype? If so, I'll go ahead and spend the time on it.

@nalimilan
Copy link
Member Author

I would support it, but let's ask about the opinion of others...

@eschnett
Copy link
Contributor

A common use case is a question such as "is this character an ASCII letter or digit?", as in isascii(ch) && (isalpha(ch) || isdigit(ch)). Can this be efficiently expressed with the new function, maybe more efficiently that currently?

@nalimilan
Copy link
Member Author

Doesn't look like this use case would really gain from the present proposal. isascii and isdigit are quite trivial to compute (they don't call utf8proc at all), and would be kept anyway. Also, calling isalpha is quite wasteful for ASCII characters. So I'd better write this as '0' <= c <= '9' || 'a' <= c <= 'z' || 'A' <= c <= 'Z'.

That said, I guess it would be written inefficiently as something like isascii(ch) && (charprop(CharCategory, c) in (Unicode.Ll, Unicode.Lu, Unicode.Nd). We could keep isalpha to make this shorter, but it would also prompt people to write slow code without realizing.

@ScottPJones
Copy link
Contributor

isalpha(ch) = charprop(CharCategory, c) <: Alphabetic
I think that would be fast - it uses the type hierarchy of the CharCategory types.

@nalimilan
Copy link
Member Author

That's fast, but not as fast as a plain comparison on integer values, which is all you need when working with ASCII strings. In the scenario @eschnett showed, computing the Unicode category is a waste (as I said in my previous comment).

@ScottPJones
Copy link
Contributor

Right, that is for the non ASCIIString case

@JeffBezanson JeffBezanson added needs decision A decision on this change is needed unicode Related to unicode characters and encodings labels Dec 10, 2015
@stevengj
Copy link
Member

I'm not sure what problem is being solved here.

@StefanKarpinski StefanKarpinski removed the needs decision A decision on this change is needed label Sep 13, 2016
@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016
@StefanKarpinski StefanKarpinski self-assigned this Sep 13, 2016
@JeffBezanson JeffBezanson modified the milestones: 1.0, 0.6.0 Jan 3, 2017
@StefanKarpinski
Copy link
Member

Let's not do this.

@tkelman
Copy link
Contributor

tkelman commented Jul 20, 2017

The types version of this doesn't help anything, but doing it with symbols as originally proposed would work

@StefanKarpinski
Copy link
Member

True, but doesn't really seem worth it.

@nalimilan
Copy link
Member Author

How about deprecating some of these functions, as mentioned in the description? Things like isnumber are really confusing since often you want to use isdigit instead. I suspect we'll want to provide a function to extract the Unicode general category of a character at some point, as possible in many languages. If we do that, we'll have two ways of getting the same information. I'd rather shrink the API as much as possible for 1.0, and add functions later if they turn out to be useful.

@tkelman
Copy link
Contributor

tkelman commented Jul 23, 2017

Agreed, these names are pretty obscure and not that widely used. Better to not be stuck with them for all of 1.x

@stevengj
Copy link
Member

stevengj commented Jul 23, 2017

Note that we do have Base.UTF8proc.category_* to get category codes, although we shouldn't export this without cleaning it up, probably to return an enum.

Even if you export a function to get category codes, however, many of the isfoo functions check for several categories (which is annoying to do by hand, especially since most users don't understand Unicode well enough to implement a function like isprint from category codes) or only check a subset of the categories (like iscntrl).

See also #5939, and #8233. Note also that, if I recall correctly from the discussion, there are several modern languages that choose to have these functions, e.g. Go.

@tkelman
Copy link
Contributor

tkelman commented Jul 23, 2017

Can we move them under a namespace? They're not especially generic things that you would ever call on anything but a Char or AbstractString. Go's spellings of these are also notably more obvious than ours.

They're also not all that widely-used, looks like on the order of 5-15 packages depending which one.

@nalimilan
Copy link
Member Author

Moving these to a Unicode module sounds like a good idea. Note that the Go functions are only provided by a package. OTC, Rust includes them in the stdlib (though due to their classic OOP approach these functions are grouped under the char type).

What's even more interesting is that Swift has removed these functions in version 3 in favor of patterns like c in CharacterSet.alphanumerics. This is very similar to the solution proposed here. We could provide meta-categories like Swift, i.e. one for each of the existing functions, so that users do not have to list Unicode categories one by one. The advantage is that with a unified pattern, you can query a lot of different properties (not only categories).

@StefanKarpinski
Copy link
Member

All character class stuff should be moved into a Unicode stdlib package, including all these functions. The Unicode prefix will also make it clear that these are string-specific functions.

@StefanKarpinski StefanKarpinski added excision Removal of code from Base or the repository stdlib Julia's standard library strings "Strings!" labels Nov 20, 2017
@JeffBezanson
Copy link
Member

That package can also re-export stuff from Base.UTF8proc.

@nalimilan
Copy link
Member Author

See #25021.

@nalimilan
Copy link
Member Author

Functions have been moved to the Unicode stdlib module. Keeping this issue open since it would still make sense to provide an API to get Unicode character properties like general category.

@stevengj
Copy link
Member

I think an API to get the category code should be a separate issue.

@StefanKarpinski
Copy link
Member

We can iterate on the API of the Unicode package in the future, fortunately.

@Sacha0
Copy link
Member

Sacha0 commented Dec 19, 2017

Seems this issue's 1.0 items have been completed?

@StefanKarpinski StefanKarpinski modified the milestones: 1.0, 1.x Dec 19, 2017
@StefanKarpinski StefanKarpinski removed the excision Removal of code from Base or the repository label Dec 19, 2017
@StefanKarpinski
Copy link
Member

Yup, kicked to the 1.x milestone for further iteration.

@DilumAluthge DilumAluthge removed this from the 1.x milestone Mar 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Julia's standard library strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

9 participants