Replace isnumber(), etc. with a single function #14347

nalimilan · 2015-12-10T09:58:01Z

This was mentioned by @JeffBezanson and @ScottPJones at #14340 (comment). It looks like we could replace some or all of isalpha, isalnum, iscntrl, isgraph, islower, isnumber, isprint, ispunct and isupper with a single function testing for a given Unicode character general category. (isspace and isascii cross several categories and must thus be kept; isdigit is also more restrictive than isnumber.)

The simplest API would be something like charcategory(x::Char) -> Symbol. This would force writing e.g. all(c->charcategory(c) == :L, s) to check whether all characters of a string are uppercase, but it would at least have the advantage of clarity (#14156 (comment)), and would be fast as soon as Jeff's work on anonymous functions is merged.

An intermediate solution would be to keep the most commonly used functions like isnumber, islower, isprint and isupper, but deprecate isalpha, isalnum, isgraph, ispunct and iscntrl. Maybe even isupper and islower could be deprecated: isn't it more common to use lowercase or uppercase if you care about the result, or to check for a specific character? Actually, even isnumber might be deprecated, as it is easily confused with isdigit, which I suspect is the most commonly needed test (for parsing and conversion).

The text was updated successfully, but these errors were encountered:

ScottPJones · 2015-12-10T10:17:53Z

Do you not like ischartype(class, x::Union{Char, AbstractString}), which is shorter and much easier to understand for new people for the string case than using all? (it could still be implemented using all, and not need an anonymous function either). I also think ischartype(Lower, c) is clearer than charcategory(c) == :L, and can be a lot faster as well.

nalimilan · 2015-12-10T10:19:57Z

A function returning the char category (AFAIK, class isn't a defined Unicode term) is more general. If the symbol comparison is slow, we could use an enum instead.

ScottPJones · 2015-12-10T10:41:26Z

How is it more necessarily more general?
Using types allows you do express things much more powerfully, and extend them later.
For example:

abstract CharCategory
abstract Alphabetic <: CharCategory
abstract LowerCase <: Alphabetic
abstract UpperCase <: Alphabetic
abstract UpperCategory <: UpperCase
abstract TitleCase     <: UpperCase
abstract MathLetter   <: Alphabetic
...
const global cat_to_type = [UpperCategory, LowerCase, TileCase, MathLetter, OtherLetter, ...]

Also, it turns out that using all in that way to see if a string is considered lowercase or uppercase is not actually correct, see the Unicode documentation:

Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.

ScottPJones · 2015-12-10T10:42:46Z

class there was just late night (early morning) screwup on my part, you are right, should have been category.

ScottPJones · 2015-12-10T10:44:11Z

ischartype{T<:CharCategory}(::Type{T}, ch:Char) is what I really am proposing.

nalimilan · 2015-12-10T13:41:25Z

What I meant by "general" is that it allows us to save the result (category), e.g. to print it for debugging purposes, or to list all categories encountered in a string. With ischartype, you would need to check for each category separately. It's like forcing you to write issqrt(x, 2) instead of sqrt(x) == 2.

Also, it turns out that using all in that way to see if a string is considered lowercase or uppercase is not actually correct, see the Unicode documentation:
Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.

Well, you can check whether the characters are in category Lu or Lm or Lo.

We could also make the function even more general, say charprop{T<:UnicodeProperty}(::Type{T}, c::Char), with possible types CharCategory, CharCasing, CharScript, CharWhitespace... It would return a boolean or a symbol/enum depending on the requested information.

ScottPJones · 2015-12-10T14:48:06Z

There already is a category_code(c) function (not exported, you'd need to use Base.UTF8proc.category_code(c)), that returns an integer 0-29, along with constants such as
UTF8PROC_CATEGORY_LU. Those should probably be redone as an enum.

ischartype wouldn't force you to check each category separately at all - in fact, it makes things simpler, because you can use type inheritance as shown above.
ischartype(Alphabetic, x) for example, or ischartype(UpperCase, x) (returns true for titlecase or uppercase characters).

I also didn't say that you'd just have ischartype, and I do really like your charprop function idea.
So, charprop(CharCategory, ch) would return the abstract category type?

So, if I had a set of UnicodePropertys, as you listed, and the set of types for CharCategory as I proposed above, along with your nice charprop function and my convenience ischartype function
for categories, do you think it might possibly be accepted as a PR?

(I'm not sold on the name for ischartype, if anyone can suggest something better, but not too long,
I'd appreciate it! Let the bikeshedding begin!)

ScottPJones · 2015-12-10T14:53:32Z

Turns out, ischartype(UpperCase,x) would be the same as
isa(charprop(CharCategory, c), UpperCase), just shorter.

ischartype(cat::CharCategory, c::Char) = isa(charprop(CharCategory, c), cat)`

nalimilan · 2015-12-10T14:56:47Z

Yes, that was my point. I don't think we need both, especially since you wouldn't write isa(charprop(CharCategory, c), cat) but charprop(CharCategory, c) <: cat.

ScottPJones · 2015-12-10T15:30:38Z

OK then, would you support a PR where I did all that I talked about above, less the addition of ischartype? If so, I'll go ahead and spend the time on it.

nalimilan · 2015-12-10T15:36:56Z

I would support it, but let's ask about the opinion of others...

eschnett · 2015-12-10T16:36:09Z

A common use case is a question such as "is this character an ASCII letter or digit?", as in isascii(ch) && (isalpha(ch) || isdigit(ch)). Can this be efficiently expressed with the new function, maybe more efficiently that currently?

nalimilan · 2015-12-10T17:02:00Z

Doesn't look like this use case would really gain from the present proposal. isascii and isdigit are quite trivial to compute (they don't call utf8proc at all), and would be kept anyway. Also, calling isalpha is quite wasteful for ASCII characters. So I'd better write this as '0' <= c <= '9' || 'a' <= c <= 'z' || 'A' <= c <= 'Z'.

That said, I guess it would be written inefficiently as something like isascii(ch) && (charprop(CharCategory, c) in (Unicode.Ll, Unicode.Lu, Unicode.Nd). We could keep isalpha to make this shorter, but it would also prompt people to write slow code without realizing.

ScottPJones · 2015-12-10T17:44:18Z

isalpha(ch) = charprop(CharCategory, c) <: Alphabetic
I think that would be fast - it uses the type hierarchy of the CharCategory types.

nalimilan · 2015-12-10T17:55:44Z

That's fast, but not as fast as a plain comparison on integer values, which is all you need when working with ASCII strings. In the scenario @eschnett showed, computing the Unicode category is a waste (as I said in my previous comment).

ScottPJones · 2015-12-10T18:05:47Z

Right, that is for the non ASCIIString case

stevengj · 2015-12-15T15:42:13Z

I'm not sure what problem is being solved here.

StefanKarpinski · 2017-07-20T19:08:45Z

Let's not do this.

tkelman · 2017-07-20T19:33:41Z

The types version of this doesn't help anything, but doing it with symbols as originally proposed would work

StefanKarpinski · 2017-07-20T22:12:40Z

True, but doesn't really seem worth it.

nalimilan · 2017-07-22T21:10:31Z

How about deprecating some of these functions, as mentioned in the description? Things like isnumber are really confusing since often you want to use isdigit instead. I suspect we'll want to provide a function to extract the Unicode general category of a character at some point, as possible in many languages. If we do that, we'll have two ways of getting the same information. I'd rather shrink the API as much as possible for 1.0, and add functions later if they turn out to be useful.

tkelman · 2017-07-23T11:09:53Z

Agreed, these names are pretty obscure and not that widely used. Better to not be stuck with them for all of 1.x

stevengj · 2017-07-23T11:35:11Z

Note that we do have Base.UTF8proc.category_* to get category codes, although we shouldn't export this without cleaning it up, probably to return an enum.

Even if you export a function to get category codes, however, many of the isfoo functions check for several categories (which is annoying to do by hand, especially since most users don't understand Unicode well enough to implement a function like isprint from category codes) or only check a subset of the categories (like iscntrl).

See also #5939, and #8233. Note also that, if I recall correctly from the discussion, there are several modern languages that choose to have these functions, e.g. Go.

tkelman · 2017-07-23T12:39:35Z

Can we move them under a namespace? They're not especially generic things that you would ever call on anything but a Char or AbstractString. Go's spellings of these are also notably more obvious than ours.

They're also not all that widely-used, looks like on the order of 5-15 packages depending which one.

nalimilan · 2017-07-23T22:41:51Z

Moving these to a Unicode module sounds like a good idea. Note that the Go functions are only provided by a package. OTC, Rust includes them in the stdlib (though due to their classic OOP approach these functions are grouped under the char type).

What's even more interesting is that Swift has removed these functions in version 3 in favor of patterns like c in CharacterSet.alphanumerics. This is very similar to the solution proposed here. We could provide meta-categories like Swift, i.e. one for each of the existing functions, so that users do not have to list Unicode categories one by one. The advantage is that with a unified pattern, you can query a lot of different properties (not only categories).

StefanKarpinski · 2017-11-20T18:54:56Z

All character class stuff should be moved into a Unicode stdlib package, including all these functions. The Unicode prefix will also make it clear that these are string-specific functions.

JeffBezanson · 2017-11-20T21:26:45Z

That package can also re-export stuff from Base.UTF8proc.

nalimilan · 2017-12-10T22:01:49Z

See #25021.

nalimilan · 2017-12-13T08:32:42Z

Functions have been moved to the Unicode stdlib module. Keeping this issue open since it would still make sense to provide an API to get Unicode character properties like general category.

stevengj · 2017-12-13T14:59:43Z

I think an API to get the category code should be a separate issue.

StefanKarpinski · 2017-12-13T16:58:40Z

We can iterate on the API of the Unicode package in the future, fortunately.

Sacha0 · 2017-12-19T01:32:12Z

Seems this issue's 1.0 items have been completed?

StefanKarpinski · 2017-12-19T03:01:09Z

Yup, kicked to the 1.x milestone for further iteration.

JeffBezanson added needs decision A decision on this change is needed unicode Related to unicode characters and encodings labels Dec 10, 2015

ScottPJones mentioned this issue Dec 13, 2015

Unicode character properties #14387

Closed

StefanKarpinski removed the needs decision A decision on this change is needed label Sep 13, 2016

StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016

StefanKarpinski self-assigned this Sep 13, 2016

JeffBezanson modified the milestones: 1.0, 0.6.0 Jan 3, 2017

StefanKarpinski closed this as completed Jul 20, 2017

yuyichao mentioned this issue Sep 19, 2017

isnumber(::Number) undefined #23767

Closed

StefanKarpinski reopened this Sep 19, 2017

StefanKarpinski mentioned this issue Nov 3, 2017

wip: string overhaul #24439

Closed

StefanKarpinski added excision Removal of code from Base or the repository stdlib Julia's standard library strings "Strings!" labels Nov 20, 2017

stevengj mentioned this issue Dec 9, 2017

Deprecate isnumber() in favor of isnumeric() #25008

Closed

nalimilan mentioned this issue Dec 10, 2017

Move Unicode-related functions to new Unicode stdlib package #25021

Merged

StefanKarpinski modified the milestones: 1.0, 1.x Dec 19, 2017

StefanKarpinski removed the excision Removal of code from Base or the repository label Dec 19, 2017

nalimilan mentioned this issue Jan 4, 2018

Rename stdlib Unicode to Strings? #25394

Closed

DilumAluthge removed this from the 1.x milestone Mar 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace isnumber(), etc. with a single function #14347

Replace isnumber(), etc. with a single function #14347

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

eschnett commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

stevengj commented Dec 15, 2015

StefanKarpinski commented Jul 20, 2017

tkelman commented Jul 20, 2017

StefanKarpinski commented Jul 20, 2017

nalimilan commented Jul 22, 2017

tkelman commented Jul 23, 2017

stevengj commented Jul 23, 2017 •

edited by nalimilan

Loading

tkelman commented Jul 23, 2017 •

edited

Loading

nalimilan commented Jul 23, 2017

StefanKarpinski commented Nov 20, 2017

JeffBezanson commented Nov 20, 2017

nalimilan commented Dec 10, 2017

nalimilan commented Dec 13, 2017

stevengj commented Dec 13, 2017

StefanKarpinski commented Dec 13, 2017

Sacha0 commented Dec 19, 2017

StefanKarpinski commented Dec 19, 2017

Replace isnumber(), etc. with a single function #14347

Replace isnumber(), etc. with a single function #14347

Comments

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

eschnett commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

nalimilan commented Dec 10, 2015

ScottPJones commented Dec 10, 2015

stevengj commented Dec 15, 2015

StefanKarpinski commented Jul 20, 2017

tkelman commented Jul 20, 2017

StefanKarpinski commented Jul 20, 2017

nalimilan commented Jul 22, 2017

tkelman commented Jul 23, 2017

stevengj commented Jul 23, 2017 • edited by nalimilan Loading

tkelman commented Jul 23, 2017 • edited Loading

nalimilan commented Jul 23, 2017

StefanKarpinski commented Nov 20, 2017

JeffBezanson commented Nov 20, 2017

nalimilan commented Dec 10, 2017

nalimilan commented Dec 13, 2017

stevengj commented Dec 13, 2017

StefanKarpinski commented Dec 13, 2017

Sacha0 commented Dec 19, 2017

StefanKarpinski commented Dec 19, 2017

stevengj commented Jul 23, 2017 •

edited by nalimilan

Loading

tkelman commented Jul 23, 2017 •

edited

Loading