add graphemes(s) function to iterate over string graphemes #9261

stevengj · 2014-12-07T02:36:28Z

utf8proc/libmojibake includes a feature to break a string up into graphemes (as defined by UAX#29). This PR exports a new function graphemes(s::AbstractString) to expose this functionality as an iterator over substrings of s.

e.g. collect(graphemes("b̀lahβlahb̂láh")) yields SubString{UTF8String}["b̀","l","a","h","β","l","a","h","b̂","l","á","h"]. Note that these are strings, not characters, because graphemes may consist of multiple codepoints (e.g. "b̀" is "\u0062\u0300" in any normalization).

(This may be useful e.g. in the REPL if we want the arrow keys etcetera to work with graphemes as recommended by UAX#29.)

cc: @jiahao, @Keno

stevengj · 2014-12-07T02:40:26Z

Note also that in order to implement this cleanly I had to correct an inconsistency in UTF8String slicing: previously s[i:j] allowed i and j to be invalid character indices (effectively by moving them right and left, respectively, to the next valid character indices) except when j was > endof(s) in which case it threw a BoundsError(). I changed this to allow any 0 < j ≤ length(s.data) and added a test case.

StefanKarpinski · 2014-12-07T03:45:38Z

Clearly useful and should be exposed. @JeffBezanson, I thought we had changed this so that the indices in range indexing had to be valid character offsets. Did I misremember that or did we not pull the trigger?

ivarne · 2014-12-07T04:41:00Z

We didn't pull the trigger last time I raised the IndexError issue in #7811, but that is a separate issue.

ivarne · 2014-12-07T04:45:08Z

base/utf8.jl

@@ -103,7 +103,7 @@ function getindex(s::UTF8String, r::UnitRange{Int})
    if !is_utf8_start(d[i])
        i = nextind(s,i)
    end
-    if j > endof(s)
+    if j > length(d)


~~Isn't length the number of code points? I think we want sizeof.~~

Sorry, it's length(d) not length(s), so this is correct, but confusing.

tkelman · 2014-12-08T00:46:57Z

I keep trying to click re-build PR on this in appveyor (status.julialang.org was down, it's back now) but it doesn't want to work. If you push to this again it should fire off another build.

edit: and the latest build timed out due to #7942, but at least appveyor is letting me click rebuild this time

stevengj · 2014-12-08T15:40:18Z

Now that I look at this more closely, I think that utf8proc's approach of allocating a new string with 0xff beween graphemes is horribly inefficient. No normalization etc. is required for grapheme detection, so all you really need is an isgraphemepair(c1::Char, c2::Char) function that returns whether a pair of Chars are in the same grapheme, and then you can build nextgraphemeind(s, i) and a graphemes(s) iterator on top of that.

(This will require us to re-implement utf8proc's grapheme code, but that's actually not such a big deal: we have to do that anyway since utf8proc's grapheme rules are way out of date as described in JuliaStrings/utf8proc#19.)

nalimilan · 2014-12-08T15:50:42Z

@stevengj Would that mean that string iterators could be made to work on graphemes and still be reasonably efficient? (Cf. #6165 (comment) and subsequent comments.)

StefanKarpinski · 2014-12-08T16:03:06Z

The main issue there would be creating a new string object for every iteration, which currently would be very expensive. With the string work I'm doing it could be much cheaper – I presume we can be certain that no grapheme is ever longer than 7 bytes (or 15 bytes on 64-bit platforms)? Still, I'm not sure if this is a good idea. I'd be interested in comparing this to what Go does (they invented UTF-8 after all) – I believe that have something called a "rune" which is not quite a code point. Anyone familiar with this?

stevengj · 2014-12-08T16:34:33Z

@StefanKarpinski, actually, the graphemes iterator returns substrings, because the caller might want to know the offsets in the original string; since SubString is an immutable struct of just a pointer and a couple integers, presumably it doesn't have to be heap-allocated?

With my suggested non-utf8proc strategy, where performance is critical you could also use nextgraphemeind(s, i) and prevgraphemeind(s, i) and just work with indices.

nalimilan · 2014-12-08T16:34:51Z

No, but reading http://blog.golang.org/strings it seems that a rune "means exactly the same as "code point", with one interesting addition", which seems to be that rune is a type alias for int32. And Go iterates over runes.

See also http://blog.golang.org/normalization for an explanation of iteration and normalization. This page notes that the Norm package provides an iterator over normalized characters. That page also says that the Unicode standard defines a Stream Safe text format, with a limit of 30 codepoints that can be combined into a grapheme, and Go apparently uses that assumption to make some functions more efficient.

But the default iterator does not appear to do any normalization. For example, here's the result of iterating over a Unicode string, including the two forms of "noël":

package main

import "fmt"

func main() {
        for i, rune := range "Hello, 世界 noe\u0308l et noël" {
                fmt.Printf("%d: %c\n", i, rune)
        }
}

0: H
1: e
2: l
3: l
4: o
5: ,
6:  
7: 世
10: 界
13:  
14: n
15: o
16: e
17: ̈
19: l
20:  
21: e
22: t
23:  
24: n
25: o
26: ë
28: l

I find this a bit unfortunate, but there may be good reasons not to normalize. Though I wonder whether at least normalization couldn't be performed on the fly, even if iteration continues to go over codepoints (runes in Go), and not graphemes. Not sure we'd gain much, if some code points still cannot be considered as "user-perceived characters".

stevengj · 2014-12-08T16:46:29Z

Yes, Go rune == Julia Char.

(@nalimilan, I think it would be possible to normalize on the fly in an iterator, albeit a bit hairy to code, but so far there hasn't been a lot of demand for this...so far only three packages are using normalize_string at all. I agree with Go's decision not to normalize iterators by default; in most applications this would be a lot of expense for little purpose, because most applications probably don't care whether a string contains combining characters etc...if you are looking inside the string at all, it is usually for ascii substrings like XML tags and commas. Nor is normalization a substitute for grapheme segmentation.)

nalimilan · 2014-12-13T12:28:57Z

@stevengj Actually, my vote now goes to require explicit choice of the iteration method. I don't think there's a default which makes more sense than another one, it completely depends on what you want to do. But offering a default is dangerous since it means people don't get to think about what they need and the assumptions they can make.

stevengj · 2014-12-15T19:42:11Z

I pushed a new version of this commit that uses the new is_grapheme_break function in libmojibake (which has been submitted upstream; still waiting to hear back) to do grapheme breaking without allocating an auxiliary array with 0xFF marker bytes.

~~I also included a Cbool type, since utf8proc uses bool (from stdbool.h or C++), and I noticed that this was missing. It seems to be a UInt8 on all supported platforms.~~ Whoops, I forgot the discussion in #6529.

…ubstrings) of a string s

stevengj · 2014-12-16T19:10:38Z

Anyone understand the AppVeyor failure? It looks like it just died in the middle of the build.

jiahao · 2014-12-16T19:14:29Z

I believe AppVeyor times out after 40 minutes.

tkelman · 2014-12-16T21:32:41Z

Yeah it's a freeze/timeout. The earlier failure was some breakage from 9266, now fixed in 9366 so I restarted this and other PR's that had been caught by the problem.

Unfortunately we're still getting freezes and timeouts. I'm not sure what the cause of those is, whether it's a Julia problem or the AppVeyor machine is losing its connection or something. Since it only ever happens for 64-bit Julia I suspect it's something in Julia. See http://help.appveyor.com/discussions/problems/1206-inexplicable-timeouts and appveyor/ci#86 - the build usually seems to get stuck either during the system image or the first couple tests, which only takes about 10 minutes to get to. If AppVeyor had a feature like Travis where the build fails if no output is received for some amount of time, we could at least move through our queue faster.

tkelman · 2014-12-17T00:42:40Z

And it's timing out again. Feodor's looking into it, see if he can find anything.

stevengj · 2014-12-17T20:13:52Z

Seems like the tests are passing again. Since everyone seems to be in favor of including this functionality and I've heard no objections to the syntax (modulo larger questions about overhauling s[i] codepoint indexing entirely), I'm going to go ahead and merge.

add graphemes(s) function to iterate over string graphemes

tonyhffong · 2014-12-21T11:21:59Z

Would this be something backported to 0.3? or should I think about freezing TermWin.jl version for 0.3 if I choose to use graphemes?

tkelman · 2014-12-21T11:31:10Z

This relies on functionality that's only in libmojibake, not upstream utf8proc. Progress is being made on upstreaming our changes, as I understand it. But we haven't backported the dependency change so this would be difficult to backport without also backporting all the other mojibake-related changes.

tonyhffong · 2014-12-21T11:49:34Z

Ok, thanks.

matthieugomez · 2015-11-06T19:05:06Z

I'm using graphemes today : why does next return a string s[i:j] rather than SubString(s, i, j)?

stevengj · 2015-11-06T22:05:52Z

@matthieugomez, I don't understand your question. graphemes iterates over substrings, e.g. collect(graphemes("Hello, 世界 noe\u0308l et noël")) returns an Array{SubString{UTF8String},1}.

matthieugomez · 2015-11-06T22:08:04Z

I'm on my phone but first(graphemes("hello")) is ASCIIString not a Substring

On Friday, November 6, 2015, Steven G. Johnson notifications@github.com
wrote:

@matthieugomez https://github.com/matthieugomez, I don't understand
your question. graphemes iterates over substrings, e.g. collect(graphemes("Hello,
世界 noe\u0308l et noël")) returns an Array{SubString{UTF8String},1}.

—
Reply to this email directly or view it on GitHub
#9261 (comment).

This is consistent with eltype() and therefore collect(). Fixes #9261.

nalimilan · 2015-11-06T22:29:05Z

Good catch! See #13903.

stevengj · 2015-11-06T22:32:38Z

Oh, right; I think I thought that s[i:j] returned a substring at the time.

matthieugomez · 2015-11-06T22:33:00Z

Yes so the enumerations generates strings. These strings are converted into
substrings when calling collect (due to your eltype definition). But they
are just substrings of themselves. It seems like the way to fix this is to
replace s[i:j] by SubString(s, i, j) in next

On Friday, November 6, 2015, Matthieu Gomez gomez.matthieu@gmail.com
wrote:

I'm on my phone but first(graphemes("hello")) is ASCIIString not a
Substring

On Friday, November 6, 2015, Steven G. Johnson <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@matthieugomez https://github.com/matthieugomez, I don't understand
your question. graphemes iterates over substrings, e.g. collect(graphemes("Hello,
世界 noe\u0308l et noël")) returns an Array{SubString{UTF8String},1}.

—
Reply to this email directly or view it on GitHub
#9261 (comment).

matthieugomez · 2015-11-07T00:40:01Z

Thanks. I have another question. With your type definition, some functions defined for AbstractStrings could directly work on GraphemeIterator, like chr2ind, nextind etc. but currently one needs to redefine all of them. Could GraphemeIterator be an AbstractString? If not could there be a common parent of ABstractStrings and Grapheme iterator?

This is consistent with eltype() and therefore collect(). Fixes #9261. (cherry picked from commit f379e08) ref #13903

nalimilan · 2015-11-07T11:30:33Z

@matthieugomez GraphemeIterator sounds very different from an AbstractString. It's more like a collection of strings, actually. What would chr2ind or nextind do on a GraphemeIterator?

matthieugomez · 2015-11-07T14:20:29Z

@nalimilan Sorry i was not clear, I'll try to develop.

The first reason is that AbstractString and GraphemeIterator are both iterators on characters. The only difference is that in a GraphemeIterator, a character is a SubString while in a AbstractString, it is a Char (well I think it has to be a Char but I'm actually not sure).

Another reason is that they have the same implementation. AbstractStrings and GraphemeIterator are both indexable array where only some indices are valid (only the definition of "valid" differs). For these structures, functions such as chr2ind and nextind make sense : they allow to map the next element to it starting index in the array. Moreover, since both iterators use the starting index as the iterator state, chr2ind and nextind should already work on GraphemeIterator (for instance nextind returns the starting index of the next grapheme) — yet they're only defined on AbstractString.

I'm writing this because I'm working on a package that computes various string distances. Most of these distances work by comparing characters. I wrote the package with the idea that a character = a Char. But it'd be nice if the user could also compare strings with the notion that a character = a grapheme. Looking at my code, I just need to sign all methods with a Union{AbstractString, Grapheme}, define SubString(x::GraphemeIterator,...) = graphemes(SubString(x.s,...)), and extend chr2ind and nextind to work on Graphemes (by copying the code used to define these functions on AbstractStrings). We may gain in having a more formal notion of inheritance here.

nalimilan · 2015-11-07T14:44:26Z

I think it would be more logical to consider iterating over characters and iterating over graphemes as two different ways of going over a string. You could define a CharacterIterator similar to GraphemeIterator, and use one or the other, or let the user pass one of these two types as an argument. Note that currently, GraphemeIterator isn't indexable: would you pass it a byte index or an index expressed in number of graphemes?

(Actually, one could even advocate that iterating over a string shouldn't be possible: you would be required to do for c in characters(s) or for c in graphemes(s), so that you make an explicit choice.)

matthieugomez · 2015-11-07T14:58:35Z

I need more than the iterator interface: I also need functions such as nextind, chr2ind (which differ between chars vs graphemes), or endof (which should return the same value for both).

To sum up my point:

Defining nextind and chr2ind on GraphemeIterator would allow to write more generic code. A good example is the definition of an iterator on q-grams (substrings composed of q subsequent characters of a string)
It turns out that, while nextind and chr2indare defined with an AbstractString signature, they would directly work on GraphemeIterators, due to the way start, next are implemented in this pull request.

So it seems like there's some duplication going on, but I'm not sure what the best solution is.

matthieugomez · 2015-11-09T23:11:54Z

I'm trying to follow @nalimilan idea by defining two different iterating wrappers for AbstractString. But it gets hard to rewrite stuff like split, join, etc on these iterators so that the iteration property of the string is conserved after passing the function. There must be a better solution (unless i'm doing it wrong).

nalimilan · 2015-11-10T09:03:31Z

@matthieugomez What do you mean exactly? Are you trying to allow calling split on these iterators?

matthieugomez · 2015-11-10T13:28:08Z

Yes.
I need to Substring or split these iterators (ie apply a function on the wrapped string and then wrap the output so that the iterator property of the original string is conserved). With enough of these methods these iterators start to look like strings, so it's not clear to me why they're not AbstractString.

Another thing I don't really like with the solution of implementing two custom iterator types is that this will make the code harder to read, cannot be copy pasted etc. It'd be nice to have the grapheme iteration "for free" rather than rewriting everything for new iterator objects rather than string objects.
I've written a short implementation of Grapheme as an AbstractString (comments mentions some issues) here

nalimilan · 2015-11-10T19:52:56Z

I think you misunderstood my suggestion. My proposal was to always work with the standard string, and apply operations (split...) to it, but to also carry an iterator type that you call on that string when you need to iterate over its characters or its graphemes. But I'm not in position to evaluate whether this suits your needs.

stevengj added unicode Related to unicode characters and encodings feature labels Dec 7, 2014

stevengj force-pushed the graphemes branch from ce29e63 to 47754eb Compare December 7, 2014 02:37

This was referenced Dec 7, 2014

incorrect extended grapheme segmentation JuliaStrings/utf8proc#19

Closed

Introduce splice_buffer! abstraction for IOBuffer in the REPL #6429

Merged

UTF-8 combining characters and normalization in reverse() #6165

Closed

ivarne reviewed Dec 7, 2014
View reviewed changes

stevengj force-pushed the graphemes branch from 47754eb to aa2a579 Compare December 7, 2014 14:51

stevengj force-pushed the graphemes branch from aa2a579 to 5ad2e66 Compare December 8, 2014 02:15

nalimilan mentioned this pull request Dec 10, 2014

Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

Closed

stevengj mentioned this pull request Dec 12, 2014

Update graphemes for Unicode 7 JuliaStrings/utf8proc#20

Merged

tonyhffong mentioned this pull request Dec 14, 2014

string input field's cursor movement should be over graphemes tonyhffong/TermWin.jl#10

Closed

stevengj force-pushed the graphemes branch 3 times, most recently from 9d5dc2b to 2b44660 Compare December 15, 2014 19:41

stevengj force-pushed the graphemes branch from 2b44660 to 29d55e8 Compare December 15, 2014 19:46

add graphemes(s) function to iterate over graphemes (represented by s…

3c1b839

…ubstrings) of a string s

stevengj added a commit that referenced this pull request Dec 17, 2014

Merge pull request #9261 from stevengj/graphemes

2364748

add graphemes(s) function to iterate over string graphemes

stevengj merged commit 2364748 into JuliaLang:master Dec 17, 2014

nalimilan added a commit that referenced this pull request Nov 6, 2015

Fix next(::GraphemeIterator) to return SubString

f379e08

This is consistent with eltype() and therefore collect(). Fixes #9261.

nalimilan mentioned this pull request Nov 6, 2015

Fix next(::GraphemeIterator) to return SubString #13903

Merged

stevengj deleted the graphemes branch November 6, 2015 22:32

nalimilan added a commit that referenced this pull request Nov 7, 2015

Fix next(::GraphemeIterator) to return SubString

61230f2

This is consistent with eltype() and therefore collect(). Fixes #9261. (cherry picked from commit f379e08) ref #13903

nalimilan mentioned this pull request Mar 29, 2016

RFC, WIP: highlander #14383

Closed

add graphemes(s) function to iterate over string graphemes #9261

add graphemes(s) function to iterate over string graphemes #9261

Conversation

stevengj commented Dec 7, 2014

stevengj commented Dec 7, 2014

StefanKarpinski commented Dec 7, 2014

ivarne commented Dec 7, 2014

ivarne Dec 7, 2014

Choose a reason for hiding this comment

tkelman commented Dec 8, 2014

stevengj commented Dec 8, 2014

nalimilan commented Dec 8, 2014

StefanKarpinski commented Dec 8, 2014

stevengj commented Dec 8, 2014

nalimilan commented Dec 8, 2014

stevengj commented Dec 8, 2014

nalimilan commented Dec 13, 2014

stevengj commented Dec 15, 2014

stevengj commented Dec 16, 2014

jiahao commented Dec 16, 2014

tkelman commented Dec 16, 2014

tkelman commented Dec 17, 2014

stevengj commented Dec 17, 2014

tonyhffong commented Dec 21, 2014

tkelman commented Dec 21, 2014

tonyhffong commented Dec 21, 2014

matthieugomez commented Nov 6, 2015

stevengj commented Nov 6, 2015

matthieugomez commented Nov 6, 2015

nalimilan commented Nov 6, 2015

stevengj commented Nov 6, 2015

matthieugomez commented Nov 6, 2015

matthieugomez commented Nov 7, 2015

nalimilan commented Nov 7, 2015

matthieugomez commented Nov 7, 2015

nalimilan commented Nov 7, 2015

matthieugomez commented Nov 7, 2015

matthieugomez commented Nov 9, 2015

nalimilan commented Nov 10, 2015

matthieugomez commented Nov 10, 2015

nalimilan commented Nov 10, 2015