add graphemes(s) function to iterate over string graphemes #9261
Note also that in order to implement this cleanly I had to correct an inconsistency in |
Clearly useful and should be exposed. @JeffBezanson, I thought we had changed this so that the indices in range indexing had to be valid character offsets. Did I misremember that or did we not pull the trigger? |
We didn't pull the trigger last time I raised the IndexError issue in #7811, but that is a separate issue. |
@@ -103,7 +103,7 @@ function getindex(s::UTF8String, r::UnitRange{Int})
     if !is_utf8_start(d[i])
         i = nextind(s,i)
     end
-    if j > endof(s)
+    if j > length(d)
Isn't length the number of code points? I think we want sizeof.
Sorry, it's length(d) not length(s), so this is correct, but confusing.
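The confusion in this review comment comes from a UTF-8 string having two different "lengths": the number of bytes in the underlying data and the number of code points. As an illustration only, and in Go rather than Julia, the sketch below contrasts the two counts. `byteAndRuneCount` is a hypothetical helper name and is not part of this PR.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the number of bytes and the number of code
// points in s. (Hypothetical helper, for illustration only.)
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Plain ASCII, precomposed e-diaeresis, and e + combining diaeresis.
	for _, s := range []string{"noel", "no\u00ebl", "noe\u0308l"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d code points\n", s, b, r)
	}
}
```

A bounds check against the raw byte array, as in the changed line of the diff, compares against the first of these two numbers.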
I keep trying to click re-build PR on this in appveyor (status.julialang.org was down, it's back now) but it doesn't want to work. If you push to this again it should fire off another build. edit: and the latest build timed out due to #7942, but at least appveyor is letting me click rebuild this time |
Now that I look at this more closely, I think that utf8proc's approach of allocating a new string with (This will require us to re-implement utf8proc's grapheme code, but that's actually not such a big deal: we have to do that anyway since utf8proc's grapheme rules are way out of date as described in JuliaStrings/utf8proc#19.) |
@stevengj Would that mean that string iterators could be made to work on graphemes and still be reasonably efficient? (Cf. #6165 (comment) and subsequent comments.) |
The main issue there would be creating a new string object for every iteration, which currently would be very expensive. With the string work I'm doing it could be much cheaper – I presume we can be certain that no grapheme is ever longer than 7 bytes (or 15 bytes on 64-bit platforms)? Still, I'm not sure if this is a good idea. I'd be interested in comparing this to what Go does (they invented UTF-8 after all) – I believe they have something called a "rune" which is not quite a code point. Anyone familiar with this?
@StefanKarpinski, actually, the With my suggested non-utf8proc strategy, where performance is critical you could also use |
No, but reading http://blog.golang.org/strings it seems that a rune "means exactly the same as "code point", with one interesting addition", which seems to be that See also http://blog.golang.org/normalization for an explanation of iteration and normalization. This page notes that the Norm package provides an iterator over normalized characters. That page also says that the Unicode standard defines a Stream Safe text format, with a limit of 30 codepoints that can be combined into a grapheme, and Go apparently uses that assumption to make some functions more efficient. But the default iterator does not appear to do any normalization. For example, here's the result of iterating over a Unicode string, including the two forms of
package main

import "fmt"

func main() {
	// i is the byte offset of each code point; r is the code point itself.
	for i, r := range "Hello, 世界 noe\u0308l et noël" {
		fmt.Printf("%d: %c\n", i, r)
	}
}
0: H
1: e
2: l
3: l
4: o
5: ,
6:
7: 世
10: 界
13:
14: n
15: o
16: e
17: ̈
19: l
20:
21: e
22: t
23:
24: n
25: o
26: ë
28: l
I find this a bit unfortunate, but there may be good reasons not to normalize. Though I wonder whether at least normalization couldn't be performed on the fly, even if iteration continues to go over codepoints (runes in Go), and not graphemes. Not sure we'd gain much, if some code points still cannot be considered as "user-perceived characters".
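To make the gap between code points and "user-perceived characters" concrete, here is a rough sketch that groups each code point with the combining marks that directly follow it. This is an assumption-laden simplification: it is not UAX#29 segmentation, not what utf8proc does, and not Go's norm package (it ignores Hangul jamo, emoji sequences, and so on). `clusters` is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode"
)

// clusters splits s into substrings, each consisting of one code point
// plus any combining marks (Unicode category M) that directly follow it.
// Simplified approximation only, NOT UAX#29 grapheme segmentation.
func clusters(s string) []string {
	var out []string
	start := 0
	for i, r := range s {
		// A non-mark code point closes the previous cluster, if any.
		if i > start && !unicode.IsMark(r) {
			out = append(out, s[start:i])
			start = i
		}
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	// Decomposed and precomposed spellings of the same word.
	for _, s := range []string{"noe\u0308l", "no\u00ebl"} {
		fmt.Printf("%q -> %q\n", s, clusters(s))
	}
}
```

Both spellings come out as four units under this sketch, whereas the plain `range` loop above sees five code points in the decomposed form. Each unit is a slice of the original string, so no per-unit copy of the bytes is made.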
Yes, Go (@nalimilan, I think it would be possible to normalize on the fly in an iterator, albeit a bit hairy to code, but so far there hasn't been a lot of demand for this...so far only three packages are using |
@stevengj Actually, my vote now goes to require explicit choice of the iteration method. I don't think there's a default which makes more sense than another one, it completely depends on what you want to do. But offering a default is dangerous since it means people don't get to think about what they need and the assumptions they can make. |
9d5dc2b to 2b44660
I pushed a new version of this commit that uses the new
|
…ubstrings) of a string s
Anyone understand the AppVeyor failure? It looks like it just died in the middle of the build. |
I believe AppVeyor times out after 40 minutes. |
Yeah it's a freeze/timeout. The earlier failure was some breakage from 9266, now fixed in 9366 so I restarted this and other PR's that had been caught by the problem. Unfortunately we're still getting freezes and timeouts. I'm not sure what the cause of those is, whether it's a Julia problem or the AppVeyor machine is losing its connection or something. Since it only ever happens for 64-bit Julia I suspect it's something in Julia. See http://help.appveyor.com/discussions/problems/1206-inexplicable-timeouts and appveyor/ci#86 - the build usually seems to get stuck either during the system image or the first couple tests, which only takes about 10 minutes to get to. If AppVeyor had a feature like Travis where the build fails if no output is received for some amount of time, we could at least move through our queue faster. |
And it's timing out again. Feodor's looking into it, see if he can find anything. |
Seems like the tests are passing again. Since everyone seems to be in favor of including this functionality and I've heard no objections to the syntax (modulo larger questions about overhauling |
add graphemes(s) function to iterate over string graphemes
Would this be something backported to 0.3? Or should I think about freezing the TermWin.jl version for 0.3 if I choose to use graphemes?
This relies on functionality that's only in libmojibake, not upstream utf8proc. Progress is being made on upstreaming our changes, as I understand it. But we haven't backported the dependency change so this would be difficult to backport without also backporting all the other mojibake-related changes. |
Ok, thanks. |
I'm using |
@matthieugomez, I don't understand your question. |
I'm on my phone, but first(graphemes("hello")) is an ASCIIString, not a SubString.
This is consistent with eltype() and therefore collect(). Fixes #9261.
Good catch! See #13903. |
Oh, right; I think I thought that |
Yes, so the enumeration generates strings. These strings are converted into
Thanks. I have another question. With your type definition, some functions defined for AbstractStrings could directly work on GraphemeIterator, like chr2ind, nextind etc., but currently one needs to redefine all of them. Could GraphemeIterator be an AbstractString? If not, could there be a common parent of AbstractString and GraphemeIterator?
@matthieugomez GraphemeIterator sounds very different from an |
@nalimilan Sorry, I was not clear; I'll try to elaborate. The first reason is that Another reason is that they have the same implementation. AbstractStrings and I'm writing this because I'm working on a package that computes various string distances. Most of these distances work by comparing characters. I wrote the package with the idea that a character = a Char. But it'd be nice if the user could also compare strings with the notion that a character = a grapheme. Looking at my code, I just need to sign all methods with a |
I think it would be more logical to consider iterating over characters and iterating over graphemes as two different ways of going over a string. You could define a (Actually, one could even advocate that iterating over a string shouldn't be possible: you would be required to do |
I need more than the iterator interface: I also need functions such as To sum up my point:
So it seems like there's some duplication going on, but I'm not sure what the best solution is. |
I'm trying to follow @nalimilan's idea by defining two different iterating wrappers for AbstractString. But it gets hard to rewrite stuff like |
@matthieugomez What do you mean exactly? Are you trying to allow calling |
Yes. Another thing I don't really like with the solution of implementing two custom iterator types is that this will make the code harder to read, cannot be copy pasted etc. It'd be nice to have the grapheme iteration "for free" rather than rewriting everything for new iterator objects rather than string objects. |
I think you misunderstood my suggestion. My proposal was to always work with the standard string, and apply operations ( |
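One way to picture the "explicit choice of iteration method" idea for the string-distance use case raised in this exchange is to keep working with ordinary strings and pass the unit of comparison in as a parameter. The Go sketch below is only an illustration under that assumption; `Splitter`, `byCodePoint`, `byByte` and `hamming` are hypothetical names, and none of this is the API proposed in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Splitter breaks a string into the units to be compared.
// (Hypothetical type, for illustration only.)
type Splitter func(string) []string

// byCodePoint returns one substring per code point.
func byCodePoint(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, string(r))
	}
	return out
}

// byByte returns one substring per byte.
func byByte(s string) []string {
	out := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, s[i:i+1])
	}
	return out
}

// hamming counts the positions at which a and b differ, measured in
// whatever units the caller's splitter produces.
func hamming(a, b string, split Splitter) (int, error) {
	ua, ub := split(a), split(b)
	if len(ua) != len(ub) {
		return 0, errors.New("inputs have different numbers of units")
	}
	d := 0
	for i := range ua {
		if ua[i] != ub[i] {
			d++
		}
	}
	return d, nil
}

func main() {
	a, b := "no\u00ebl", "no\u0142l" // differ in one code point, two bytes
	d, err := hamming(a, b, byCodePoint)
	fmt.Println("by code point:", d, err)
	d, err = hamming(a, b, byByte)
	fmt.Println("by byte:", d, err)
}
```

The distance code is written once, and a grapheme-based splitter could be supplied in the same way. Go's standard library does not provide one, so none is shown here.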
utf8proc/libmojibake includes a feature to break a string up into graphemes (as defined by UAX#29). This PR exports a new function graphemes(s::AbstractString) to expose this functionality as an iterator over substrings of s.
e.g. collect(graphemes("b̀lahβlahb̂láh")) yields SubString{UTF8String}["b̀","l","a","h","β","l","a","h","b̂","l","á","h"]. Note that these are strings, not characters, because graphemes may consist of multiple codepoints (e.g. "b̀" is "\u0062\u0300" in any normalization).
(This may be useful e.g. in the REPL if we want the arrow keys etcetera to work with graphemes as recommended by UAX#29.)
cc: @jiahao, @Keno