`collect(graphemes("🤦🏼‍♂️"))` results in ERROR: ArgumentError: destination has fewer elements than required #37680

kdheepak · 2020-09-21T13:16:17Z

I believe there's an issue with the implementation of iterate on the result of graphemes("🤦🏼‍♂️").

The following works as expected:

julia> s = "🤦🏼‍♂️"
"🤦🏼\u200d♂️"

julia> length(s) == 5
true

julia> [s[i] for i in eachindex(s)]
5-element Array{Char,1}:
 '🤦': Unicode U+1F926 (category So: Symbol, other)
 '🏼': Unicode U+1F3FC (category Sk: Symbol, modifier)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '♂': Unicode U+2642 (category So: Symbol, other)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)

julia> using Unicode

julia> length(graphemes(s)) == 1
true

However, this is the error I get when I try to collect the result of the Iterator in the latest stable release of Julia (v1.5.1).

julia> collect(graphemes(s))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[5]:1

The implementation of length appears to be correct, but I think there's a bug in the implementation of iterate:

julia> for (i, c) in enumerate(graphemes(s))
       println(i, " ", c)
       end
1 🤦🏼‍
2 ♂️

I believe the correct behavior in this case is to return just the first element. iterate seems to be returning the Symbol ♂ as well.

The text was updated successfully, but these errors were encountered:

simeonschaub · 2020-09-21T14:30:34Z

I can reproduce this on master.

simeonschaub · 2020-09-21T15:05:54Z

I think the problem is that the state passed to isgraphemebreak! needs to be initialized on the first iteration with the null character. The following fixes the issue for me. Is this the correct way to do it?

diff --git a/base/strings/unicode.jl b/base/strings/unicode.jl
index 235f85184d..029787b3c1 100644
--- a/base/strings/unicode.jl
+++ b/base/strings/unicode.jl
@@ -671,7 +671,7 @@ function length(g::GraphemeIterator{S}) where {S}
     return n
 end
 
-function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
+function iterate(g::GraphemeIterator{S}, i_=(Int32(0),firstindex(g.s))) where {S}
     s = g.s
     statei, i = i_
     state = Ref{Int32}(statei)
@@ -679,6 +679,7 @@ function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
     y = iterate(s, i)
     y === nothing && return nothing
     c0, k = y
+    i == firstindex(g.s) && isgraphemebreak!(state, eltype(S)(0x00000000), c0)
     while k <= ncodeunits(s) # loop until next grapheme is s[i:j]
         c, ℓ = iterate(s, k)
         isgraphemebreak!(state, c0, c) && break

Keno · 2020-09-21T17:27:29Z

I don't think that should be required. If it is, that potentially seems like a bug in utf8proc

simeonschaub · 2020-09-21T17:36:26Z

No, it seems to be documented: https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html#aae83bdcabf3a97c1046c0700ba353640. Apparently, we don't need to pass the state around though, we can just always set it to zero initially.

Keno · 2020-09-21T17:42:58Z

I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required though that's a bug in utf8proc where it should reset the state by itself.

simeonschaub · 2020-09-21T17:50:58Z

I just thought this was intended because that's how it's done in length(::GraphemeIterator), but perhaps that was just a workaround.

Keno · 2020-09-21T17:53:46Z

It's definitely not supposed to be needed because eltype(S)(0x00000000) is not a character that occurs in the source. Something else is wrong (maybe a missing reset of the state in utf8proc). And of course it should be refactored to not need to useless carry around the state.

kdheepak · 2020-09-22T18:08:16Z

Thanks for the comment @Keno and @simeonschaub for looking into this!

I am going through the source of utf8proc.c and utf8proc.h to see if I can narrow it down.

UAX#29 defines the grapheme cluster boundary rules in this section. Specifically there's this rule: http://www.unicode.org/reports/tr29/#GB11

Do not break within emoji modifier 					sequences or emoji zwj sequences.
--
GB11 | \p{Extended_Pictographic} 						Extend* ZWJ | × | \p{Extended_Pictographic}

These lines in grapheme_break_simple manage the rule for this:

(lbc == UTF8PROC_BOUNDCLASS_E_ZWG &&              // GB11 (requires additional handling below)
 tbc == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) ? false : // ----

My understanding is that this is implemented as a state machine.
The grapheme_break_simple function that contains the code above is called from grapheme_break_extended which manages the state machine.

I think it is this part of the "state machine" that is causing the issue.

else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
  if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
    *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
  else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
    *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
  else
    *state = tbc;
}

I did some print style debugging to figure out what is going on.

Here is the state machine transition in the following format:

tbc (tbc -> boundclass) : from_state -> to_state = break_permitted

'🏼'     (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_START                 -> UTF8PROC_BOUNDCLASS_EXTEND                = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ)                   : UTF8PROC_BOUNDCLASS_EXTEND                -> UTF8PROC_BOUNDCLASS_ZWJ                   = false
'♂'      (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_ZWJ                   -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = true
'️'       (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false

You can see that when the boundclass is UTF8PROC_BOUNDCLASS_ZWJ and the state goes from UTF8PROC_BOUNDCLASS_EXTEND to UTF8PROC_BOUNDCLASS_ZWJ. I believe this state should be UTF8PROC_BOUNDCLASS_E_ZWG instead. The current implementation of the state machine "loses" the information that it was part of an extending grapheme cluster.

I made the following change

    if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
      *state = UTF8PROC_BOUNDCLASS_OTHER;
    // Special support for GB11 (emoji extend* zwj / emoji)
    else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
      if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
        *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
      else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
        *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
      else
        *state = tbc;
    }
+   else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+       *state = UTF8PROC_BOUNDCLASS_E_ZWG;       
    else
      *state = tbc;

And now the state machine goes from UTF8PROC_BOUNDCLASS_EXTEND to UTF8PROC_BOUNDCLASS_E_ZWG.

'🏼'     (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_START                 -> UTF8PROC_BOUNDCLASS_EXTEND                = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ)                   : UTF8PROC_BOUNDCLASS_EXTEND                -> UTF8PROC_BOUNDCLASS_E_ZWG                 = false
'♂'      (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_E_ZWG                 -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false
'️'       (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false

I ran make check in the utf8proc folder after making this change and the tests pass.

Edit: Ignore the text below in this comment. See next comment for more information.

~~However, I am still getting behavior that doesn't make sense.~~

Specifically, there's this rule that says:

Break at the start and end of text, unless the text is empty.

I don't understand why the specification says "break at the start" or what "unless the text is empty" means. What "text" is being referred to here?

This is the implementation for this rule:

(lbc == UTF8PROC_BOUNDCLASS_START) ? true :       // GB1

This returns true at every UTF8PROC_BOUNDCLASS_START. Is that a correct? Maybe I'm misinterpreting the specification but that doesn't seem like what it should do, right?

I changed that line to the following:

(lbc == UTF8PROC_BOUNDCLASS_START) ? false

This is obviously a violation of the specification and with this change make check also fails.

But with this change and with the fix to the state machine that I mentioned above I'm getting the behavior I expect. This also seems to work for every other case I experiment with. I didn't run a full test suite though and this is my first foray into the Unicode specification and I've only read this section.

Any suggestions here would be very welcome.

kdheepak · 2020-09-22T19:29:12Z

I took a second stab at this just to explore again what is going on. It turns out I had a bug in my implementation! (I had re-implemented the state machine in Julia for my testing and messed up one line there.)

This is the only change I needed to make to resolve this issue, exactly the same as the previous comment.

    if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
      *state = UTF8PROC_BOUNDCLASS_OTHER;
    // Special support for GB11 (emoji extend* zwj / emoji)
    else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
      if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
        *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
      else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
        *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
      else
        *state = tbc;
    }
+   else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+       *state = UTF8PROC_BOUNDCLASS_E_ZWG;       
    else
      *state = tbc;

Ignore the part about the UTF8PROC_BOUNDCLASS_START state from the previous comment.

The only question I have is what other states should transition to UTF8PROC_BOUNDCLASS_E_ZWG?

Using fix proposed here JuliaLang/julia#37680

archermarx · 2020-11-19T03:12:09Z

I've been experimenting with this to solve a similar problem, and it seems that all of these cases are only broken if the emoji cluster appears at the beginning of the string

For example:

julia> collect(graphemes("🤦🏼‍♂️"))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[3]:1

julia> collect(graphemes("a" * "🤦🏼‍♂️"))
2-element Array{SubString{String},1}:
 "a"
 "🤦🏼\u200d♂️"

julia> collect(graphemes("🇸🇪🇸🇪"))
2-element Array{SubString{String},1}:
 "🇸🇪🇸"
 "🇪"

julia> collect(graphemes("a"*"🇸🇪🇸🇪"))
3-element Array{SubString{String},1}:
 "a"
 "🇸🇪"
 "🇸🇪"

julia> collect(graphemes("👨🏻‍🤝‍👨🏽"))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[4]:1

julia> collect(graphemes("a"*"👨🏻‍🤝‍👨🏽"))
2-element Array{SubString{String},1}:
 "a"
 "👨🏻\u200d🤝\u200d👨🏽"

kdheepak · 2020-11-19T03:19:19Z

Thanks for looking into it! Does it fail even with the patch?

it seems that all of these cases are only broken if the emoji cluster appears at the beginning of the string

It probably enters the state machine into this incorrect state only in some cases.

archermarx · 2020-11-19T03:33:53Z

Just did some more tests. The patch fixes "🤦🏼‍♂️" and ""👨🏻‍🤝‍👨🏽" but not "🇸🇪🇸🇪", while prepending a letter fixes all three. There must be something to do with the initialization of the state

kdheepak · 2020-11-19T04:19:57Z

I have a patch for this that fixes the two (or multiple flags) case 🇸🇪🇸🇪:

diff --git a/utf8proc.c b/utf8proc.c
index 6591976..086f430 100644
--- a/utf8proc.c
+++ b/utf8proc.c
@@ -301,6 +301,8 @@ static utf8proc_bool grapheme_break_extended(int lbc, int tbc, utf8proc_int32_t
     // forbidden by a different rule such as GB9).
     if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
       *state = UTF8PROC_BOUNDCLASS_OTHER;
+    else if (*state == UTF8PROC_BOUNDCLASS_START && lbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
+      *state = UTF8PROC_BOUNDCLASS_OTHER;
     // Special support for GB11 (emoji extend* zwj / emoji)
     else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
       if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji

This patch special cases when there's a regional indicator in the start.

archermarx · 2020-11-19T04:42:27Z

Oh sick. I can add that to my fork and submit a PR. I've got some additional tests added to the graphemetest.c file

kdheepak · 2020-11-19T05:07:43Z

Thanks for pushing it along!

stevengj · 2020-11-23T17:38:28Z

Update: the underlying bug here was that the initial state = 0 needs to be converted internally to the bound class of the first character. That was already done by utf8proc_map, which is why utf8proc passed the Unicode grapheme tests previously, but it wasn't being done when you called isgraphemebreak! (utf8proc_grapheme_break_stateful) manually. (Our updated tests now check both APIs.)

Once the tests on the latest utf8proc PR are green, I'll merge and then post a PR to make Julia use the updated utf8proc.

JeffBezanson added bug Indicates an unexpected problem or unintended behavior strings "Strings!" labels Sep 21, 2020

stevengj added the unicode Related to unicode characters and encodings label Sep 22, 2020

archermarx mentioned this issue Nov 19, 2020

Add isemoji function to Unicode stdlib and export it #38458

Closed

archermarx added a commit to archermarx/utf8proc that referenced this issue Nov 19, 2020

Fix grapheme handling for extended+ zwj emoji

ff480f1

Using fix proposed here JuliaLang/julia#37680

archermarx mentioned this issue Nov 19, 2020

Fix grapheme breaks on string-initial repeated regional flags and emoji with extended+zwj sequence JuliaStrings/utf8proc#204

Closed

stevengj mentioned this issue Nov 23, 2020

Fix grapheme breaks on string-initial JuliaStrings/utf8proc#205

Merged

stevengj mentioned this issue Nov 23, 2020

update to utf8proc 2.6 #38551

Merged

Keno closed this as completed in #38551 Nov 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`collect(graphemes("🤦🏼‍♂️"))` results in ERROR: ArgumentError: destination has fewer elements than required #37680

`collect(graphemes("🤦🏼‍♂️"))` results in ERROR: ArgumentError: destination has fewer elements than required #37680

kdheepak commented Sep 21, 2020 •

edited

Loading

simeonschaub commented Sep 21, 2020

simeonschaub commented Sep 21, 2020 •

edited by fredrikekre

Loading

Keno commented Sep 21, 2020

simeonschaub commented Sep 21, 2020

Keno commented Sep 21, 2020

simeonschaub commented Sep 21, 2020

Keno commented Sep 21, 2020

kdheepak commented Sep 22, 2020 •

edited

Loading

kdheepak commented Sep 22, 2020 •

edited

Loading

archermarx commented Nov 19, 2020 •

edited

Loading

kdheepak commented Nov 19, 2020 •

edited

Loading

archermarx commented Nov 19, 2020 •

edited

Loading

kdheepak commented Nov 19, 2020 •

edited

Loading

archermarx commented Nov 19, 2020

kdheepak commented Nov 19, 2020

stevengj commented Nov 23, 2020 •

edited

Loading

collect(graphemes("🤦🏼‍♂️")) results in ERROR: ArgumentError: destination has fewer elements than required #37680

collect(graphemes("🤦🏼‍♂️")) results in ERROR: ArgumentError: destination has fewer elements than required #37680

Comments

kdheepak commented Sep 21, 2020 • edited Loading

simeonschaub commented Sep 21, 2020

simeonschaub commented Sep 21, 2020 • edited by fredrikekre Loading

Keno commented Sep 21, 2020

simeonschaub commented Sep 21, 2020

Keno commented Sep 21, 2020

simeonschaub commented Sep 21, 2020

Keno commented Sep 21, 2020

kdheepak commented Sep 22, 2020 • edited Loading

kdheepak commented Sep 22, 2020 • edited Loading

archermarx commented Nov 19, 2020 • edited Loading

kdheepak commented Nov 19, 2020 • edited Loading

archermarx commented Nov 19, 2020 • edited Loading

kdheepak commented Nov 19, 2020 • edited Loading

archermarx commented Nov 19, 2020

kdheepak commented Nov 19, 2020

stevengj commented Nov 23, 2020 • edited Loading

`collect(graphemes("🤦🏼‍♂️"))` results in ERROR: ArgumentError: destination has fewer elements than required #37680

`collect(graphemes("🤦🏼‍♂️"))` results in ERROR: ArgumentError: destination has fewer elements than required #37680

kdheepak commented Sep 21, 2020 •

edited

Loading

simeonschaub commented Sep 21, 2020 •

edited by fredrikekre

Loading

kdheepak commented Sep 22, 2020 •

edited

Loading

kdheepak commented Sep 22, 2020 •

edited

Loading

archermarx commented Nov 19, 2020 •

edited

Loading

kdheepak commented Nov 19, 2020 •

edited

Loading

archermarx commented Nov 19, 2020 •

edited

Loading

kdheepak commented Nov 19, 2020 •

edited

Loading

stevengj commented Nov 23, 2020 •

edited

Loading