Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix grapheme breaks on string-initial repeated regional flags and emoji with extended+zwj sequence #204

Closed
wants to merge 0 commits into from

Conversation

archermarx
Copy link
Contributor

@archermarx archermarx commented Nov 19, 2020

Fixes #203. JuliaLang/julia#37680 will also be fixed once Julia updates utf8proc.

@kdheepak wrote the patches, I added some tests

utf8proc.c Outdated Show resolved Hide resolved
utf8proc.c Outdated Show resolved Hide resolved
utf8proc.c Outdated Show resolved Hide resolved
results_family[i] = utf8proc_grapheme_break_stateful(c1, c2, &state);
check(results_family[i] == expected_family[i], "Incorrect grapheme break on initial extended + zwj emoji");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check that utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND); works for these cases?

Conversely, we should check that manually calling utf8proc_grapheme_break_stateful works with the previous GraphemeBreakTest.txt testcases in addition to utf8proc_map.

@stevengj
Copy link
Member

stevengj commented Nov 21, 2020

It seems like a better fix would be:

static utf8proc_bool grapheme_break_extended(int lbc, int tbc, utf8proc_int32_t *state)
{
    int lbc_override;
    if (*state == UTF8PROC_BOUNDCLASS_START)
        *state = lbc; 
    else
        lbc_override = *state;
    ...
    as before
    ...
}

Isn't that really the underlying bug, that *state isn't being initialized correctly for the second codepoint since grapheme_break_extended is never called for the first codepoint?

@stevengj stevengj mentioned this pull request Nov 21, 2020
@kdheepak
Copy link

Wouldn't lbc_override be uninitialized if *state == UTF8PROC_BOUNDCLASS_START? Do you mean this instead?

    int lbc_override = lbc;
    if (*state == UTF8PROC_BOUNDCLASS_START)
        *state = lbc; 
    else
        lbc_override = *state;

Thanks for the feedback. My initial patch was based on a limited understanding of what is going on in utfproc. Feel free to suggest any recommendations you have.

I'm testing this now to see if your fix works.

@kdheepak
Copy link

I checked out the latest master from this repo and added the following patch

diff --git a/utf8proc.c b/utf8proc.c
index 6591976..00abc2c 100644
--- a/utf8proc.c
+++ b/utf8proc.c
@@ -290,8 +290,11 @@ static utf8proc_bool grapheme_break_simple(int lbc, int tbc) {
 
 static utf8proc_bool grapheme_break_extended(int lbc, int tbc, utf8proc_int32_t *state)
 {
-  int lbc_override = ((state && *state != UTF8PROC_BOUNDCLASS_START)
-                      ? *state : lbc);
+  int lbc_override = lbc;
+  if (*state == UTF8PROC_BOUNDCLASS_START)
+      *state = lbc;
+  else
+      lbc_override = *state;
   utf8proc_bool break_permitted = grapheme_break_simple(lbc_override, tbc);
   if (state) {
     // Special support for GB 12/13 made possible by GB999. After two RI

along with @archermarx's tests from this PR, and all the tests pass.

@stevengj
Copy link
Member

stevengj commented Nov 21, 2020

Yes, sorry:

    int lbc_override;
    if (*state == UTF8PROC_BOUNDCLASS_START)
        *state = lbc_override = lbc; 
    else
        lbc_override = *state;

@stevengj
Copy link
Member

I updated the PR to use the new fix.

@archermarx
Copy link
Contributor Author

Excellent!

@stevengj
Copy link
Member

Whoops, I accidentally overwrite your master with my master, rather than with my branch.

I re-opened as #205, with updated tests so that it also tests the utf8proc_map interface. Will merge once tests are green.

@archermarx
Copy link
Contributor Author

No worries!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Grapheme breaks
3 participants