Avoid splitting surrogate pairs in StringGenerator #40

Open
vlsi opened this issue Sep 16, 2019 · 2 comments

@vlsi

vlsi commented Sep 16, 2019

I was about to test my Gradle PR, and unfortunately StringGenerator fails to support happy smiles :(

For instance:

    @Unroll
    def randomManifest() {
        when:
        println(string.length() + " " + string)

        then:
        1==1

        where:
        string << attributeValue().take(200)
    }

    private static def attributeValue() {
        new StringGenerator(1, 5, '😃')
    }

produces:

3 ???
2 😃
3 ?😃
1 ?
1 ?
1 ?
3 ???
5 ?😃??

In case you are not very familiar with surrogate pairs:

  1. Sometimes Java uses two consecutive char values to represent a single value, which is called a code point. For instance, 😃 is 2 chars (a high surrogate followed by a low surrogate) but 1 code point.
  2. It is illegal to split the pair. StringGenerator picks individual char values, so it effectively splits the pair and generates malformed strings.

So at minimum, StringGenerator should check whether a char is part of a surrogate pair (e.g. via String#codePointAt) and treat the two chars as a single unit. That would let spock-genesis support items like 😃 and 💩
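A minimal sketch of that idea, in plain Java: pick whole code points instead of individual chars. The class and method names below are hypothetical and are not part of spock-genesis; the sketch also assumes that the length bounds are counted in code points, which is a design choice rather than something StringGenerator does today.

```java
import java.util.Random;

// Hypothetical helper for illustration only; not the spock-genesis API.
final class CodePointStringSketch {

    /** Builds a random string whose length, counted in code points, is in [minLength, maxLength]. */
    static String randomString(Random random, int minLength, int maxLength, String alphabet) {
        // codePoints() yields each surrogate pair as one int, so a pair is never split.
        int[] codePoints = alphabet.codePoints().toArray();
        if (codePoints.length == 0 || minLength < 0 || maxLength < minLength) {
            throw new IllegalArgumentException("empty alphabet or invalid length bounds");
        }
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.appendCodePoint(codePoints[random.nextInt(codePoints.length)]);
        }
        return sb.toString();
    }

    /** Returns true if every surrogate char in s belongs to a high+low pair. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a following low surrogate
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }
}
```

With this approach `string.length()` can be up to twice the requested maximum, because each 😃 still occupies two chars.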

  3. There are cases when multiple code points produce a combined glyph.
    For instance, न followed by ि produces नि
    That is not a surrogate pair, so it is "legal" to split those chars; however, splitting them changes how the text is rendered.
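To make the distinction concrete, here is a small hedged sketch that classifies code points by their Unicode general category. The class and method names are hypothetical. It only detects combining marks; it does not find full grapheme boundaries.

```java
// Hypothetical helper for illustration only.
final class CombiningMarkSketch {

    /** True for code points in the Unicode mark categories Mn, Mc and Me. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** True if the string begins with a mark that has no base character before it. */
    static boolean startsWithCombiningMark(String s) {
        return !s.isEmpty() && isCombiningMark(s.codePointAt(0));
    }
}
```

A check like `startsWithCombiningMark` could flag generated strings in which a mark ends up without its base character, even though no surrogate pair was split.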

I'm sure you've seen https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 , and the madness like Rege̿̔̉x is exactly that: the letter e followed by combining marks, which renders as an e with lots of accent marks.

If I pass that super-e as new StringGenerator(1, 5, 'ue̿̔̉'), then the following result is produced (note how certain marks climb over u):

5 e̿̔e̔
3 uu̔
3 ẻu
2 ̿̔
1 ̔
2 u̔
2 ̿̿
1 ̿
3 ̉e̿

To handle that one might use BreakIterator.
Here you go:

import java.text.BreakIterator

// Walk the character (grapheme) boundaries reported by BreakIterator
BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ENGLISH)
def text = "Rege̿̔̉x😃नि"
bi.setText(text)
int boundary = bi.first()
while (true) {
    int nextBoundary = bi.next()
    if (nextBoundary == BreakIterator.DONE) {
        break
    }
    // Each [boundary..nextBoundary) range is one unit that should not be split
    println("[$boundary..$nextBoundary), length: ${nextBoundary - boundary}: " + text.substring(boundary, nextBoundary))
    boundary = nextBoundary
}

produces (you can see that fancy-e consumes 4 chars):

[0..1), length: 1: R
[1..2), length: 1: e
[2..3), length: 1: g
[3..7), length: 4: e̿̔̉
[7..8), length: 1: x
[8..10), length: 2: 😃
[10..12), length: 2: नि
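Building on the loop above, a generator could first split its alphabet into these units and then pick whole units at random. The following is only a sketch under that assumption: the class and method names are hypothetical, lengths are counted in units rather than chars, and it relies on the JDK's BreakIterator behaving as shown in the output above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Hypothetical helper for illustration only; not the spock-genesis API.
final class GraphemeStringSketch {

    /** Splits text at the character boundaries reported by BreakIterator. */
    static List<String> clusters(String text) {
        BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ENGLISH);
        bi.setText(text);
        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    /** Builds a random string made of minLength..maxLength whole units from the alphabet. */
    static String randomString(Random random, int minLength, int maxLength, String alphabet) {
        List<String> units = clusters(alphabet);
        if (units.isEmpty() || minLength < 0 || maxLength < minLength) {
            throw new IllegalArgumentException("empty alphabet or invalid length bounds");
        }
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            // A unit is appended as a whole, so marks stay attached to their base character.
            sb.append(units.get(random.nextInt(units.size())));
        }
        return sb.toString();
    }
}
```

An alphabet that itself starts with a combining mark would still produce a unit without a base character, so this does not cover every case.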

WDYT?

@vlsi

vlsi commented Sep 16, 2019

Just in case you wondered:

an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s is split by BreakIterator as follows:

[0..1), length: 1: a
[1..2), length: 1: n
[2..3), length: 1: ​
[3..8), length: 5: *̶͑̾̾
[8..9), length: 1: ​
[9..10), length: 1: ̅
[10..11), length: 1: ͫ
[11..12), length: 1: ͏
[12..13), length: 1: ̙
[13..14), length: 1: ̤
[14..23), length: 9: g͇̫͛͆̾ͫ̑͆
[23..34), length: 11: l͖͉̗̩̳̟̍ͫͥͨ
[34..37), length: 3: e̠̅
[37..38), length: 1: s

destro҉ying is split as

[0..1), length: 1: d
[1..2), length: 1: e
[2..3), length: 1: s
[3..4), length: 1: t
[4..5), length: 1: r
[5..7), length: 2: o҉
[7..8), length: 1: y
[8..9), length: 1: i
[9..10), length: 1: n
[10..11), length: 1: g

rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ is split as

[0..1), length: 1: r
[1..6), length: 5: è̑ͧ̌
[6..8), length: 2: aͨ
[8..17), length: 9: l̘̝̙̃ͤ͂̾̆

PS. The outputs are produced by OpenJDK 11

So it looks like BreakIterator is quite good at identifying character boundaries of the fancy strings.

@vlsi

vlsi commented Sep 16, 2019

The 'sad' thing is that BreakIterator is locale-dependent (see https://docs.oracle.com/javase/tutorial/i18n/text/char.html ).
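One way to keep that dependency visible is to pass the locale explicitly instead of relying on the default, and to compare the boundaries that two locales report for a given alphabet. The helper below is a hypothetical sketch; whether any two locales actually disagree for a particular input is not something it assumes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class LocaleBoundarySketch {

    /** Returns all character boundaries of text for the given locale, starting at 0. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = bi.first(); b != BreakIterator.DONE; b = bi.next()) {
            result.add(b);
        }
        return result;
    }

    /** True if both locales split the text at the same positions. */
    static boolean sameBoundaries(String text, Locale a, Locale b) {
        return boundaries(text, a).equals(boundaries(text, b));
    }
}
```

A generator could call `boundaries(alphabet, Locale.ROOT)` so that results do not change with the JVM's default locale.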
