Avoid splitting surrogate pairs in StringGenerator #40

vlsi · 2019-09-16T11:12:43Z

I was about to test my Gradle PR, and unfortunately StringGenerator fails to support happy smiles :(

For instance:

    @Unroll
    def randomManifest() {
        when:
        println(string.length() + " " + string)

        then:
        1==1

        where:
        string << attributeValue().take(200)
    }

    private static def attributeValue() {
        new StringGenerator(1, 5, '😃')
    }

produces:

3 ???
2 😃
3 ?😃
1 ?
1 ?
1 ?
3 ???
5 ?😃??

In case you are not very familiar with surrogate pairs:

Java sometimes uses two consecutive char values to represent a single value, which is called a code point. For instance, 😃 is 2 chars (a high surrogate followed by a low surrogate) but 1 code point.
It is illegal to split the pair. StringGenerator picks individual char values, so it effectively splits the pair and generates malformed strings.

So at minimum, StringGenerator should check whether a char belongs to a surrogate pair (e.g. with String#codePointAt) and treat the two chars as a single unit. That would let spock-genesis support items like 😃 and 💩.
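
A minimal sketch of that idea, assuming the generator may pick from whole code points instead of chars. The class and method names are hypothetical and are not part of spock-genesis; the alphabet is assumed to be non-empty.

```java
import java.util.Random;

/** Hypothetical helper, not part of spock-genesis: picks whole code points instead of chars. */
class CodePointStringSketch {

    /** Builds a string of `count` code points chosen at random from a non-empty `alphabet`. */
    static String random(String alphabet, int count, Random rnd) {
        // codePoints() merges each high/low surrogate pair into a single int
        int[] codePoints = alphabet.codePoints().toArray();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            // appendCodePoint writes both surrogates when the code point needs them
            sb.appendCodePoint(codePoints[rnd.nextInt(codePoints.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // U+1F603, written as its surrogate pair
        String s = random("ab" + smile, 5, new Random());
        // length() counts chars, codePointCount() counts code points
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code points: " + s);
    }
}
```

This keeps surrogate pairs intact, but it still treats every combining mark as an independent unit.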

There are cases when multiple code points produce a combined glyph.
For instance, न followed by ि produces नि (the vowel sign is stored after the consonant even though it is drawn before it).
That is not a surrogate pair, so it is "legal" to split those chars; however, splitting them changes how the text is rendered.
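
To illustrate the difference, here is a small sketch that checks a string for surrogates and counts combining marks by Unicode general category. The class and method names are made up for this example, and the escaped strings are sample inputs chosen for illustration.

```java
/** Hypothetical helper used only for illustration. */
class CombiningMarkSketch {

    /** True if any char in the text is one half of a surrogate pair. */
    static boolean hasSurrogates(String text) {
        return text.chars().anyMatch(c -> Character.isSurrogate((char) c));
    }

    /** Counts code points whose general category is a combining mark (Mn, Mc or Me). */
    static long combiningMarks(String text) {
        return text.codePoints().filter(cp -> {
            int type = Character.getType(cp);
            return type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
        }).count();
    }

    public static void main(String[] args) {
        String ni = "\u0928\u093F";            // Devanagari NA followed by VOWEL SIGN I
        String fancyE = "e\u033F\u0314\u0309"; // e followed by three combining marks
        System.out.println(hasSurrogates(ni) + " " + combiningMarks(ni));
        System.out.println(hasSurrogates(fancyE) + " " + combiningMarks(fancyE));
    }
}
```

Neither string contains a surrogate, so a check based only on code points would still happily separate the marks from their base letters.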

I'm sure you've seen https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 ; the madness like Rege̿̔̉x is exactly that: the letter e followed by combining marks, which renders as an e with lots of accent marks.

If I pass that super-e as new StringGenerator(1, 5, 'ue̿̔̉'), then the following result is produced (note how certain marks climb over u):

5 e̿̔e̔
3 uu̔
3 ẻu
2 ̿̔
1 ̔
2 u̔
2 ̿̿
1 ̿
3 ̉e̿

To handle that one might use BreakIterator.
Here you go:

    import java.text.BreakIterator

    // Iterate over user-perceived characters rather than individual chars
    BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ENGLISH)
    def text = "Rege̿̔̉x😃नि"
    bi.setText(text)
    int boundary = bi.first()
    while (true) {
        int nextBoundary = bi.next()
        if (nextBoundary == BreakIterator.DONE) {
            break
        }
        // Each [boundary, nextBoundary) range is one unit that should not be split
        println("[$boundary..$nextBoundary), length: ${nextBoundary - boundary}: " + text.substring(boundary, nextBoundary))
        boundary = nextBoundary
    }

produces (you can see that fancy-e consumes 4 chars)

[0..1), length: 1: R
[1..2), length: 1: e
[2..3), length: 1: g
[3..7), length: 4: e̿̔̉
[7..8), length: 1: x
[8..10), length: 2: 😃
[10..12), length: 2: नि
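
Building on that, a generator could first split its alphabet into these units and then pick whole units at random. The following is only a sketch under that assumption; `GraphemeStringSketch`, `clusters` and `random` are hypothetical names, not spock-genesis API, and the alphabet is assumed to be non-empty.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

/** Hypothetical helper, not part of spock-genesis. */
class GraphemeStringSketch {

    /** Splits text at the character boundaries reported by BreakIterator. */
    static List<String> clusters(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    /** Concatenates `count` randomly chosen units of a non-empty `alphabet`. */
    static String random(String alphabet, int count, Locale locale, Random rnd) {
        List<String> units = clusters(alphabet, locale);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            // A unit may span several chars (surrogate pair, base letter plus marks)
            sb.append(units.get(rnd.nextInt(units.size())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String alphabet = "u" + "e\u033F\u0314\u0309" + "\uD83D\uDE03";
        System.out.println(clusters(alphabet, Locale.ENGLISH).size() + " units");
        System.out.println(random(alphabet, 5, Locale.ENGLISH, new Random()));
    }
}
```

Note that the min/max length would then count units rather than chars, so the resulting String#length can exceed the requested maximum.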

WDYT?

vlsi · 2019-09-16T11:23:46Z

Just in case you wondered:

an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s is split by BreakIterator as follows:

[0..1), length: 1: a
[1..2), length: 1: n
[2..3), length: 1:
[3..8), length: 5: *̶͑̾̾
[8..9), length: 1:
[9..10), length: 1: ̅
[10..11), length: 1: ͫ
[11..12), length: 1: ͏
[12..13), length: 1: ̙
[13..14), length: 1: ̤
[14..23), length: 9: g͇̫͛͆̾ͫ̑͆
[23..34), length: 11: l͖͉̗̩̳̟̍ͫͥͨ
[34..37), length: 3: e̠̅
[37..38), length: 1: s

destro҉ying is split as

[0..1), length: 1: d
[1..2), length: 1: e
[2..3), length: 1: s
[3..4), length: 1: t
[4..5), length: 1: r
[5..7), length: 2: o҉
[7..8), length: 1: y
[8..9), length: 1: i
[9..10), length: 1: n
[10..11), length: 1: g

rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ is split as

[0..1), length: 1: r
[1..6), length: 5: è̑ͧ̌
[6..8), length: 2: aͨ
[8..17), length: 9: l̘̝̙̃ͤ͂̾̆

PS. The outputs are produced by OpenJDK 11

So it looks like BreakIterator is quite good at identifying character boundaries of the fancy strings.

vlsi · 2019-09-16T14:41:09Z

The 'sad' thing is that BreakIterator is locale-dependent (see https://docs.oracle.com/javase/tutorial/i18n/text/char.html ).

vlsi mentioned this issue Sep 16, 2019

Improve Manifset creation gradle/gradle#10724
