In case you are not very familiar with surrogate pairs:
Sometimes Java uses two consecutive char values to represent a single value. That value is called a code point. For instance, 😃 is 2 chars (a high surrogate followed by a low surrogate), but 1 code point.
It is illegal to split the pair. StringGenerator picks individual char values, so it effectively splits the pair, and that causes malformed strings to be generated.
So at minimum, StringGenerator should check whether a char belongs to a pair (e.g. by using String#codePointAt), and it should treat the two chars as a single unit. That would let spock-genesis support items like 😃 and 💩.
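To make the "treat two chars as a single unit" idea concrete, here is a minimal sketch in plain Java. CodePointPicker, units and random are hypothetical names used only for illustration; this is not how spock-genesis is implemented, and it assumes a non-empty alphabet.

```java
import java.util.Random;

// Hypothetical helper, not part of spock-genesis: picks whole code points instead of chars.
final class CodePointPicker {
    // Splits the alphabet into code points, so a surrogate pair stays one unit.
    static int[] units(String alphabet) {
        return alphabet.codePoints().toArray();
    }

    // Builds a string of `length` code points drawn at random from the alphabet.
    // Assumes the alphabet is not empty.
    static String random(String alphabet, int length, Random rnd) {
        int[] units = units(alphabet);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.appendCodePoint(units[rnd.nextInt(units.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE03"; // 😃 written as its surrogate pair
        System.out.println(smile.length());                          // 2 chars
        System.out.println(smile.codePointCount(0, smile.length())); // 1 code point
        System.out.println(random("ab" + smile, 5, new Random()));
    }
}
```

Because the generated string is assembled with appendCodePoint, a high surrogate can never end up without its low surrogate.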
There are also cases where multiple code points produce a combined glyph.
For instance, न followed by ि produces नि
That is not a surrogate pair, so it is "legal" to split those chars; however, splitting them would affect how the text is rendered.
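A small sketch of why surrogate handling alone does not cover the नि case: the sequence is two separate code points, and neither char is a surrogate. CombiningDemo and hasSurrogates are illustrative names only.

```java
// Illustrative only: a combining sequence is several code points, none of them surrogates.
final class CombiningDemo {
    // True if any char in the string is half of a surrogate pair.
    static boolean hasSurrogates(String s) {
        return s.chars().anyMatch(c -> Character.isSurrogate((char) c));
    }

    public static void main(String[] args) {
        String ni = "\u0928\u093F"; // U+0928 (न) followed by U+093F (ि), rendered as नि
        System.out.println(ni.length());                       // 2 chars
        System.out.println(ni.codePointCount(0, ni.length())); // 2 code points
        System.out.println(hasSurrogates(ni));                 // false
    }
}
```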
[0..1), length: 1: d
[1..2), length: 1: e
[2..3), length: 1: s
[3..4), length: 1: t
[4..5), length: 1: r
[5..7), length: 2: o҉
[7..8), length: 1: y
[8..9), length: 1: i
[9..10), length: 1: n
[10..11), length: 1: g
I was about to test my Gradle PR, and unfortunately StringGenerator fails to support happy smiles :(
For instance:
produces:
I'm sure you've seen https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 , and the madness like Rege̿̔̉x is exactly that. That is the letter e surrounded by extra features that produce an e with lots of accent marks.
If I pass that super-e as new StringGenerator(1, 5, 'ue̿̔̉'), then the following result is produced (note how certain marks climb over u):
To handle that one might use BreakIterator.
Here you go:
produces (you can see that fancy-e consumes 4 chars)
WDYT?