NFKC form incorrectly distinguishes two compatibility-equivalent strings #8

stedolan · 2017-05-20T09:33:40Z

UAX #15 specifies that toNFKC(x) = toNFKC(toNFD(x)): since x and toNFD(x) are canonically equivalent, they are also compatibility equivalent and must have the same NFKC form.

The sequence:

U+ff80 U+1fd3 U+ff9e U+1fd3

is normalised to form NFD by uunf as:

U+ff80 U+3b9 U+308 U+301 U+ff9e U+3b9 U+308 U+301

and normalised to form NFKC as:

U+30c0 U+390 U+390

However, when the NFD form above is further normalised to form NFKC, uunf produces the following output, which differs from the NFKC form above:

U+30bf U+390 U+3099 U+390

(I found this example using crowbar. The full testcase is here)
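For anyone wanting to cross-check the invariant independently of uunf, here is a minimal sketch in Python. It assumes the standard library's `unicodedata` module as a reference normaliser; that is a stand-in, not uunf's API, and its results depend on the Unicode version bundled with the interpreter. The helper names `fmt` and `nfkc_invariant_holds` are made up for illustration.

```python
import unicodedata


def fmt(s):
    """Render a string as space-separated code points, in the style used above."""
    return " ".join(f"U+{ord(c):x}" for c in s)


def nfkc_invariant_holds(s):
    """Check the UAX #15 invariant toNFKC(s) == toNFKC(toNFD(s))."""
    direct = unicodedata.normalize("NFKC", s)
    via_nfd = unicodedata.normalize("NFKC", unicodedata.normalize("NFD", s))
    return direct == via_nfd


# The sequence reported in this issue.
seq = "\uff80\u1fd3\uff9e\u1fd3"

nfd = unicodedata.normalize("NFD", seq)
print("NFD:            ", fmt(nfd))
print("NFKC (direct):  ", fmt(unicodedata.normalize("NFKC", seq)))
print("NFKC (via NFD): ", fmt(unicodedata.normalize("NFKC", nfd)))
print("invariant holds:", nfkc_invariant_holds(seq))
```

A normaliser that passes this check for one input is not thereby proven correct, but any input for which the two NFKC results differ points at a bug in one of the normalisation paths.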


stedolan · 2017-05-24T13:20:51Z

I left the fuzzing job running for longer, and it found some more cases. The simplest one seems to be this:

[U+1c6 U+32d "ǆ̭"]

when normalised to form NFKD and then to form NFKC gives the correct (I think) NFKC form:

[U+64 U+17e U+32d "dž̭"]

However, if it's normalised to NFKC directly, then the COMBINING CIRCUMFLEX ACCENT BELOW gets incorrectly moved past the z and attaches to the d:

[U+1e13 U+17e "ḓž"]
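The same comparison can be sketched in Python for this smaller case, again assuming `unicodedata` as an independent reference rather than uunf. It also prints the canonical combining class of each code point in the NFKD form, which shows where the mark below sits relative to the z and its caron after canonical ordering. Variable names are illustrative only.

```python
import unicodedata

# U+01C6 (dz with caron digraph) followed by U+032D (circumflex accent below).
s = "\u01c6\u032d"

nfkd = unicodedata.normalize("NFKD", s)
via_nfkd = unicodedata.normalize("NFKC", nfkd)  # NFKD first, then NFKC
direct = unicodedata.normalize("NFKC", s)       # NFKC in one step

# Pair each code point of the NFKD form with its canonical combining class.
ccc = [(f"U+{ord(c):x}", unicodedata.combining(c)) for c in nfkd]

print("NFKD with combining classes:", ccc)
print("NFKC via NFKD:", [f"U+{ord(c):x}" for c in via_nfkd])
print("NFKC direct:  ", [f"U+{ord(c):x}" for c in direct])
print("paths agree:  ", via_nfkd == direct)
```

In the NFKD form the mark below follows the z, not the d, so a composition step that pairs it with the d has skipped over a starter.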

dbuenzli · 2017-05-25T14:50:35Z

Thanks for the reduction. There was a bug in the (terrible) implementation of the canonical composition algorithm. It has been rewritten more cleanly, and hopefully correctly, in 9459c90.

dbuenzli added the bug label May 23, 2017

stedolan mentioned this issue May 24, 2017

Bugs found with Crowbar stedolan/crowbar#2

Open

dbuenzli closed this as completed in 9459c90 May 25, 2017

stedolan mentioned this issue May 26, 2017

Normalisation bug with Hangul / symbols sequence #10

Closed
