NFKC form incorrectly distinguishes two compatibility-equivalent strings #8

Closed
stedolan opened this issue May 20, 2017 · 2 comments
@stedolan

UAX #15 specifies that toNFKC(x) = toNFKC(toNFD(x)): since x and toNFD(x) are canonically equivalent, they are also compatibility equivalent and must have the same NFKC form.

The sequence:

U+ff80 U+1fd3 U+ff9e U+1fd3

is normalised to form NFD by uunf as:

U+ff80 U+3b9 U+308 U+301 U+ff9e U+3b9 U+308 U+301

and normalised to form NFKC as:

U+30c0 U+390 U+390

However, when the NFD form above is further normalised to form NFKC, uunf produces the following output, which differs from the NFKC form above:

U+30bf U+390 U+3099 U+390

(I found this example using crowbar. The full testcase is here)
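
For cross-checking, the same invariant can be run against an independent implementation. The following is a minimal sketch using Python's standard `unicodedata` module rather than uunf; the helper names `codepoints` and `nfkc_survives_nfd` are made up for this illustration.

```python
import unicodedata

def codepoints(s):
    """Format a string in the U+xxxx notation used in this issue."""
    return " ".join(f"U+{ord(c):x}" for c in s)

def nfkc_survives_nfd(s):
    """Check toNFKC(x) == toNFKC(toNFD(x)) for a single string."""
    direct = unicodedata.normalize("NFKC", s)
    via_nfd = unicodedata.normalize("NFKC", unicodedata.normalize("NFD", s))
    return direct == via_nfd

# The sequence from this report.
sample = "\uff80\u1fd3\uff9e\u1fd3"

print(codepoints(unicodedata.normalize("NFD", sample)))
print(codepoints(unicodedata.normalize("NFKC", sample)))
print(nfkc_survives_nfd(sample))
```

With `unicodedata`, both routes should yield U+30bf U+390 U+3099 U+390: after compatibility decomposition the starter U+3b9 sits between U+30bf and the voiced sound mark U+3099, so the two cannot compose into U+30c0.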

@dbuenzli dbuenzli added the bug label May 23, 2017
@stedolan
Author

I left the fuzzing job running for longer, and it found some more cases. The simplest one seems to be this:

[U+1c6 U+32d "dž̭"]

when normalised to form NFKD and then to form NFKC gives the correct (I think) NFKC form:

[U+64 U+17e U+32d "dž̭"]

However, if it's normalised to NFKC directly, then the COMBINING CIRCUMFLEX ACCENT BELOW gets incorrectly moved past the z and attaches to the d:

[U+1e13 U+17e "ḓž"]
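
To make the intermediate steps visible, here is a small sketch that prints the NFKD form and both NFKC routes for this input. It uses Python's standard `unicodedata` module as a reference, not uunf; the `codepoints` helper is a hypothetical name introduced here.

```python
import unicodedata

def codepoints(s):
    """Format a string in the U+xxxx notation used in this issue."""
    return " ".join(f"U+{ord(c):x}" for c in s)

text = "\u01c6\u032d"

# Compatibility decomposition plus canonical ordering: the mark below
# (ccc 220) sorts before the caron (ccc 230), both after the starter z.
nfkd = unicodedata.normalize("NFKD", text)
print(codepoints(nfkd))  # U+64 U+7a U+32d U+30c

# Composing from the NFKD form and composing directly should agree.
two_step = unicodedata.normalize("NFKC", nfkd)
direct = unicodedata.normalize("NFKC", text)
print(codepoints(two_step))  # U+64 U+17e U+32d
print(codepoints(direct))    # U+64 U+17e U+32d
```

In the NFKD form the circumflex below follows the starter z, so during composition it can only be tried against z, never against the earlier d.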

@dbuenzli
Owner

Thanks for the reduction. There was a bug in the (terrible) implementation of the canonical composition algorithm. It has been rewritten more cleanly, and hopefully correctly, in 9459c90.
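
For readers unfamiliar with the composition step, below is a rough Python sketch of canonical composition as UAX #15 describes it. It is not the code from the commit above and not uunf's API: `build_pairs` and `compose` are hypothetical names, the pair table is derived from Python's `unicodedata`, and algorithmic Hangul composition is left out for brevity.

```python
import sys
import unicodedata

def build_pairs():
    """Map (first, second) -> primary composite, derived from unicodedata."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility mapping
        parts = decomp.split()
        # Keep two-element canonical mappings whose composite is stable
        # under NFC; this filters out the composition exclusions.
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            pairs[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return pairs

def compose(text, pairs):
    """Compose a fully decomposed, canonically ordered string (no Hangul)."""
    out = []
    starter = None  # index in `out` of the most recent starter
    last_ccc = 0    # combining class of the last character kept in `out`
    for ch in text:
        ccc = unicodedata.combining(ch)
        if starter is not None:
            composite = pairs.get((out[starter], ch))
            # ch is blocked when a kept character after the starter has
            # ccc >= ccc(ch); last_ccc == 0 means ch directly follows it.
            if composite is not None and (last_ccc == 0 or last_ccc < ccc):
                out[starter] = composite
                continue
        if ccc == 0:
            starter = len(out)
        last_ccc = ccc
        out.append(ch)
    return "".join(out)

pairs = build_pairs()
decomposed = unicodedata.normalize("NFKD", "\u01c6\u032d")
print(compose(decomposed, pairs) == "d\u017e\u032d")
```

The point relevant to this issue is that a combining mark is only ever tried against the most recent starter (here z), so U+32d cannot reach back to the d.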
