normalization does not commute with case-folding? #257

Closed
stevengj opened this issue Dec 8, 2023 · 2 comments
stevengj commented Dec 8, 2023

I noticed an odd case in JuliaLang/julia#52408 (comment):

julia> using Unicode: normalize

julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J"

julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false

julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false

(The Julia Unicode.normalize function calls utf8proc, and defaults to NFC normalization.)

Not sure if this is a bug or just a weird behavior of Unicode. Would be good to try it out with ICU or some other library.
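One way to narrow this down, sketched in Python with only the standard library rather than through utf8proc: look for combining marks in the string whose case folding contains a starter (canonical combining class 0). Normalization can reorder such a mark before folding, but once folded it blocks reordering across it. The helper name `marks_folding_to_starters` is made up for illustration.

```python
import unicodedata

# Same code points as the Julia string above, written with Python escapes.
s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"

def marks_folding_to_starters(text):
    """Return combining marks in `text` whose case folding contains a starter (ccc 0).

    Hypothetical helper for diagnosis only; not part of any library.
    """
    found = []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            continue  # already a starter, folding cannot change how marks reorder around it
        folded = ch.casefold()
        if any(unicodedata.combining(c) == 0 for c in folded):
            found.append(ch)
    return found

suspects = marks_folding_to_starters(s)
for ch in suspects:
    folded = " ".join(f"U+{ord(c):04X}" for c in ch.casefold())
    print(f"U+{ord(ch):04X} (ccc {unicodedata.combining(ch)}) folds to {folded}")
```

For this string the scan should flag only U+0345 (COMBINING GREEK YPOGEGRAMMENI), which folds to U+03B9, an ordinary Greek small iota.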

stevengj commented Dec 8, 2023

I get something similar in Python 3:

>>> import unicodedata
>>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold()
False
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())
False

So I guess this is a weird quirk of Unicode?
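A smaller reproduction, assuming the culprit is U+0345: it is a combining mark with combining class 240, but it case-folds to U+03B9, which is a starter. The three-character input `t` below is a made-up minimal case, not taken from the original report. As far as I can tell, the Unicode Standard anticipates this in its definition of canonical caseless matching (section 3.13), which normalizes both before and after folding; `canonical_caseless_key` is a hypothetical helper name sketching that recipe.

```python
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

def nfd(text):
    return unicodedata.normalize("NFD", text)

# Hypothetical minimal input: starter, U+0345 (ccc 240), combining acute (ccc 230).
t = "a\u0345\u0301"

# Fold first: U+0345 becomes the starter U+03B9, so the acute attaches to the iota.
fold_then_nfc = nfc(t.casefold())
# Normalize first: the acute is reordered before U+0345 and composes with "a".
nfc_then_fold = nfc(t).casefold()

print([f"U+{ord(c):04X}" for c in fold_then_nfc])
print([f"U+{ord(c):04X}" for c in nfc_then_fold])
print(fold_then_nfc == nfc_then_fold)

def canonical_caseless_key(text):
    """Sketch of NFD(toCasefold(NFD(X))): normalize, fold, normalize again."""
    return nfd(nfd(text).casefold())

# The string from the report.
s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"

# Canonically equivalent inputs yield the same key, because both are
# brought to NFD before folding.
print(canonical_caseless_key(s) == canonical_caseless_key(nfc(s)))
```

Comparing keys built this way avoids the order dependence, at the cost of an extra normalization pass.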

@StefanKarpinski

That's quite unfortunate. Seems like exactly the kind of thing the Unicode Consortium is supposed to think through and avoid.
