normalization does not commute with case-folding? #257

stevengj · 2023-12-08T03:39:19Z

I noticed an odd case in JuliaLang/julia#52408 (comment):

julia> using Unicode: normalize

julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J"

julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false

julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false

(The Julia Unicode.normalize function calls utf8proc, and defaults to NFC normalization.)

I'm not sure whether this is a utf8proc bug or just weird behavior of Unicode itself. It would be good to try it with ICU or some other library.


stevengj · 2023-12-08T03:44:11Z

I get something similar in Python 3:

>>> import unicodedata
>>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold()
False
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())
False

So I guess this is a weird quirk of Unicode?
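One possible explanation, offered as a sketch and not as a confirmed diagnosis from this thread: NFC's canonical reordering depends on each character's canonical combining class (ccc), so any character whose case-folded form has a different ccc can end up in a different position depending on which step runs first. The helper below (`ccc_changes` is a hypothetical name, not part of any library) uses only the standard `unicodedata` module to list such characters in the test string.

```python
import unicodedata

# The test string from the examples above.
s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"

def ccc_changes(text):
    """Return (char, folded, ccc_before, ccc_after) for every character
    whose case-folded form has different canonical combining classes."""
    out = []
    for ch in text:
        folded = ch.casefold()
        before = unicodedata.combining(ch)
        after = [unicodedata.combining(c) for c in folded]
        if after != [before]:
            out.append((ch, folded, before, after))
    return out

changes = ccc_changes(s)
for ch, folded, before, after in changes:
    folded_cps = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"U+{ord(ch):04X} (ccc {before}) folds to {folded_cps} (ccc {after})")
```

On this string the only character reported is U+0345 COMBINING GREEK YPOGEGRAMMENI (ccc 240), which folds to U+03B9 GREEK SMALL LETTER IOTA (ccc 0). Folding first turns it into a starter that stays where it is and splits the run of combining marks, while normalizing first sorts it to the end of the run before it is folded.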

StefanKarpinski · 2023-12-19T12:49:57Z

That's quite unfortunate. Seems like exactly the kind of thing the Unicode Consortium is supposed to think through and avoid.

stevengj closed this as completed Dec 8, 2023

stevengj mentioned this issue Dec 8, 2023

isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false JuliaLang/julia#52408

Closed
