COMBINING GREEK YPOGEGRAMMENI case-folding #3
Comments
Chapter 3 of the Unicode standard says:
It seems like this issue is directly related to making use of the data in SpecialCasing.txt.
@jiahao, it seems like SpecialCasing.txt is more specific to upper/lower/titlecase rules, rather than casefolding per se.
In case it’s useful, here is the original bug report I sent to Jan (reply, reply, reply), and my test case:

    #include <stdio.h>
    #include <stdlib.h>   // free()
    #include <utf8proc.h>

    int main(void)
    {
        // U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂); the cast avoids a char/unsigned char pointer mismatch
        const unsigned char *in = (const unsigned char *)"\xcf\x89\xcd\x85\xcd\x82";
        unsigned char *out;

        // Case-fold and decompose the raw input in a single call.
        utf8proc_map(in, 0, &out,
                     UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE | UTF8PROC_NULLTERM);
        printf("%s\n", out); // Wrong: U+03C9 U+03B9 U+0342 (ω ι+◌͂)
        free(out);

        // Normalize to NFD first, then case-fold and decompose.
        unsigned char *nfd = utf8proc_NFD(in);
        utf8proc_map(nfd, 0, &out,
                     UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE | UTF8PROC_NULLTERM);
        printf("%s\n", out); // Right: U+03C9 U+0342 U+03B9 (ω+◌͂ ι)
        free(out);
        free(nfd);
        return 0;
    }

Also, a hilarious graph.
The U+0345 combining character needs special handling, according to Jan Behrens (utf8proc author). In particular, you apparently need to do normalization both before and after case-folding (if you are doing normalization+casefolding on a string containing this character).
As a first pass, I'm not sure it's worth trying to solve this in a super-efficient manner. Just set a flag if the character is found (during decomposition?), and then run a second normalization pass after/before case-folding if necessary.
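To make the flag idea concrete, here is a minimal sketch of the check on its own. It assumes the flag is computed over a buffer of already-decomposed code points, so that precomposed characters whose canonical decomposition contains U+0345 (for example U+1FC3) are caught as well. The name `needs_extra_normalization` is hypothetical and is not part of utf8proc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define YPOGEGRAMMENI 0x0345 /* COMBINING GREEK YPOGEGRAMMENI */

/* Hypothetical helper, not a utf8proc function.
 * Scans a buffer of decomposed code points and reports whether U+0345
 * occurs, i.e. whether the extra normalization pass around case folding
 * would be needed for this string. */
bool needs_extra_normalization(const int32_t *codepoints, size_t len)
{
    assert(codepoints != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (codepoints[i] == YPOGEGRAMMENI)
            return true;
    }
    return false;
}
```

A caller would run the additional NFD pass only when this returns true, so strings without U+0345 (presumably the vast majority) pay only for one linear scan, which could also be folded into the decomposition loop itself.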