GB 18030 2000 vs 2005 #22
Comments
In terms of the standard, the proposal here is to replace (7533, 0xE7C7) in https://encoding.spec.whatwg.org/index-gb18030.txt with (7533, 0x1E3F). I would be okay with that. Paging @hsivonen and @travisleithead as a heads up.
@vyv03354 I don't understand #17 (comment) since it seems these code points round trip fine at the moment. Did you mean that if I make the change I suggested above we have a new problem unless I change something else too?
If we only changed the mapping for 0xA8BC, the mapping table would no longer contain U+E7C7. We should also change the mapping for 0x8135F437.
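For reference, the pointers behind these two byte sequences can be derived with the usual gb18030 decoder arithmetic. The sketch below is only an illustration; the function names are made up for this comment and are not part of the standard.

```python
def two_byte_pointer(lead: int, trail: int) -> int:
    """Pointer into index gb18030 for a two-byte sequence."""
    # Trail bytes below 0x7F use offset 0x40, the rest use 0x41.
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


def four_byte_pointer(b1: int, b2: int, b3: int, b4: int) -> int:
    """Pointer into index gb18030 ranges for a four-byte sequence."""
    return ((b1 - 0x81) * (10 * 126 * 10)
            + (b2 - 0x30) * (10 * 126)
            + (b3 - 0x81) * 10
            + (b4 - 0x30))


print(two_byte_pointer(0xA8, 0xBC))               # 7533
print(four_byte_pointer(0x81, 0x35, 0xF4, 0x37))  # 7457
```

So 0xA8BC is the index entry at pointer 7533, and 0x8135F437 is pointer 7457 in the ranges lookup.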
You're right. And I don't think we can simply adjust index gb18030 ranges, so we would have to hard-code it. 0x8135F437 becomes pointer 7457, so we could special-case that in https://encoding.spec.whatwg.org/#index-gb18030-ranges-code-point (simply return U+E7C7 for that pointer). We would then have to do the same in https://encoding.spec.whatwg.org/#index-gb18030-ranges-pointer if we wanted to keep round-tripping this code point (if code point is U+E7C7, return 7457). So this would result in an uglier algorithm, but if you all think it's worth it, that's fine with me.
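A minimal sketch of what that special-casing could look like, assuming the ranges table is a sorted list of (pointer, code point) pairs. `TOY_RANGES` is a made-up three-entry stand-in, not the real index gb18030 ranges, and the sketch leaves out the pointer bounds checks a real implementation would need.

```python
from bisect import bisect_right

# Hypothetical stand-in for index gb18030 ranges: (pointer, code point),
# sorted ascending. The real table has many more entries.
TOY_RANGES = [(0, 0x0080), (36, 0x00A5), (189000, 0x10000)]


def ranges_code_point(pointer, ranges=TOY_RANGES):
    """Code point for a four-byte pointer, with the U+E7C7 special case."""
    if pointer == 7457:
        return 0xE7C7  # hard-coded: 0x8135F437 keeps decoding to U+E7C7
    # Last entry whose pointer is <= the given pointer.
    i = bisect_right([p for p, _ in ranges], pointer) - 1
    offset, code_point_offset = ranges[i]
    return code_point_offset + pointer - offset


def ranges_pointer(code_point, ranges=TOY_RANGES):
    """Pointer for a code point (assumed >= U+0080), mirroring the above."""
    if code_point == 0xE7C7:
        return 7457  # hard-coded so U+E7C7 still round-trips
    # Last entry whose code point is <= the given code point.
    i = bisect_right([c for _, c in ranges], code_point) - 1
    pointer_offset, offset = ranges[i]
    return pointer_offset + code_point - offset
```

With both special cases in place, `ranges_pointer(ranges_code_point(7457))` returns 7457 again regardless of what the surrounding range entries say.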
This changes a single mapping in index gb18030 and special cases a lookup in the “index gb18030 ranges code point” and “index gb18030 ranges pointer” algorithms.
I created a PR for my proposal in #26. I would appreciate review before landing this.
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is excluded for GBK at the moment; including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312. This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it did not do until now.
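To illustrate how a bidirectional index entry and an encoder-only entry differ, here is a rough sketch. The two one-entry tables are placeholders chosen for illustration and have not been checked against the attached text file; the function names are likewise hypothetical.

```python
# Placeholder data, NOT verified against the recommendation's text file.
# INDEX is used for both decoding and encoding; ENCODER_ONLY only for encoding.
INDEX = {7182: 0xFE10}         # pointer -> code point (bidirectional)
ENCODER_ONLY = {0xE78D: 7182}  # PUA code point -> pointer (encode only)


def decode_pointer(pointer):
    """Decoding consults only the main index."""
    return INDEX.get(pointer)


def encode_code_point(code_point):
    """Encoding consults the encoder-only table first, then the main index."""
    if code_point in ENCODER_ONLY:
        return ENCODER_ONLY[code_point]
    for pointer, mapped in INDEX.items():
        if mapped == code_point:
            return pointer
    return None  # a real encoder would fall through to the ranges lookup


# Both code points encode to the same pointer, but the pointer decodes to
# only one of them, so the PUA code point no longer round-trips.
print(encode_code_point(0xE78D) == encode_code_point(0xFE10))
print(decode_pointer(7182) == 0xFE10)
```

The effect is that content still containing the old PUA code point keeps encoding to the same bytes, while decoding those bytes yields the non-PUA code point.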
This is the continuation of https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c11
I forgot to reply to @annevk's question there:
My answer would be yes. Chrome, Safari and Opera do that. Firefox and IE do not.
My goal is to minimize the number of PUA code points after decoding, partly because there will be no font support for those PUA code points on platforms like Android and iOS (and even on Windows 10 when additional fonts are installed for legacy compatibility). That is, old fonts like Simsun support them, but newer fonts like Microsoft Yahei do not.
https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 lists them, and I thought there were a bunch of PUA code point mappings that were dropped in GB 18030:2005 in favor of regular Unicode code points.
According to Masatoshi Kimura, only U+1E3F for 0xA8BC moved out of the PUA in GB 18030:2005, which is a big disappointment. (I wish GB18030 had taken a step similar to the one HKSCS took when it comes to the PUA.)
Anyway, at least one code point (0xA8BC <=> U+1E3F) should be mapped to a regular Unicode code point per GB18030:2005 instead of 2000.