GB 18030 2000 vs 2005 #22

Closed
jungshik opened this issue Dec 10, 2015 · 5 comments

@jungshik

This is the continuation of https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c11

I forgot to reply to @annevk's question there:

> Jungshik, do you mean you want to make the swap mentioned at the end of comment 5?

> GB 18030   -2005  -2000
> 0xA8BC     U+1E3F U+E7C7
> 0x8135F437 U+E7C7 U+1E3F

My answer would be yes. Chrome, Safari and Opera do that. Firefox and IE do not.

My goal is to minimize the number of PUA code points after decoding, partly because there will be no font support for those PUA code points on platforms like Android and iOS (and even on Windows 10 when additional fonts are installed for legacy compatibility; that is, old fonts like SimSun support them, but newer fonts like Microsoft YaHei do not).

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 lists them, and I had thought that a number of PUA code point mappings were dropped in GB 18030:2005 in favor of regular Unicode code points.

According to Masatoshi Kimura, it's only U+1E3F for 0xA8BC that moved out of the PUA area in GB 18030:2005, which is a big disappointment. (I wish GB 18030 had taken a step similar to the one HKSCS took when it comes to PUA.)

Anyway, at least one code point (0xA8BC <=> U+1E3F) should be mapped to a regular Unicode code point per GB18030:2005 instead of 2000.

@annevk
Member

annevk commented Dec 16, 2015

In terms of the standard, the proposal here is to replace (7533, 0xE7C7) in https://encoding.spec.whatwg.org/index-gb18030.txt with (7533, 0x1E3F). I would be okay with that. Paging @hsivonen and @travisleithead as a heads up.
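For reference, pointer 7533 is where the two-byte sequence 0xA8BC lands in index gb18030. The following is a minimal sketch of that arithmetic, assuming the two-byte pointer formula used by the gb18030 decoder in the Encoding Standard; the function name `two_byte_pointer` is hypothetical and not part of the standard.

```python
def two_byte_pointer(lead: int, trail: int) -> int:
    """Sketch: pointer into index gb18030 for a two-byte sequence."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    # Trail bytes skip 0x7F, so values above it are shifted down by one.
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


print(two_byte_pointer(0xA8, 0xBC))  # 7533
```

So the proposed change amounts to editing the single entry at that pointer.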

@annevk
Member

annevk commented Jan 6, 2016

@vyv03354 I don't understand #17 (comment) since it seems these code points round trip fine at the moment. Did you mean that if I make the change I suggested above we have a new problem unless I change something else too?

@vyv03354
Collaborator

vyv03354 commented Jan 6, 2016

If we only changed the mapping for 0xA8BC, the mapping table would no longer have U+E7C7. We should also change the mapping for 0x8135F437.
That said, it may not be a big deal because we already do not have U+E5E5.

@annevk
Member

annevk commented Jan 6, 2016

You're right. And I think we cannot simply adjust index gb18030 ranges, so we would have to hard-code it. 0x8135F437 becomes pointer 7457, so we could special-case that in https://encoding.spec.whatwg.org/#index-gb18030-ranges-code-point (simply return U+E7C7 for that pointer). We would then have to do the same in https://encoding.spec.whatwg.org/#index-gb18030-ranges-pointer if we wanted to keep round-tripping this code point (if code point is U+E7C7, return 7457).

So this would result in an uglier algorithm, but if you all think it's worth it that's fine with me.
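A rough sketch of what that special-casing could look like, assuming the usual four-byte pointer formula and a range lookup that finds the last table entry at or below the input. The function names are hypothetical, `RANGES` is a truncated two-entry stand-in for index gb18030 ranges (the real index has many more rows), and range validation is omitted.

```python
from bisect import bisect_right

# Truncated stand-in for index gb18030 ranges: (pointer, code point) pairs.
# Only two entries are listed here; results for other inputs are not real.
RANGES = [(0, 0x0080), (189000, 0x10000)]


def four_byte_pointer(b1: int, b2: int, b3: int, b4: int) -> int:
    """Pointer for a four-byte gb18030 sequence."""
    return ((b1 - 0x81) * 12600 + (b2 - 0x30) * 1260
            + (b3 - 0x81) * 10 + (b4 - 0x30))


def ranges_code_point(pointer: int, ranges=RANGES):
    """Pointer to code point, with the hard-coded exception."""
    if pointer == 7457:  # 0x8135F437 keeps decoding to U+E7C7
        return 0xE7C7
    i = bisect_right([p for p, _ in ranges], pointer) - 1
    if i < 0:
        return None
    offset, code_point_offset = ranges[i]
    return code_point_offset + pointer - offset


def ranges_pointer(code_point: int, ranges=RANGES):
    """Code point to pointer, mirroring the exception so U+E7C7 round-trips."""
    if code_point == 0xE7C7:
        return 7457
    i = bisect_right([c for _, c in ranges], code_point) - 1
    if i < 0:
        return None
    pointer_offset, offset = ranges[i]
    return pointer_offset + code_point - offset


pointer = four_byte_pointer(0x81, 0x35, 0xF4, 0x37)
print(pointer)                          # 7457
print(hex(ranges_code_point(pointer)))  # 0xe7c7
print(ranges_pointer(0xE7C7))           # 7457
```

The two early returns are the "uglier" part: without them, the generic offset arithmetic would apply to both inputs.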

annevk added a commit that referenced this issue Jan 6, 2016
This changes a single mapping in index gb18030 and special cases a
lookup in the “index gb18030 ranges code point” and “index gb18030
ranges pointer” algorithms.
@annevk
Member

annevk commented Jan 6, 2016

I created a PR for my proposal in #26. I would appreciate review before landing this.

@annevk annevk closed this as completed in e7b9ce0 Jan 20, 2016
annevk added a commit that referenced this issue Oct 4, 2024
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) is encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) is ignored, as it is already covered by index gb18030 ranges. (Presumably these mappings are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)
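The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the index is modeled as a plain pointer-to-code-point dictionary, and the sample values are placeholders rather than mappings taken from the recommendation file.

```python
def apply_recommendation(index, bidirectional, pua_to_pointer):
    """Return (new_index, encoder_only) without mutating the inputs."""
    new_index = dict(index)
    # Step 1: bidirectional mappings replace the existing PUA entries.
    for pointer, code_point in bidirectional.items():
        new_index[pointer] = code_point
    # Step 2: the old PUA code points are kept for encoding only.
    encoder_only = dict(pua_to_pointer)
    # Step 3: the bytes-to-code-point set is deliberately not applied.
    return new_index, encoder_only


def encode_pointer(code_point, index, encoder_only):
    """Look up a pointer, consulting the encoder-only table first."""
    if code_point in encoder_only:
        return encoder_only[code_point]
    for pointer, mapped in index.items():
        if mapped == code_point:
            return pointer
    return None


# Placeholder data, NOT real GB18030-2022 mappings.
old_index = {100: 0xE000}
new_index, encoder_only = apply_recommendation(
    old_index,
    bidirectional={100: 0x4E00},
    pua_to_pointer={0xE000: 100},
)
```

With this placeholder data, pointer 100 decodes to the non-PUA code point, while both the new code point and the old PUA code point still encode to pointer 100.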

The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is excluded for GBK only at the moment, including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.

This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it had not done until now.