GB 18030 2000 vs 2005 #22

jungshik · 2015-12-10T21:33:26Z

This is the continuation of https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c11

I forgot to reply to @annevk's question there:

Jungshik, do you mean you want to make the swap mentioned at the end of comment 5?

> GB 18030   -2005  -2000
> 0xA8BC     U+1E3F U+E7C7
> 0x8135F437 U+E7C7 U+1E3F

My answer would be yes. Chrome, Safari and Opera do that. Firefox and IE do not.

My goal is to minimize the number of PUA code points after decoding, partly because there will be no font support for those PUA code points on platforms like Android and iOS (and even on Windows 10 when additional fonts are installed for legacy compatibility; that is, old fonts like SimSun support them, but newer fonts like Microsoft YaHei do not).

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 lists them, and I had thought that a bunch of PUA code point mappings were dropped in GB 18030:2005 in favor of regular Unicode code points.

According to Masatoshi Kimura, it's only U+1E3F for 0xA8BC that moved out of the PUA area in GB 18030:2005, which is a big disappointment. (I wish GB18030 had taken a step similar to the one HKSCS took when it comes to PUA.)

Anyway, at least one code point (0xA8BC <=> U+1E3F) should be mapped to a regular Unicode code point per GB18030:2005 instead of 2000.


annevk · 2015-12-16T15:05:16Z

In terms of the standard, the proposal here is to replace (7533, 0xE7C7) in https://encoding.spec.whatwg.org/index-gb18030.txt with (7533, 0x1E3F). I would be okay with that. Paging @hsivonen and @travisleithead as a heads up.
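For reference, a minimal sketch of where pointer 7533 comes from, assuming the two-byte pointer formula used by the Encoding Standard's gb18030 decoder. The function name and the one-entry dictionaries are hypothetical stand-ins for illustration, not the actual index file.

```python
def index_pointer(lead: int, trail: int) -> int:
    """Pointer into index gb18030 for a two-byte sequence.

    Assumes the decoder formula: (lead - 0x81) * 190 + (trail - offset),
    where offset is 0x40 for trail bytes below 0x7F and 0x41 otherwise.
    """
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


# Hypothetical one-entry excerpts of the index, before and after the change.
index_before = {7533: 0xE7C7}  # GB18030-2000 style: PUA code point
index_after = {7533: 0x1E3F}   # GB18030-2005 style: regular code point

pointer = index_pointer(0xA8, 0xBC)
print(pointer, hex(index_before[pointer]), hex(index_after[pointer]))
```

The byte sequence 0xA8 0xBC resolves to the same pointer either way; only the code point stored at that pointer changes.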

annevk · 2016-01-06T12:02:00Z

@vyv03354 I don't understand #17 (comment) since it seems these code points round trip fine at the moment. Did you mean that if I make the change I suggested above we have a new problem unless I change something else too?

vyv03354 · 2016-01-06T12:25:38Z

If we only changed the mapping for 0xA8BC, the mapping table would no longer have U+E7C7. We should also change the mapping for 0x8135F437.
That said, it may not be a big deal because we already do not have U+E5E5.

annevk · 2016-01-06T12:57:44Z

You're right. And I think we cannot simply adjust index gb18030 ranges, so we would have to hard-code it. 0x8135F437 becomes pointer 7457, so we could special-case that in https://encoding.spec.whatwg.org/#index-gb18030-ranges-code-point (simply return U+E7C7 for that pointer). And then we would have to do the same in https://encoding.spec.whatwg.org/#index-gb18030-ranges-pointer if we wanted to keep round-tripping this code point (if code point is U+E7C7, return 7457).

So this would result in an uglier algorithm, but if you all think it's worth it that's fine with me.
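A rough sketch of the special-casing described above, assuming the four-byte pointer formula from the gb18030 decoder. All function names are hypothetical, and the two `unpatched_*` helpers are stubs standing in for the real range-table lookups, which are not reproduced here.

```python
from typing import Callable, Optional

SPECIAL_POINTER = 7457       # pointer for bytes 0x81 0x35 0xF4 0x37
SPECIAL_CODE_POINT = 0xE7C7  # PUA code point that should keep round-tripping


def four_byte_pointer(b1: int, b2: int, b3: int, b4: int) -> int:
    """Pointer into index gb18030 ranges for a four-byte sequence."""
    return (b1 - 0x81) * 12600 + (b2 - 0x30) * 1260 + (b3 - 0x81) * 10 + (b4 - 0x30)


def ranges_code_point(pointer: int,
                      table_lookup: Callable[[int], Optional[int]]) -> Optional[int]:
    """Pointer to code point, with the hard-coded exception checked first."""
    if pointer == SPECIAL_POINTER:
        return SPECIAL_CODE_POINT
    return table_lookup(pointer)


def ranges_pointer(code_point: int,
                   table_lookup: Callable[[int], Optional[int]]) -> Optional[int]:
    """Code point to pointer, mirroring the exception so encoding round-trips."""
    if code_point == SPECIAL_CODE_POINT:
        return SPECIAL_POINTER
    return table_lookup(code_point)


# Stub lookups (assumptions): the unmodified table would give U+1E3F for
# pointer 7457, matching the GB18030-2000 column quoted in this thread.
def unpatched_code_point(pointer: int) -> Optional[int]:
    return 0x1E3F if pointer == 7457 else None


def unpatched_pointer(code_point: int) -> Optional[int]:
    return None


p = four_byte_pointer(0x81, 0x35, 0xF4, 0x37)
print(p, hex(ranges_code_point(p, unpatched_code_point)))
```

Checking the exception before the table lookup keeps the range table itself untouched, at the cost of one extra comparison in each direction.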

This changes a single mapping in index gb18030 and special cases a lookup in the “index gb18030 ranges code point” and “index gb18030 ranges pointer” algorithms.

annevk · 2016-01-06T17:04:04Z

I created a PR for my proposal in #26. I would appreciate review before landing this.

This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. Encodes the second set of 18 mappings (from PUA to bytes) as an encoder-only table, for both GBK and gb18030.
3. Ignores the third set of 18 mappings (from bytes to code points), as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is excluded for GBK only at the moment, including that would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312. This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it until now did not.

vyv03354 mentioned this issue Dec 16, 2015

"gb18030 ranges" have problematic definitions #17

Closed

annevk added the needsinput label Jan 6, 2016

annevk added a commit that referenced this issue Jan 6, 2016

Fix #22: align with GB18030-2005

dd87ace

This changes a single mapping in index gb18030 and special cases a lookup in the “index gb18030 ranges code point” and “index gb18030 ranges pointer” algorithms.

annevk added a commit that referenced this issue Jan 6, 2016

Fix #22: align with GB18030-2005

f108f60


annevk added a commit that referenced this issue Jan 6, 2016

Fix #22: align with GB18030-2005

f1752c8


annevk closed this as completed in e7b9ce0 Jan 20, 2016

r12a mentioned this issue Feb 3, 2016

[questions/qa-choosing-encodings] warning about PUA and Shift_JIS and GB18030 w3c/i18n-drafts#12

Open

This was referenced Sep 10, 2016

If gb18030 is revised, consider aligning the Encoding Standard #27

Closed

gb18030 encoding/decoding support #57

Closed

r12a mentioned this issue Mar 14, 2018

Editorial: Charset alias matching link #134

Closed

renovate bot mentioned this issue May 28, 2021

Update dependency iconv-lite to ^0.6.0 - autoclosed yetzt/node-scrpr#6

Closed

1 task

This was referenced Jun 6, 2021

Update dependency iconv-lite to ^0.6.0 - autoclosed aoisupersix/vscode-bve5-language-support#73

Closed

Update dependency iconv-lite to ^0.6.0 zce/node-xtemplate#11

Open

Update dependency iconv-lite to ^0.6.0 quocphien90/cli-kintone-nodejs#29

Open
