Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support GB18030-2022 #335

Closed
wants to merge 1 commit into from
Closed

Support GB18030-2022 #335

wants to merge 1 commit into from

Conversation

annevk
Copy link
Member

@annevk annevk commented Sep 17, 2024

One legacy encoding was updated and relevant regulation requires software to match. As such the Encoding Standard should match as well. This aims to make the minimum number of changes necessary and does not impact GBK, only gb18030.

Updated tests are upstreamed in web-platform-tests/wpt#48221. (See also the comment below.)

  • At least two implementers are interested (and none opposed):
    • WebKit
  • Tests are written and can be reviewed and commented upon at:
    • See above.
  • Implementation bugs are filed:
    • Chromium: …
    • Gecko: …
    • WebKit: N/A
    • Deno: …
    • Node.js: …
  • MDN issue is filed: N/A
  • The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)


Preview | Diff

One legacy encoding was updated and relevant regulation requires software to match. As such the Encoding Standard should match as well. This aims to make the minimum number of changes necessary and does not impact GBK, only gb18030.

Updated tests are in https://github.com/WebKit/WebKit/tree/main/LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030. If these are deemed satisfactory they will be exported.
@annevk
Copy link
Member Author

annevk commented Sep 17, 2024

The tests are upstreamed in web-platform-tests/wpt#48221 and I noticed there that Chromium previously imported WebKit's changes and exported them to web-platform-tests. This was done by @yoshisatoyanagisawa and @tkent-google: web-platform-tests/wpt@2ddb8af.

Thus it seems Chromium has partial compliance with these changes. But not the most recent changes where WebKit (once again) aligned with the Unicode recommendation around gb18030.

One thing that might be good to test is that GBK is not impacted by these new mappings. (At least, I've been assuming we don't want that.)

@achristensen07 @ricea @hsivonen @domenic thoughts and review appreciated!

@annevk annevk added normative i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. addition/proposal New features or enhancements labels Sep 17, 2024
Comment on lines +2356 to +2359
<p>If <a for="gb18030 decoder">is GBK</a> is false and there is a row in the table below whose
first column is <a>gb18030 first</a>, second column is <a>gb18030 second</a>, third column is
<a>gb18030 third</a>, and fourth column is <var>byte</var>, then set <var>code point</var> to
the fifth column on the same row:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on test results in Firefox I calculated the first override here and it appears that this is already the existing behavior of gb18030 as written in the standard. As such this table isn't needed. It's not clear to me why the Unicode recommendation calls out these four-byte sequences mapping to code points if that matches existing behavior.

Comment on lines +2695 to +2697
<p>If <a for="gb18030 encoder">is GBK</a> is false and there is a row in the table below whose
first column is <var>code point</var>, then return the two bytes on the same row listed in the
second column:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we layer this on top of the existing gb18030 index, we can remove all the U+EXXX PUA entries below as they are already in that table and map to the correct bytes.

An implementation that updates that table would need these U+EXXX mappings though, as evidenced by WebKit. (Which I assume has a completely separate implementation for GBK.)

With those U+EXXX mappings removed however it probably makes the most sense to create a small "gb18030 2022 index" which can be reused across the decoder and encoder as a special case when GBK is not in use. Bit more work, but overall would present the data more neatly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WebKit (and presumably Chromium) having updated the table (as evidenced by vitorroriz/WebKit@b7d4f07) means GBK is impacted at the moment as well. In particular for the symmetrical mappings at the top of #312 (comment). What do we want to do for GBK? Perhaps it's okay if that gets slightly less PUA as well? @hsivonen what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on Matrix, given that WebKit and Chromium did not get complaints for decoding less PUA for GBK, let's keep GB18030 and GBK aligned even though it means diverging from Windows Code Page 936.

Copy link

@StoneChi8 StoneChi8 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GBK-1995 never been an offical standard, although it asign some characters in U+E8xx for GB+FExx, and 52 Chinese characters has been replaced by GB18030-2000 for unicode extension A, remained GB code GB+FExx unchanged, ex. GB+FE9F【䶮】 mapped to U+4DAE instead of U+E863. For information interchange, we should use official GB18030-2022 mapping table enven in GBK quotation, in order to drop these duplicate unicode code to those same GB 2 bytes characters.
Windows CP 936 method is wrong way to reach the GB18030 standard, i.e. remained U+E8xx characters in ttf font file (Source Han Sans & iOS never done these),convert program using these PUA characters and assign 0x3F to 4 bytes GB18030 characters.
On the other hand, full BMP PUA code range is U+E000-U+F8FF, and SMP U+10000-U+10FFFF mapping to GB18030 is a GB+90308130~ only, no mapping table need for programming and future GB18030 amendments.
See detail in https://zhuanlan.zhihu.com/p/661610604 for WAHTWG GB18030 convert program(in Chinese).

annevk added a commit that referenced this pull request Sep 18, 2024
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a matter suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is because Chromium and WebKit have already code in the wild that impacts GBK to some degree (although the encoder only table is excluded for GBK only at the moment, including that would make the most sense compatibility-wise) and no fallout has been recorded. Additionally GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. The aim is to complete that with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.
@annevk annevk mentioned this pull request Sep 18, 2024
5 tasks
@annevk
Copy link
Member Author

annevk commented Sep 18, 2024

Closing this in favor of #336.

@annevk annevk closed this Sep 18, 2024
@annevk annevk deleted the gb18030-2022 branch September 18, 2024 13:23
annevk added a commit that referenced this pull request Oct 4, 2024
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a matter suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is because Chromium and WebKit have already code in the wild that impacts GBK to some degree (although the encoder only table is excluded for GBK only at the moment, including that would make the most sense compatibility-wise) and no fallout has been recorded. Additionally GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.

This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005) which it until now did not.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
addition/proposal New features or enhancements i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. normative
Development

Successfully merging this pull request may close these issues.

3 participants