Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gb18030 encoding/decoding support #57

Closed
r12a opened this issue Jun 20, 2016 · 23 comments
Closed

gb18030 encoding/decoding support #57

r12a opened this issue Jun 20, 2016 · 23 comments
Labels

Comments

@r12a
Copy link
Collaborator

r12a commented Jun 20, 2016

Results for a series of tests for gb18030 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030

The tests can be run from that page (select the link in the left-most column) or get the tests from the WPT repo. There is a PR at
web-platform-tests/wpt#3195

The test check whether:

  1. the browser produces the expected byte sequences for all characters in the gb18030 index after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
  2. the browser produces the expected byte sequences for miscellaneous characters not in the gb18030 index when encoding bytes for a URL produced by a form, using the encoder steps in the specification. (tests for several ranges)
  3. same two types of test when writing characters to an href value
  4. the browser decodes all characters as expected from a file generated by encoding all pointers in the gb18030 index per the encoder steps in the specification.
  5. the browser decodes all characters as expected from a file generated by encoding miscellaneous characters not in the gb18030 index per the encoder steps in the specification. (tests for several ranges)
  6. when decoding gb18030 text, the browser uses replacement characters as described by the algorithm in the Encoding spec.

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

screen shot 2016-06-23 at 19 41 14

Notes:

  • all href tests fail for Edge because characters are not converted to percent-escapes

Can we please investigate the failures to ascertain whether:

  1. the browser needs to be changed
  2. the spec needs to be changed
  3. the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

@r12a
Copy link
Collaborator Author

r12a commented Jun 23, 2016

I updated the table. Something went wrong with the scoring for Safari in the previous version. Should now reflect reality.

@r12a
Copy link
Collaborator Author

r12a commented Sep 15, 2016

@inexorabletash
Copy link
Member

Note to self: Chrome shows the 3 failures for decode errors, but when the API (TextDecoder) is used it produces the the correct result. I believe this is because Chrome doesn't "flush" at the end of resource (i.e. non-API) streams... might be the same in other browsers.

@jungshik
Copy link

FYI, Chromium may soon change the decoding table to map 28 byte sequences that used to be mapped to PUA code points (completely useless and even harmful on platforms where there's no font to cover those PUA code points) to regular Unicode characters (see #22, #27 and http://crbug.com/645783 ).

@annevk annevk added the tests label Nov 16, 2016
@r12a
Copy link
Collaborator Author

r12a commented Nov 18, 2016

wrt Safari support for GB 18030 (and probably other encodings), there's a discussion at https://bugs.webkit.org/show_bug.cgi?id=159891 that people following the Encoding issues may be able to help with.

@inexorabletash
Copy link
Member

I'm loathe to jump in on that bug, but ISTM the answer for WebKit should be "normalize on input". Once it's in the DOM, normalization should not happen, as @r12a points out.

@annevk
Copy link
Member

annevk commented Nov 18, 2016

I added a comment to that effect.

@hsivonen
Copy link
Member

The tests seem to disargee with the spec on the handling of ASCII bytes as part of a malformed sequence when decoding:
Fail step 2: 82 30 C3 assert_equals: expected "�" but got "�0�"
Fail step 5.7: 82 FF C3 33 assert_equals: expected "��" but got "��3"
Fail step 9: FF 30 C3 33 assert_equals: expected "�0�" but got "�0�3"

@r12a
Copy link
Collaborator Author

r12a commented Jun 15, 2017

@hsivonen just so you know, i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere. I've had even less time than normal lately because of various distractions.

@hsivonen
Copy link
Member

i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere.

Great. Thank you.

@domenic
Copy link
Member

domenic commented Jun 15, 2017

So, I've been helping rebase @r12a's pull requests, fix lint errors, and address some review comments from web-platform-tests/wpt#3194 that apply to all of them. I am happy to continue doing that as I have it down to a pretty fast process. Which means if @r12a can just edit his remaining WPT PRs with the normative changes and then ping the appropriate thread, I am happy to carry things through to the finish line. Woohoo!

@r12a
Copy link
Collaborator Author

r12a commented Aug 15, 2017

@hsivonen wrt #57 (comment), i have stepped through the conversion for the first two tests you mentioned several times, using the debugger alongside the spec text, and i still come up with the results expected by the test, rather than the results i get from Firefox (nightly). Are you able to point out for me why the test produces a different result from FF?

Here's a link to the test: https://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-decode-errors.html Thanks.

@annevk
Copy link
Member

annevk commented Aug 16, 2017

I looked at the step 2 test. Below the iterations and the results:

  1. gb18030 first = 0x82
  2. gb18030 first = 0x82, gb18030 second = 0x30
  3. gb18030 first = 0x82, gb18030 second = 0x30, gb18030 third = 0xC3
  4. Conditional in step 2 is true
  5. Conditional in step 1 is true

So @r12a is correct.

@r12a
Copy link
Collaborator Author

r12a commented Oct 4, 2017

@hsivonen ping wrt #57 (comment)

@hsivonen
Copy link
Member

hsivonen commented Nov 6, 2017

Sorry about the delay.

Firefox, Chrome and Safari agree with each other on the 3 remaining failures. Edge is closer to the other browses than to the spec.

So I think this is a spec bug. (And the tests reflect the spec.)

@hsivonen
Copy link
Member

hsivonen commented Nov 6, 2017

Hmm. The Firefox situation might be confused somewhere between the Chrome/Safari behavior and the spec behavior. I need to investigate this more.

@hsivonen
Copy link
Member

hsivonen commented Nov 7, 2017

OK. Here's what's happening:

Firefox implements the spec, but the test case doesn't test the spec. The test case expectations are written as if there was EOF after each examined sequence. However, the test input is not TextDecoder calls with just the sequences tested (each ending in EOF). Instead, it's an HTML file containing all the sequences within spans, so what comes after each test sequence isn't an EOF but a less-than sign (of the span end tag).

I will need to test what Chrome, Safari and Edge do when the sequences actually end in EOF, but my tentative opinion is that it's bad for the spec collapse a bogus sequence of bytes to a different output when the bogus sequence is followed by EOF vs. when it's followed by something else that's not a valid continuation of the sequence.

@hsivonen
Copy link
Member

hsivonen commented Nov 7, 2017

I wrote demos that exercise both the followed-by-end-tag case and the followed-by-EOF case.

The spec, Firefox, Chrome and Safari agree on these, so I think it's the best not to change the spec even though it is rather unfortunate for the treatment of the bogus byte sequence to differs depending on what comes after.

In conclusion, this is a test case bug after all.

@annevk
Copy link
Member

annevk commented Apr 25, 2018

@r12a are you planning on updating the tests?

@peteroupc
Copy link
Contributor

peteroupc commented Jul 8, 2018

Some time ago, the Encoding Standard started mapping the two bytes "0xA3 0xA0" to "U+3000" rather than U+E5E5 "to be compatible with deployed content".

Do the benefits of this mapping still outweigh the disadvantages even today? The answer depends largely on—

  • how commonly GB18030 (as opposed to other encodings) is used in Web sites today compared to when the mapping was made,
  • whether, today compared to when the mapping was made, the GB18030 bytes "0xA3 0xA0" are still commonly used instead of the proper mapping for U+3000 to represent a space character in Web sites,
  • how frequently Web sites that use "0xA3 0xA0" instead of the proper mapping for U+3000 are seen today compared to when the mapping was made,
  • whether GB18030 encoders/decoders that convert "0xA3 0xA0" to or from U+3000 (besides implementations of the Encoding Standard) are still widely deployed today,
  • the effort required for Encoding Standard implementations to update their GB18030 mapping tables,
  • the need and desirability to comply strictly with GB18030-2005, and
  • other factors weighing for or against this mapping.

@inexorabletash
Copy link
Member

Given that Firefox/Chrome/Safari have the same behavior, we'd need data indicating that changing implementations strongly improves compatibility with deployed content.

This is a variation of your 2nd point. If the number of sites using "0xA3 0xA0" intending U+E5E5 has significantly increased, it would be a consideration.

I don't think the other points would be directly relevant to implementors making a decision.

@annevk
Copy link
Member

annevk commented Oct 17, 2018

Same problem as with the gbk tests, upstreaming never completed: web-platform-tests/wpt#20361.

@annevk
Copy link
Member

annevk commented Sep 18, 2024

I'm pretty happy with the tests Alex added to https://github.com/web-platform-tests/wpt/tree/master/encoding/legacy-mb-schinese. We'll be adding more as part of the GB18030-2022 work so I think I'll consider this completed now.

@annevk annevk closed this as completed Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

7 participants