gb18030 encoding/decoding support #57
Comments
I updated the table. Something went wrong with the scoring for Safari in the previous version. Should now reflect reality. |
Note to self: Chrome shows the 3 failures for decode errors, but when the API ( |
FYI, Chromium may soon change the decoding table to map 28 byte sequences that used to be mapped to PUA code points (completely useless and even harmful on platforms where there's no font to cover those PUA code points) to regular Unicode characters (see #22, #27 and http://crbug.com/645783 ). |
wrt Safari support for GB 18030 (and probably other encodings), there's a discussion at https://bugs.webkit.org/show_bug.cgi?id=159891 that people following the Encoding issues may be able to help with. |
I'm loath to jump in on that bug, but ISTM the answer for WebKit should be "normalize on input". Once it's in the DOM, normalization should not happen, as @r12a points out. |
I added a comment to that effect. |
The tests seem to disagree with the spec on the handling of ASCII bytes as part of a malformed sequence when decoding: |
@hsivonen just so you know, i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere. I've had even less time than normal lately because of various distractions. |
Great. Thank you. |
So, I've been helping rebase @r12a's pull requests, fix lint errors, and address some review comments from web-platform-tests/wpt#3194 that apply to all of them. I am happy to continue doing that as I have it down to a pretty fast process. Which means if @r12a can just edit his remaining WPT PRs with the normative changes and then ping the appropriate thread, I am happy to carry things through to the finish line. Woohoo! |
@hsivonen wrt #57 (comment), i have stepped through the conversion for the first two tests you mentioned several times, using the debugger alongside the spec text, and i still come up with the results expected by the test, rather than the results i get from Firefox (nightly). Are you able to point out for me why the test produces a different result from FF? Here's a link to the test: https://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-decode-errors.html Thanks. |
I looked at the step 2 test. Below the iterations and the results:
So @r12a is correct. |
@hsivonen ping wrt #57 (comment) |
Sorry about the delay. Firefox, Chrome and Safari agree with each other on the 3 remaining failures. Edge is closer to the other browsers than to the spec. So I think this is a spec bug. (And the tests reflect the spec.) |
Hmm. The Firefox situation might be confused somewhere between the Chrome/Safari behavior and the spec behavior. I need to investigate this more. |
OK. Here's what's happening: Firefox implements the spec, but the test case doesn't test the spec. The test case expectations are written as if there was EOF after each examined sequence. However, the test input is not I will need to test what Chrome, Safari and Edge do when the sequences actually end in EOF, but my tentative opinion is that it's bad for the spec to collapse a bogus sequence of bytes to a different output when the bogus sequence is followed by EOF vs. when it's followed by something else that's not a valid continuation of the sequence. |
I wrote demos that exercise both the followed-by-end-tag case and the followed-by-EOF case. The spec, Firefox, Chrome and Safari agree on these, so I think it's best not to change the spec even though it is rather unfortunate for the treatment of the bogus byte sequence to differ depending on what comes after. In conclusion, this is a test case bug after all. |
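For readers following along, the EOF vs. non-EOF difference discussed above can be reproduced with a small model of the decoder's state machine. This is a minimal sketch, not the demos mentioned in the thread and not any browser's code. It follows the gb18030 decoder steps as I understand them from the Encoding Standard; the names `decode_sketch` and `lookup` are made up for illustration, and `lookup` leans on Python's own `gb18030` codec as a stand-in for the spec's index, which is an assumption (the two tables are not guaranteed to agree for every sequence).

```python
from collections import deque

REPLACEMENT = "\ufffd"

def lookup(seq: bytes):
    """Stand-in for the spec's index lookup (assumption: uses Python's codec)."""
    try:
        return seq.decode("gb18030")
    except UnicodeDecodeError:
        return None

def decode_sketch(data: bytes) -> str:
    """Model of the gb18030 decoder's error handling (hypothetical helper)."""
    queue = deque(data)
    out = []
    first = second = third = 0
    while True:
        if not queue:
            # End of input: a pending partial sequence yields a single error.
            if first or second or third:
                out.append(REPLACEMENT)
            return "".join(out)
        byte = queue.popleft()
        if third:
            if not 0x30 <= byte <= 0x39:
                # Put second, third and the current byte back, in that order.
                queue.extendleft([byte, third, second])
                first = second = third = 0
                out.append(REPLACEMENT)
                continue
            char = lookup(bytes([first, second, third, byte]))
            first = second = third = 0
            out.append(char if char is not None else REPLACEMENT)
        elif second:
            if 0x81 <= byte <= 0xFE:
                third = byte
                continue
            # Put second and the current byte back, in that order.
            queue.extendleft([byte, second])
            first = second = 0
            out.append(REPLACEMENT)
        elif first:
            if 0x30 <= byte <= 0x39:
                second = byte
                continue
            lead, first = first, 0
            char = None
            if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFE:
                char = lookup(bytes([lead, byte]))
            if char is not None:
                out.append(char)
                continue
            if byte <= 0x7F:
                # An ASCII byte is re-processed on its own after the error.
                queue.appendleft(byte)
            out.append(REPLACEMENT)
        elif byte <= 0x7F:
            out.append(chr(byte))
        elif byte == 0x80:
            out.append("\u20ac")
        elif 0x81 <= byte <= 0xFE:
            first = byte
        else:
            out.append(REPLACEMENT)

# Truncated four-byte sequence followed by EOF vs. followed by "<".
print(ascii(decode_sketch(b"\x81\x30\x81")))
print(ascii(decode_sketch(b"\x81\x30\x81<")))
```

In this model the truncated sequence `0x81 0x30 0x81` collapses to one U+FFFD at end of input, but produces U+FFFD, `0`, U+FFFD when a `<` follows, because the unconsumed bytes are put back and decoded again.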
@r12a are you planning on updating the tests? |
Some time ago, the Encoding Standard started mapping the two bytes "0xA3 0xA0" to "U+3000" rather than U+E5E5 "to be compatible with deployed content". Do the benefits of this mapping still outweigh the disadvantages even today? The answer depends largely on:
|
Given that Firefox/Chrome/Safari have the same behavior, we'd need data indicating that changing implementations strongly improves compatibility with deployed content. This is a variation of your 2nd point. If the number of sites using "0xA3 0xA0" intending U+E5E5 has significantly increased, it would be a consideration. I don't think the other points would be directly relevant to implementors making a decision. |
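As a side note for anyone checking the "0xA3 0xA0" case against the index, the two-byte pointer arithmetic can be sketched as below. This is an illustrative helper only: the function name `two_byte_pointer` is hypothetical, and the formula is the two-byte pointer computation as I read it in the Encoding Standard's gb18030 decoder. It computes only the pointer and does not consult the index itself.

```python
def two_byte_pointer(lead: int, trail: int):
    """Pointer into the index for a two-byte sequence, or None if malformed."""
    if not 0x81 <= lead <= 0xFE:
        return None
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        return None
    # Trail bytes skip 0x7F, so the offset differs on either side of it.
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)

print(two_byte_pointer(0xA3, 0xA0))  # pointer for the sequence discussed above
print(two_byte_pointer(0xA1, 0xA1))  # pointer for another two-byte sequence
```

The pointer for `0xA3 0xA0` comes out as 6555, which is the index entry whose mapping is being debated here.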
Same problem as with the gbk tests, upstreaming never completed: web-platform-tests/wpt#20361. |
I'm pretty happy with the tests Alex added to https://github.com/web-platform-tests/wpt/tree/master/encoding/legacy-mb-schinese. We'll be adding more as part of the GB18030-2022 work so I think I'll consider this completed now. |
Results for a series of tests for gb18030 encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030
The tests can be run from that page (select the link in the left-most column) or obtained from the WPT repo. There is a PR at
web-platform-tests/wpt#3195
The tests check whether:
The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.
Notes:
Can we please investigate the failures to ascertain whether:
The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/
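For quick offline checks, a similar byte/character conversion can be sketched in a few lines of Python. This is not the tool linked above; the helper names `bytes_to_text` and `text_to_hex` are made up for illustration, and Python's built-in `gb18030` codec is assumed as the converter. That codec is not guaranteed to match the Encoding Standard's tables for every sequence, so treat differences as something to investigate rather than as browser bugs.

```python
def bytes_to_text(hex_bytes: str, encoding: str = "gb18030") -> str:
    """Decode space-separated hex bytes; undecodable input becomes U+FFFD."""
    return bytes.fromhex(hex_bytes).decode(encoding, errors="replace")

def text_to_hex(text: str, encoding: str = "gb18030") -> str:
    """Encode text and show the result as upper-case, space-separated hex."""
    return text.encode(encoding).hex(" ").upper()

print(text_to_hex("中"))            # two-byte sequence
print(text_to_hex("\u0080"))        # four-byte sequence
print(ascii(bytes_to_text("D6 D0")))
```

`bytes.hex()` with a separator needs Python 3.8 or later.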