Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iso8859-1 vs windows-1252 #25851

Closed
hashseed opened this issue Jan 31, 2019 · 11 comments
Closed

iso8859-1 vs windows-1252 #25851

hashseed opened this issue Jan 31, 2019 · 11 comments
Labels
buffer Issues and PRs related to the buffer subsystem. i18n-api Issues and PRs related to the i18n implementation. string_decoder Issues and PRs related to the string_decoder subsystem.

Comments

@hashseed
Copy link
Member

hashseed commented Jan 31, 2019

This is somewhat related to #13722, but not quite.

Wikipedia contains the gist:

It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. [...] Most modern web browsers and e-mail clients treat the media type charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.

Chromium's ICU interprets "iso8859-1" to mean Windows-1252. Node.js does not. The WHATWG spec suggests Chromium's behavior to be correct.

The consequence of all of this is that when I build Node.js with Chromium's ICU, test/parallel/test-icu-transcode.js fails due to the character "€", which Windows-1252 includes, but ISO-8859-1 does not.

I propose:

  1. Change the test to pass for both interpretations.
  2. Conform to Chromium's behavior.
@hashseed
Copy link
Member Author

@srl295

@addaleax addaleax added the i18n-api Issues and PRs related to the i18n implementation. label Jan 31, 2019
@joyeecheung joyeecheung added buffer Issues and PRs related to the buffer subsystem. string_decoder Issues and PRs related to the string_decoder subsystem. labels Feb 1, 2019
@joyeecheung
Copy link
Member

The consequence of all of this is that when I build Node.js with Chromium's ICU, test/parallel/test-icu-transcode.js fails due to the character "€"

From a glance of the test I couldn't see how the encoding differences would come into play if the test itself was decoded properly. Was the test file stored in Windows-1252? And then got interpreted as utf8 by the CJS loader?

@hashseed
Copy link
Member Author

hashseed commented Feb 1, 2019

The test loads the euro sign as utf8 and converts to latin1. With iso8859-1 it doesn't encode so it gets replaced by "?".

@richardlau
Copy link
Member

Is this affected by full-icu?

@richardlau
Copy link
Member

Also #13722.

@hashseed
Copy link
Member Author

hashseed commented Feb 1, 2019

I can check again, but iirc full icu doesn't fix this issue. If it did, the full icu build would fail the above mentioned test.

@hashseed
Copy link
Member Author

hashseed commented Feb 1, 2019

Same actually also applies to "ascii". According to WHATWG it is to be interpreted as Windows-1252 as well.

@mhdawson
Copy link
Member

mhdawson commented Feb 1, 2019

@srl295 as our ICU expert.

@TimothyGu
Copy link
Member

We have a dedicated WHATWG text encoder/decoder util.TextDecoder and util.TextEncoder, which should strictly follow the semantics of the Encoding Standard. buffer.transcode() should indeed just fall back on whatever the linked ICU does. I believe #25866 is a good change because of this.

danbev pushed a commit that referenced this issue Feb 5, 2019
Chromium's ICU considers "latin1" and "ascii" to mean Windows-1252,
which is consistent with
[WHATWG spec](https://encoding.spec.whatwg.org/#names-and-labels). If
linked against Chromium's ICU, Node.js therefore fails
`test/parallel/test-icu-transcode`.

PR-URL: #25866
Refs: #25851
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
@hashseed
Copy link
Member Author

hashseed commented Feb 5, 2019

Test fix landed.

@hashseed hashseed closed this as completed Feb 5, 2019
addaleax pushed a commit that referenced this issue Feb 6, 2019
Chromium's ICU considers "latin1" and "ascii" to mean Windows-1252,
which is consistent with
[WHATWG spec](https://encoding.spec.whatwg.org/#names-and-labels). If
linked against Chromium's ICU, Node.js therefore fails
`test/parallel/test-icu-transcode`.

PR-URL: #25866
Refs: #25851
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
@srl295
Copy link
Member

srl295 commented Feb 14, 2019

sorry to miss this, but the fix looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. i18n-api Issues and PRs related to the i18n implementation. string_decoder Issues and PRs related to the string_decoder subsystem.
Projects
None yet
Development

No branches or pull requests

7 participants