Encodings used in iconv-lite mismatch the WHATWG Encoding spec significantly

Previously: #360, now reopened with more data
Refs: https://github.com/jsdom/whatwg-encoding/issues/22

> [!NOTE]  
> If interop with WHATWG Encoding is a non-target, feel free to close this.
> Documenting the discrepancies would still be helpful, though.

In this image, the [`whatwg-encoding`](https://npmjs.com/whatwg-encoding) column shows what `iconv-lite` does: `whatwg-encoding` is a wrapper on top of `iconv-lite`, so I did not create a separate column.

<img width="1115" height="882" alt="Image" src="https://github.com/user-attachments/assets/3b88e726-f4af-437e-8419-7feaf47e2b02" />

Spec used: https://encoding.spec.whatwg.org/

1. Half of the single-byte encodings, including `windows-1252`, don't match the spec and decode differently

    E.g., even for the most basic `windows-1252` encoding:
    ```js
    > require('iconv-lite').decode(Uint8Array.of(0x8d, 0x8f, 0x90), 'windows-1252')
    '���'
    > require('@exodus/bytes/encoding.js').legacyHookDecode(Uint8Array.of(0x8d, 0x8f, 0x90), 'windows-1252')
    '\x8D\x8F\x90'
    ```

    The latter behavior is correct, see [the mapping from the Encoding spec](https://encoding.spec.whatwg.org/index-windows-1252.txt)

2. `utf-8` is wrong when bundled, because the https://npmjs.com/buffer polyfill is wrong and `iconv-lite` uses it instead of a clean implementation
3. `utf-16` is wrong because it doesn't produce well-formed strings (the output can contain lone surrogates)
4. None of the multi-byte encodings match the decoders in the WHATWG Encoding spec
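
To make item 1 concrete, here is a minimal sketch of a table-driven `windows-1252` decoder. The table was transcribed by hand from the spec index linked above, so it should be checked against that file before being relied on. `WINDOWS_1252_HIGH` and `decodeWindows1252` are hypothetical names, not part of `iconv-lite` or any other library.

```js
// Code points for bytes 0x80..0x9F, hand-transcribed from
// https://encoding.spec.whatwg.org/index-windows-1252.txt (assumption: verify before use).
// Note that 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to C1 controls, not to U+FFFD.
const WINDOWS_1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

function decodeWindows1252(bytes) {
  let out = '';
  for (const byte of bytes) {
    // Only 0x80..0x9F are remapped; every other byte maps to the same code point.
    const remapped = byte >= 0x80 && byte <= 0x9f;
    out += String.fromCharCode(remapped ? WINDOWS_1252_HIGH[byte - 0x80] : byte);
  }
  return out;
}

// The bytes from the example above come out as C1 controls, matching the spec mapping.
console.log(JSON.stringify(decodeWindows1252(Uint8Array.of(0x8d, 0x8f, 0x90))));
```

Every byte has a mapping in this index, so the decoder never needs to emit U+FFFD for `windows-1252`.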

I can test `iconv-lite` separately in more depth, but I have confirmed that all of these discrepancies also happen on pure `iconv-lite`

If interop is desired:
1. Fix all single-byte mappings (here they are: https://encoding.spec.whatwg.org/#legacy-single-byte-encodings)
2. Replace the `utf-8` decoder with a compliant one (global TextDecoder with ignoreBOM is _usually_ fine unless you are streaming)
3. Replace the `utf-16` decoder with a compliant one (global TextDecoder with ignoreBOM is fine _unless you are running in Node.js without ICU, where utf-16le is exposed but broken and utf-16be does not exist_)
4. Adjust the legacy multi-byte decoders to follow the Encoding spec (and likely the encoders too, as the spec describes those as well)
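
Steps 2 and 3 could look roughly like the sketch below. It assumes a runtime with a working global `TextDecoder` and covers only the non-streaming case; `decodeUtf8`, `decodeUtf16` and `hasLoneSurrogate` are hypothetical helper names, not existing `iconv-lite` APIs.

```js
// Sketch only: assumes the global TextDecoder behaves per the Encoding spec.
// fatal defaults to false, so malformed input is replaced with U+FFFD.
// ignoreBOM: true keeps a leading BOM in the output instead of stripping it.
const utf8Decoder = new TextDecoder('utf-8', { ignoreBOM: true });

function decodeUtf8(bytes) {
  return utf8Decoder.decode(bytes); // non-streaming call
}

function decodeUtf16(bytes, label = 'utf-16le') {
  let decoder;
  try {
    decoder = new TextDecoder(label, { ignoreBOM: true });
  } catch (err) {
    // Unsupported labels throw a RangeError; a custom decoder would be needed here.
    throw new Error(`no native ${label} decoder available`, { cause: err });
  }
  return decoder.decode(bytes);
}

// A string is well-formed when it contains no lone surrogates.
// Iterating by code point yields paired surrogates as a single code point >= 0x10000.
function hasLoneSurrogate(str) {
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xd800 && cp <= 0xdfff) return true;
  }
  return false;
}

console.log(JSON.stringify(decodeUtf8(Uint8Array.of(0x61, 0xff)))); // invalid byte 0xFF is replaced
console.log(hasLoneSurrogate('a\uD800')); // a lone lead surrogate is not well-formed
```

A check like `hasLoneSurrogate` can be run on any decoder's output to see whether it produces well-formed strings; runtimes that ship `String.prototype.isWellFormed` offer the same check natively.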

For some of that, you could check how I did it in https://github.com/ExodusOSS/bytes 🙃
It also exposes utf8/utf16 encoders/decoders and single-byte/legacy multi-byte decoders, but I doubt you want to depend on it, as that would increase the tables size 1.5x.
Improving the approach here based on that impl could be nice though
