Encodings used in iconv-lite mismatch the WHATWG Encoding spec significantly #367

@ChALkeR

Previously: #360, now reopened with more data
Refs: jsdom/whatwg-encoding#22

Note

If interop with the WHATWG Encoding spec is not a goal, feel free to close this.
Documenting the discrepancies would still be helpful, though.

In this image, the whatwg-encoding column also shows what iconv-lite does: whatwg-encoding is a wrapper on top of iconv-lite, so I did not create a separate column.

Image

Spec used: https://encoding.spec.whatwg.org/

  1. Half of the single-byte encodings, including windows-1252, don't match the spec and decode differently

    E.g., even for the most basic windows-1252 encoding:

    > require('iconv-lite').decode(Uint8Array.of(0x8d, 0x8f, 0x90), 'windows-1252')
    '���'
    > require('@exodus/bytes/encoding.js').legacyHookDecode(Uint8Array.of(0x8d, 0x8f, 0x90), 'windows-1252')
    '\x8D\x8F\x90'

    The latter behavior is correct, see the mapping from the Encoding spec

  2. utf-8 is wrong when bundled, because the https://npmjs.com/buffer polyfill is wrong and iconv-lite uses that instead of a clean implementation

  3. utf-16 is wrong because it doesn't produce well-formed strings

  4. None of the multi-byte encodings match the decoders in the WHATWG Encoding spec
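
To illustrate point 1, here is a minimal sketch of what the spec's windows-1252 index implies for decoding. The helper name `decodeWindows1252` and the table constant are hypothetical, written only for this example. They are not part of iconv-lite or @exodus/bytes. The table covers bytes 0x80 to 0x9F, where the five bytes without a printable assignment (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the C1 control of the same value instead of U+FFFD.

```javascript
// Code points for bytes 0x80..0x9F, following the WHATWG index-windows-1252 table.
const WINDOWS_1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Hypothetical helper (an illustration, not an existing library API).
function decodeWindows1252(bytes) {
  let out = '';
  for (const b of bytes) {
    // Bytes 0x00..0x7F and 0xA0..0xFF map to the code point of the same value.
    const codePoint = b >= 0x80 && b <= 0x9f ? WINDOWS_1252_HIGH[b - 0x80] : b;
    out += String.fromCharCode(codePoint);
  }
  return out;
}

const decoded = decodeWindows1252(Uint8Array.of(0x8d, 0x8f, 0x90));
console.log(decoded === '\x8D\x8F\x90'); // true
```

Every byte has a mapping in this index, so a spec-style windows-1252 decoder never needs to emit a replacement character.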

I can test iconv-lite separately in more depth, but I have already confirmed that all of these discrepancies also occur with pure iconv-lite

If interop is desired:

  1. Fix all single-byte mappings (here they are: https://encoding.spec.whatwg.org/#legacy-single-byte-encodings)
  2. Replace the utf-8 decoder with a compliant one (the global TextDecoder with ignoreBOM is usually fine unless you are streaming)
  3. Replace the utf-16 decoder with a compliant one (the global TextDecoder with ignoreBOM is fine unless you are running in Node.js without ICU, where utf-16le is exposed but broken and utf-16be does not exist)
  4. Adjust the legacy multi-byte decoders to behave according to the Encoding spec (and likely the encoders too, as the spec describes those as well)
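
As a rough sketch of suggestions 2 and 3, assuming non-streaming input and a runtime whose global TextDecoder is itself compliant, decoding could be delegated like this. The wrapper name `decodeWithPlatform` is hypothetical and is not an iconv-lite API. The example only exercises utf-8, since utf-16 support depends on the runtime as noted above.

```javascript
// Hypothetical wrapper around the global TextDecoder (illustration only).
function decodeWithPlatform(bytes, label) {
  // ignoreBOM: true keeps a leading BOM in the output instead of silently
  // stripping it, so the caller decides how to handle it.
  const decoder = new TextDecoder(label, { ignoreBOM: true });
  return decoder.decode(bytes);
}

// An invalid utf-8 byte (0xFF) is replaced with U+FFFD.
const withInvalidByte = decodeWithPlatform(Uint8Array.of(0x61, 0xff, 0x62), 'utf-8');
console.log(withInvalidByte === 'a\uFFFDb'); // true

// With ignoreBOM set, the BOM bytes EF BB BF survive as U+FEFF.
const withBom = decodeWithPlatform(Uint8Array.of(0xef, 0xbb, 0xbf, 0x61), 'utf-8');
console.log(withBom === '\uFEFFa'); // true
```

A new decoder is created per call here, which is fine for one-shot decoding. Streaming would need a long-lived decoder and the `stream` option, which this sketch does not cover.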

For some of that, you could check how I did it in https://github.com/ExodusOSS/bytes 🙃
It also exposes utf8/utf16 encoders/decoders and single-byte/legacy multi-byte decoders, but I doubt you want to depend on it, as that would increase the tables size 1.5x.
Improving the approach here based on that implementation could be nice, though.
