Previously: #360, now reopened with more data
Refs: jsdom/whatwg-encoding#22
Note
If interop with WHATWG Encoding is a non-goal, feel free to close this.
Documenting the discrepancies would still be helpful, though.
In this image, the whatwg-encoding column shows what iconv-lite does (whatwg-encoding is a wrapper on top of iconv-lite, so I did not create a separate column)
Spec used: https://encoding.spec.whatwg.org/
- Half of single-byte encodings including `windows-1252` don't match the spec and decode differently

  E.g., even for the most basic `windows-1252` encoding:

      > require('iconv-lite').decode(Uint8Array.of(0x8d, 0x8f, 0x90), 'windows-1252')
      '���'
      > require('@exodus/bytes/encoding.js').legacyHookDecode(Uint8Array.of(0x8d, 0x8f, 0x90), 'windows-1252')
      '\x8D\x8F\x90'

  The latter behavior is correct, see the mapping from the Encoding spec
- `utf-8` is wrong when bundled, because the https://npmjs.com/buffer polyfill is wrong and `iconv-lite` uses that instead of a clean impl.
- `utf-16` is wrong because it doesn't produce well-formed strings
- All of the multi-byte encodings don't match the decoders in the WHATWG Encoding spec
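
To illustrate the `windows-1252` item above, here is a minimal sketch of a decoder that follows the index-windows-1252 table from the Encoding spec. `decodeWindows1252` and `WINDOWS_1252_HIGH` are hypothetical names used only for illustration; they are not part of iconv-lite or `@exodus/bytes`. Only bytes 0x80..0x9F need a lookup table, since every other byte maps to the code point with the same value.

```javascript
// Code points for bytes 0x80..0x9F per the WHATWG index-windows-1252 table.
// Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the C1 controls with the same
// value; they are not decoding errors.
const WINDOWS_1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Hypothetical helper (illustration only).
function decodeWindows1252(bytes) {
  let out = '';
  for (const byte of bytes) {
    const needsLookup = byte >= 0x80 && byte <= 0x9f;
    out += String.fromCharCode(needsLookup ? WINDOWS_1252_HIGH[byte - 0x80] : byte);
  }
  return out;
}

const decoded = decodeWindows1252(Uint8Array.of(0x8d, 0x8f, 0x90));
console.log(decoded === '\x8D\x8F\x90'); // true
```
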
I can test iconv-lite separately further, but I have confirmed that all of these discrepancies also happen with pure iconv-lite
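
For the `utf-16` item, "well-formed" means the decoded string contains no unpaired surrogates. Below is a rough sketch of a one-shot UTF-16LE decoder modeled on the state machine the Encoding spec describes, where each unpaired surrogate or truncated code unit becomes U+FFFD. `decodeUtf16le` is a hypothetical name, and the sketch ignores BOM handling and streaming.

```javascript
// Hypothetical helper (illustration only): one-shot UTF-16LE decoding that
// always returns a well-formed string.
function decodeUtf16le(bytes) {
  let out = '';
  let leadByte = -1;      // first byte of a pending code unit, or -1
  let leadSurrogate = -1; // pending lead surrogate, or -1
  for (const byte of bytes) {
    if (leadByte === -1) {
      leadByte = byte;
      continue;
    }
    const unit = (byte << 8) | leadByte; // little-endian code unit
    leadByte = -1;
    if (leadSurrogate !== -1) {
      const lead = leadSurrogate;
      leadSurrogate = -1;
      if (unit >= 0xdc00 && unit <= 0xdfff) {
        out += String.fromCharCode(lead, unit); // valid surrogate pair
        continue;
      }
      out += '\ufffd'; // unpaired lead surrogate; `unit` is processed below
    }
    if (unit >= 0xd800 && unit <= 0xdbff) {
      leadSurrogate = unit; // wait for a trail surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      out += '\ufffd'; // unpaired trail surrogate
    } else {
      out += String.fromCharCode(unit);
    }
  }
  // Input ended inside a code unit or after a lead surrogate: one error.
  if (leadByte !== -1 || leadSurrogate !== -1) out += '\ufffd';
  return out;
}

console.log(decodeUtf16le(Uint8Array.of(0x00, 0xd8, 0x41, 0x00)) === '\ufffdA'); // true
```
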
If interop is desired:
- Fix all single-byte mappings (here they are: https://encoding.spec.whatwg.org/#legacy-single-byte-encodings)
- Replace the `utf-8` decoder with a compliant one (global TextDecoder with ignoreBOM is usually fine unless you are decoding in streaming mode)
- Replace the `utf-16` decoder with a compliant one (global TextDecoder with ignoreBOM is fine unless you are running in Node.js without ICU, where utf-16le is exposed but broken and utf-16be does not exist)
- Adjust the legacy multi-byte decoders to behave per the Encoding spec (and likely the encoders too, the spec describes those as well)
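
A minimal sketch of the TextDecoder suggestion for `utf-8`, assuming a runtime that provides the global `TextDecoder` and one-shot (non-streaming) decoding. `decodeUtf8` is a hypothetical wrapper name, not an existing iconv-lite function. With `ignoreBOM: true` the decoder keeps a leading BOM in the output instead of stripping it, so any BOM handling stays with the caller.

```javascript
// Global TextDecoder with ignoreBOM, non-fatal mode (errors become U+FFFD).
const utf8Decoder = new TextDecoder('utf-8', { ignoreBOM: true });

// Hypothetical wrapper (illustration only); one-shot decoding, no streaming.
function decodeUtf8(bytes) {
  return utf8Decoder.decode(bytes);
}

const withBom = decodeUtf8(Uint8Array.of(0xef, 0xbb, 0xbf, 0x68, 0x69));
console.log(withBom === '\ufeffhi'); // true: the BOM is preserved as U+FEFF

const malformed = decodeUtf8(Uint8Array.of(0x61, 0xff, 0x62));
console.log(malformed === 'a\ufffdb'); // true: the invalid byte becomes U+FFFD
```
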
For some of that, you could check how I did it in https://github.com/ExodusOSS/bytes 🙃
It also exposes utf8/utf16 encoders/decoders and single-byte/legacy multi-byte decoders, but I doubt you want to depend on it, as that would increase the table size 1.5x
Improving the approach here based on that implementation could be nice, though