Add support for UTF-16 #30

natebosch · 2020-07-29T20:51:57Z

The only way we had to decode UTF-16 previously was package:utf which has been discontinued. We should add a utf16 encoder and decoder here.


lrhn · 2020-09-09T13:00:35Z

There is no great need for a converter, since String.codeUnits and String.fromCharCodes will do the job, but that also means that it should be trivial to implement. It might make sense in some situations, e.g., to use with Stream.transform.
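The code-unit round trip described above can be sketched as follows. The two helper names are hypothetical; they only wrap the existing `String.codeUnits` getter and `String.fromCharCodes` constructor.

```dart
/// Hypothetical helper: a Dart String already stores UTF-16 code units,
/// so "encoding" to code units is just reading them out.
List<int> stringToUtf16CodeUnits(String s) => s.codeUnits;

/// Hypothetical helper: rebuilds a String from UTF-16 code units.
/// Surrogate halves are kept as-is, so pairs recombine naturally.
String utf16CodeUnitsToString(Iterable<int> codeUnits) =>
    String.fromCharCodes(codeUnits);
```

Note that both directions work on 16-bit code units, not bytes, so no byte order is involved at this level.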

The plain UTF-16 converter should not be an Encoding since it doesn't output bytes.
It might actually make sense to have specialized UTF-16 little-endian and UTF-16 big-endian converters, which might even be Encodings (but it's probably safest to keep them as plain Converters).

natebosch · 2020-09-09T18:12:49Z

There is a difference between the package:utf implementation of decodeUtf16 and using String.fromCharCodes.

The former could decode the bytes [0xFE, 0xFF, 0x6C, 0x34] into 水. To get the same character using String.fromCharCodes, you first need to convert the bytes to char codes: it expects [0xFEFF, 0x6C34] as input.
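A minimal sketch of that byte-to-char-code step, assuming big-endian input as in the example. The function names are hypothetical, and this is not the package:utf implementation. Note that `String.fromCharCodes([0xFEFF, 0x6C34])` on its own keeps U+FEFF as the first code unit of the result, so the sketch skips a leading byte order mark to arrive at just 水.

```dart
/// Hypothetical helper: pairs up big-endian bytes into 16-bit code units.
List<int> utf16BeBytesToCodeUnits(List<int> bytes) {
  if (bytes.length.isOdd) {
    throw FormatException('UTF-16 input must have an even number of bytes');
  }
  return [
    for (var i = 0; i < bytes.length; i += 2)
      ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF),
  ];
}

/// Hypothetical helper: decodes big-endian UTF-16 bytes to a String.
String decodeUtf16BeSketch(List<int> bytes) {
  final units = utf16BeBytesToCodeUnits(bytes);
  // Skip a leading byte order mark (U+FEFF); otherwise it stays in the string.
  final start = units.isNotEmpty && units.first == 0xFEFF ? 1 : 0;
  return String.fromCharCodes(units, start);
}
```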

lrhn · 2020-09-10T08:44:41Z

Exactly, that's what I was alluding to with a UTF-16 little/big-endian converter, which is a byte-to-string converter, not a code-unit-to-string converter. Your example appears to be big-endian (aka network order).

I would expect a plain Utf16Converter to convert from UTF-16 to String, and UTF-16 is code units, not bytes representing code units.
We can do all of these, but the endian-based converters are likely more useful.
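One way an endian-based byte-to-string converter could look, as a rough sketch only. The class name `Utf16BytesDecoder` is hypothetical and is not part of dart:convert or package:convert. The byte order is passed in explicitly, so this sketch does no byte order mark detection or stripping.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical converter from UTF-16 bytes (in a fixed byte order) to String.
class Utf16BytesDecoder extends Converter<List<int>, String> {
  /// Byte order of the input, e.g. Endian.big or Endian.little.
  final Endian endian;

  const Utf16BytesDecoder(this.endian);

  @override
  String convert(List<int> input) {
    if (input.length.isOdd) {
      throw FormatException('Odd number of bytes in UTF-16 input', input);
    }
    // View the bytes through ByteData so each pair can be read as one
    // 16-bit code unit in the requested byte order.
    final data = ByteData.sublistView(Uint8List.fromList(input));
    final units = Uint16List(input.length ~/ 2);
    for (var i = 0; i < units.length; i++) {
      units[i] = data.getUint16(i * 2, endian);
    }
    return String.fromCharCodes(units);
  }
}
```

For example, `const Utf16BytesDecoder(Endian.big).convert([0x6C, 0x34])` yields 水. Chunked use (for instance with Stream.transform) would additionally need a `startChunkedConversion` override that carries a dangling byte across chunk boundaries, which this sketch does not implement.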

michalt · 2021-12-02T14:48:04Z

It seems that we do have a few uses of UTF-16 decoding internally. It's definitely not a high priority (there are literally about 4 uses of this), but we'll probably want to have something in dart:convert to support this use case, as it will block the null safety migration eventually.

Dersh · 2022-10-13T22:17:54Z

This is very useful for emoji support. Without UTF-16 support, a simple task such as highlighting a searched emoji in text takes too much time and too many lines of code.

timobaehr · 2022-12-27T19:19:07Z

I extracted these lines of code from the utf library (https://pub.dev/packages/utf). I'm not sure about the legal requirements. The code is licensed under BSD-3.

/// Invalid codepoints or encodings may be substituted with the value U+fffd.
const int _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT = 0xfffd;

const int _UNICODE_BYTE_ZERO_MASK = 0xff;
const int _UNICODE_BYTE_ONE_MASK = 0xff00;

const int _UNICODE_VALID_RANGE_MAX = 0x10ffff;
const int _UNICODE_PLANE_ONE_MAX = 0xffff;

const int _UNICODE_UTF16_RESERVED_LO = 0xd800;
const int _UNICODE_UTF16_RESERVED_HI = 0xdfff;
const int _UNICODE_UTF16_OFFSET = 0x10000;
const int _UNICODE_UTF16_SURROGATE_UNIT_0_BASE = 0xd800;
const int _UNICODE_UTF16_SURROGATE_UNIT_1_BASE = 0xdc00;
const int _UNICODE_UTF16_HI_MASK = 0xffc00;
const int _UNICODE_UTF16_LO_MASK = 0x3ff;

/// Produce a list of UTF-16LE encoded bytes. This method produces UTF-16LE
/// bytes with no BOM.
List<int> encodeUtf16le(String str) {
  final utf16CodeUnits = _stringToUtf16CodeUnits(str);
  final encoding = List<int>.filled(2 * utf16CodeUnits.length, -1);
  var i = 0;
  for (final unit in utf16CodeUnits) {
    encoding[i++] = unit & _UNICODE_BYTE_ZERO_MASK;
    encoding[i++] = (unit & _UNICODE_BYTE_ONE_MASK) >> 8;
  }
  return encoding;
}

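List<int> _stringToUtf16CodeUnits(String str) {
  // Pass code points (runes), not code units: feeding str.codeUnits in here
  // would replace both halves of every surrogate pair with U+FFFD.
  return codepointsToUtf16CodeUnits(str.runes.toList());
}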

/// Encode code points as UTF16 code units.
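List<int> codepointsToUtf16CodeUnits(List<int> codepoints,
    {int offset = 0,
    int? length,
    int replacementCodepoint = _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT}) {
  // Honor offset/length, which were otherwise accepted but ignored.
  final listRange =
      codepoints.sublist(offset, length == null ? null : offset + length);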
  var encodedLength = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI && value <= _UNICODE_PLANE_ONE_MAX)) {
      encodedLength++;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      encodedLength += 2;
    } else {
      encodedLength++;
    }
  }

  final codeUnitsBuffer = List<int>.filled(encodedLength, -1);
  var j = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI && value <= _UNICODE_PLANE_ONE_MAX)) {
      codeUnitsBuffer[j++] = value;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      var base = value - _UNICODE_UTF16_OFFSET;
      codeUnitsBuffer[j++] = _UNICODE_UTF16_SURROGATE_UNIT_0_BASE +
          ((base & _UNICODE_UTF16_HI_MASK) >> 10);
      codeUnitsBuffer[j++] =
          _UNICODE_UTF16_SURROGATE_UNIT_1_BASE + (base & _UNICODE_UTF16_LO_MASK);
    } else {
      codeUnitsBuffer[j++] = replacementCodepoint;
    }
  }
  return codeUnitsBuffer;
}
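
The surrogate-pair arithmetic used in the middle branch above can be isolated into a small worked sketch. The helper name `toSurrogatePair` is hypothetical and the constants are written out inline. For U+1F600, the steps are: base = 0x1F600 - 0x10000 = 0xF600, high = 0xD800 + (0xF600 >> 10) = 0xD83D, low = 0xDC00 + (0xF600 & 0x3FF) = 0xDE00.

```dart
/// Hypothetical helper: splits a supplementary code point
/// (U+10000..U+10FFFF) into its UTF-16 surrogate pair [high, low].
List<int> toSurrogatePair(int codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw RangeError.range(codePoint, 0x10000, 0x10FFFF, 'codePoint');
  }
  final base = codePoint - 0x10000; // 20-bit value
  final high = 0xD800 + (base >> 10); // top 10 bits
  final low = 0xDC00 + (base & 0x3FF); // bottom 10 bits
  return [high, low];
}
```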

lrhn added the type-enhancement label (a request for a change that isn't a bug) on Sep 9, 2020

ueman mentioned this issue Jun 24, 2024

Support for localization in PkPass files ueman/passkit#8

Open
