Add support for UTF-16 #30

Open
natebosch opened this issue Jul 29, 2020 · 6 comments
Labels
type-enhancement A request for a change that isn't a bug

Comments

@natebosch
Member

Previously, the only way to decode UTF-16 was package:utf, which has been discontinued. We should add a UTF-16 encoder and decoder here.

@lrhn
Member

lrhn commented Sep 9, 2020

There is no great need for a converter, since String.codeUnits and String.fromCharCodes will do the job, but that also means that it should be trivial to implement. It might make sense in some situations, e.g., to use with Stream.transform.

The plain UTF-16 converter should not be an Encoding since it doesn't output bytes.
It might actually make sense to have specialized UTF-16 little-endian and UTF-16 big-endian converters, which might even be Encodings (but it's probably safest to keep them as plain Converters).
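
A minimal sketch of the plain code-unit converter described above, assuming hypothetical class names `Utf16CodeUnitDecoder` and `Utf16CodeUnitEncoder` (neither exists in dart:convert). It only wraps String.fromCharCodes and String.codeUnits, and it extends Converter rather than Encoding because it works on code units, not bytes.

```dart
import 'dart:convert';

/// Hypothetical converter from UTF-16 code units (not bytes) to a String.
class Utf16CodeUnitDecoder extends Converter<List<int>, String> {
  const Utf16CodeUnitDecoder();

  @override
  String convert(List<int> input) => String.fromCharCodes(input);
}

/// Hypothetical inverse: a String to its UTF-16 code units.
class Utf16CodeUnitEncoder extends Converter<String, List<int>> {
  const Utf16CodeUnitEncoder();

  @override
  List<int> convert(String input) => input.codeUnits;
}
```

This sketch only covers one-shot `convert` calls. To use it with Stream.transform, the classes would most likely also need to override `startChunkedConversion`, since the default Converter implementation does not support chunked conversion.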

@lrhn added the type-enhancement (A request for a change that isn't a bug) label on Sep 9, 2020
@natebosch
Member Author

There is a difference between the package:utf implementation of decodeUtf16 and using String.fromCharCodes.

The former could decode the bytes [0xFE, 0xFF, 0x6C, 0x34] into 水 (U+6C34). To get the same character using String.fromCharCodes, you first need to convert the bytes to char codes; it expects [0xFEFF, 0x6C34] as input.
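
The byte-to-char-code step can be sketched as follows. The helper names `bigEndianBytesToCodeUnits` and `decodeBigEndianExample` are hypothetical, and the sketch assumes big-endian input and treats a leading 0xFEFF as a byte order mark to skip.

```dart
/// Hypothetical helper: combines big-endian byte pairs into UTF-16 code units.
List<int> bigEndianBytesToCodeUnits(List<int> bytes) {
  if (bytes.length.isOdd) {
    throw FormatException('UTF-16 input must have an even number of bytes');
  }
  return [
    // The first byte of each pair is the high byte in big-endian order.
    for (var i = 0; i < bytes.length; i += 2) (bytes[i] << 8) | bytes[i + 1],
  ];
}

/// Hypothetical helper: decodes big-endian UTF-16 bytes, skipping a leading
/// byte order mark (0xFEFF) if one is present.
String decodeBigEndianExample(List<int> bytes) {
  final units = bigEndianBytesToCodeUnits(bytes);
  final start = units.isNotEmpty && units.first == 0xFEFF ? 1 : 0;
  return String.fromCharCodes(units, start);
}
```

With the bytes from the example, `bigEndianBytesToCodeUnits([0xFE, 0xFF, 0x6C, 0x34])` yields [0xFEFF, 0x6C34], which is the input String.fromCharCodes expects.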

@lrhn
Member

lrhn commented Sep 10, 2020

Exactly, that's what I was alluding to with a UTF-16 little/big-endian converter, which is a byte-to-string converter, not a code-unit-to-string converter. Your example appears to be big-endian (aka network order).

I would expect a plain Utf16Converter to convert from UTF-16 to String, and UTF-16 is code units, not bytes representing code units.
We can do all of these, but the endian-based converters are likely more useful.
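
One way the endian-based converters could look, as a sketch only: the class `Utf16BytesDecoder` and the constants `utf16beDecoder` and `utf16leDecoder` are hypothetical names, not part of dart:convert. The byte order is fixed per instance, and a byte order mark is not stripped.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical byte-to-string decoder for UTF-16 with a fixed byte order.
class Utf16BytesDecoder extends Converter<List<int>, String> {
  final Endian endian;

  const Utf16BytesDecoder(this.endian);

  @override
  String convert(List<int> input) {
    if (input.length.isOdd) {
      throw FormatException('UTF-16 input must have an even number of bytes');
    }
    // Copy into a typed buffer so ByteData can read 16-bit values
    // in the requested byte order.
    final data = ByteData.sublistView(Uint8List.fromList(input));
    final units = Uint16List(input.length ~/ 2);
    for (var i = 0; i < units.length; i++) {
      units[i] = data.getUint16(i * 2, endian);
    }
    // A leading 0xFEFF is kept as a character; this sketch does not strip it.
    return String.fromCharCodes(units);
  }
}

const utf16beDecoder = Utf16BytesDecoder(Endian.big);
const utf16leDecoder = Utf16BytesDecoder(Endian.little);
```

Keeping these as plain Converters, as suggested above, avoids committing to the full Encoding interface; the matching encoders would be the mirror image using `setUint16`.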

@michalt

michalt commented Dec 2, 2021

It seems that we do have a few uses of UTF-16 decoding internally. It's definitely not a high priority (there are literally about 4 uses of this), but we'll probably want to have something in dart:convert to support this use case, as it will block the null safety migration eventually.

@Dersh

Dersh commented Oct 13, 2022

This is very useful for emoji support. A simple task such as highlighting a searched emoji in text takes too much time and too many lines of code without UTF-16 support.

@timobaehr

I extracted these lines of code from the utf library (https://pub.dev/packages/utf). I'm not sure about the legal requirements; the code is licensed under BSD-3.

/// Invalid codepoints or encodings may be substituted with the value U+fffd.
const int _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT = 0xfffd;

const int _UNICODE_BYTE_ZERO_MASK = 0xff;
const int _UNICODE_BYTE_ONE_MASK = 0xff00;

const int _UNICODE_VALID_RANGE_MAX = 0x10ffff;
const int _UNICODE_PLANE_ONE_MAX = 0xffff;

const int _UNICODE_UTF16_RESERVED_LO = 0xd800;
const int _UNICODE_UTF16_RESERVED_HI = 0xdfff;
const int _UNICODE_UTF16_OFFSET = 0x10000;
const int _UNICODE_UTF16_SURROGATE_UNIT_0_BASE = 0xd800;
const int _UNICODE_UTF16_SURROGATE_UNIT_1_BASE = 0xdc00;
const int _UNICODE_UTF16_HI_MASK = 0xffc00;
const int _UNICODE_UTF16_LO_MASK = 0x3ff;

/// Produce a list of UTF-16LE encoded bytes. This method produces UTF-16LE
/// bytes with no BOM.
List<int> encodeUtf16le(String str) {
  final utf16CodeUnits = _stringToUtf16CodeUnits(str);
  final encoding = List<int>.filled(2 * utf16CodeUnits.length, -1);
  var i = 0;
  for (final unit in utf16CodeUnits) {
    encoding[i++] = unit & _UNICODE_BYTE_ZERO_MASK;
    encoding[i++] = (unit & _UNICODE_BYTE_ONE_MASK) >> 8;
  }
  return encoding;
}

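List<int> _stringToUtf16CodeUnits(String str) {
  // Iterate code points (runes), not code units: str.codeUnits already holds
  // surrogate pairs, which the encoder below would replace with U+FFFD.
  return codepointsToUtf16CodeUnits(str.runes.toList());
}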

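/// Encode code points as UTF-16 code units. Values in the surrogate range or
/// above U+10FFFF are replaced with [replacementCodepoint]. In this extract,
/// [offset] and [length] are accepted but ignored.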
List<int> codepointsToUtf16CodeUnits(List<int> codepoints,
    {int offset = 0,
      int? length,
      int replacementCodepoint = _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT}) {
  final listRange = codepoints;
  var encodedLength = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI && value <= _UNICODE_PLANE_ONE_MAX)) {
      encodedLength++;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      encodedLength += 2;
    } else {
      encodedLength++;
    }
  }

  final codeUnitsBuffer = List<int>.filled(encodedLength, -1);
  var j = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI && value <= _UNICODE_PLANE_ONE_MAX)) {
      codeUnitsBuffer[j++] = value;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      var base = value - _UNICODE_UTF16_OFFSET;
      codeUnitsBuffer[j++] = _UNICODE_UTF16_SURROGATE_UNIT_0_BASE +
          ((base & _UNICODE_UTF16_HI_MASK) >> 10);
      codeUnitsBuffer[j++] =
          _UNICODE_UTF16_SURROGATE_UNIT_1_BASE + (base & _UNICODE_UTF16_LO_MASK);
    } else {
      codeUnitsBuffer[j++] = replacementCodepoint;
    }
  }
  return codeUnitsBuffer;
}
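
The extracted code above only covers encoding. A decoding counterpart could look roughly like this; the function names `utf16CodeUnitsToCodepoints` and `decodeUtf16le` are reused here for illustration and this is a sketch under those assumptions, not the package:utf implementation.

```dart
/// Hypothetical inverse of the encoder: combines surrogate pairs into code
/// points and replaces unpaired surrogates with [replacementCodepoint].
List<int> utf16CodeUnitsToCodepoints(List<int> units,
    {int replacementCodepoint = 0xfffd}) {
  final codepoints = <int>[];
  var i = 0;
  while (i < units.length) {
    final unit = units[i++];
    if (unit < 0xd800 || unit > 0xdfff) {
      // Not a surrogate: the code unit is the code point.
      codepoints.add(unit);
    } else if (unit <= 0xdbff &&
        i < units.length &&
        units[i] >= 0xdc00 &&
        units[i] <= 0xdfff) {
      // High surrogate followed by a low surrogate.
      codepoints.add(0x10000 + ((unit - 0xd800) << 10) + (units[i++] - 0xdc00));
    } else {
      // Unpaired surrogate.
      codepoints.add(replacementCodepoint);
    }
  }
  return codepoints;
}

/// Decodes UTF-16LE bytes with no BOM handling. A trailing odd byte is
/// dropped in this sketch.
String decodeUtf16le(List<int> bytes) {
  final units = [
    // The first byte of each pair is the low byte in little-endian order.
    for (var i = 0; i + 1 < bytes.length; i += 2) bytes[i] | (bytes[i + 1] << 8),
  ];
  return String.fromCharCodes(utf16CodeUnitsToCodepoints(units));
}
```

For example, the little-endian bytes [0x3D, 0xD8, 0x00, 0xDE] form the surrogate pair 0xD83D, 0xDE00, which combines into the single code point U+1F600.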
