This repository was archived by the owner on Jul 31, 2018. It is now read-only.

Commit 71b757c

committed
Updating
1 parent bedc674 commit 71b757c

File tree

1 file changed

+24
-71
lines changed

XXX-icu-module.md

Lines changed: 24 additions & 71 deletions
@@ -2,123 +2,76 @@
22
|--------|-----------------------------|
33
| Author | @jasnell |
44
| Status | DRAFT |
5-
| Date | 2016-04-20 |
5+
| Date | 2016-08-11 |
66

77
## Description
88

99
The ICU4C library that we use for internationalization contains a significant
1010
array of additional functionality not currently exposed by the ECMAScript 402
1111
standard. Some of this additional functionality would be useful to expose via
12-
a new `'icu'` or (`'unicode'`) module.
12+
a new `'unicode'` module.
1313

1414
## Interface
1515

16-
Initially, the `'icu'` module would provide methods for the following:
16+
Initially, the `'unicode'` module would provide methods for the following:
1717

18-
1. Character encoding detection. ICU includes code that is able to look at a
19-
stream of bytes and apply heuristics to detect the character encoding in
20-
use. This is not always an exact match but it does a reasonably good job.
21-
We can tune this detection to only look for the character encodings we
22-
support in Core (ascii, iso8859-1, utf8 and utf16-le). Two specific APIs
23-
would be exposed by the `'icu'` module for this capability:
24-
25-
```js
26-
const icu = require('icu');
27-
28-
// Detect the encoding for a given buffer or string.
29-
// Returns a string with the most likely match.
30-
icu.detectEncoding(myBuffer);
31-
icu.detectEncoding(myString);
32-
33-
// Detect the encoding for a given buffer or string.
34-
// Returns an object whose keys are the detected
35-
// encodings and whose values are a confidence value
36-
// provided by ICU. The higher the confidence value,
37-
// the better the match.
38-
const encs = icu.detectEncodings(myBuffer);
39-
console.log(encs);
40-
// Prints something like {'ascii': 90, 'utf8': 15}
41-
```
42-
43-
This mechanism is useful when working with data that might be in multiple
44-
character sets (such as filenames on Linux, or reading through multiple
45-
files in a directory).
46-
47-
```
48-
const data = getDataSomehow();
49-
const buffer = Buffer.from(data, icu.detectEncoding(data));
50-
```
51-
52-
2. One-Shot and Streaming Buffer re-encoding. ICU includes code for converting
18+
1. One-Shot and Streaming Buffer re-encoding. ICU includes code for converting
5319
from one encoding to another. This is similar to what is provided by `iconv`
54-
but it is built in to ICU4C. The `'icu'` module would include converters for
55-
*only* the character encodings directly supported by core. Developers would
56-
continue to use `iconv` or `iconv-lite` for more exotic things.
20+
but it is built in to ICU4C. The `'unicode'` module would include converters
21+
for *only* the character encodings directly supported by core. Developers
22+
would continue to use `iconv` or `iconv-lite` (or similar) for more exotic
23+
things.
5724

5825
```js
59-
const icu = require('icu');
26+
const unicode = require('unicode');
6027

6128
// One-shot conversion. Converts the entire Buffer in one go.
6229
// Assumes that the Buffer is properly aligned on UTF-8 boundaries
6330
const myBuffer = Buffer.from(getUtf8DataSomehow(), 'utf8');
64-
const newBuffer = icu.reencode(myBuffer, 'utf8', 'ucs2');
65-
31+
const newBuffer = unicode.transcode(myBuffer, 'utf8', 'ucs2');
6632

6733
// Streaming conversion
68-
const convertStream = icu.createConverter('utf8', 'ucs2');
69-
convertStream.on('data', (chunk) => {
34+
const transcodeStream = unicode.createTranscoder('utf8', 'ucs2');
35+
transcodeStream.on('data', (chunk) => {
7036
// chunk is a UTF-16 (ucs2) encoded buffer
7137
});
7238
// Writing UTF-8 data
73-
convertStream.write(getUtf8DataSomehow());
74-
```
75-
76-
Additional convenience methods would be attached to `Buffer.prototype`:
77-
78-
```
79-
const myBuffer = Buffer.from(getUtf8DataSomehow(), 'utf8');
80-
const newBuffer = myBuffer.reencode('utf8', 'ucs2');
39+
transcodeStream.write(getUtf8DataSomehow());
8140
```
8241

8342
Again, this would ONLY support the encodings for which we already have built-in
84-
support in core (ascii, iso8859-1, utf8 and utf16). This does not expand the
85-
encoding support in core so `iconv` and `iconv-lite` would still be necessary.
43+
support in core (ascii, iso8859-1, utf8 and utf16le).
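
The one-shot and streaming behavior described above can be approximated today with plain `Buffer` and `string_decoder` calls. The following is only a rough sketch: `transcodeSketch` and `createTranscoderSketch` are hypothetical names invented for illustration, they are not the proposed `'unicode'` API, and they go through an intermediate JavaScript string rather than ICU's converters.

```js
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Hypothetical stand-in for a one-shot transcode: decode the bytes with the
// source encoding, then re-encode the resulting string with the target one.
function transcodeSketch(source, fromEnc, toEnc) {
  return Buffer.from(source.toString(fromEnc), toEnc);
}

// Hypothetical stand-in for a streaming transcoder. StringDecoder holds back
// an incomplete multi-byte sequence at the end of a chunk until the rest of
// it arrives, so chunks do not need to be aligned on character boundaries.
function createTranscoderSketch(fromEnc, toEnc) {
  const decoder = new StringDecoder(fromEnc);
  return new Transform({
    transform(chunk, _encoding, callback) {
      const text = decoder.write(chunk);
      if (text.length > 0) this.push(Buffer.from(text, toEnc));
      callback();
    },
    flush(callback) {
      // Emit whatever the decoder was still holding when the input ended.
      const rest = decoder.end();
      if (rest.length > 0) this.push(Buffer.from(rest, toEnc));
      callback();
    }
  });
}
```

Unlike the one-shot form, the streaming sketch tolerates a chunk boundary that falls in the middle of a character, which is the main reason to offer both.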
8644

87-
3. UTF-8 and UTF-16 aware `codePointAt()` and `charAt()` methods for `Buffer`.
45+
2. UTF-8 and UTF-16 aware `codePointAt()` and `charAt()` methods for `Buffer`.
8846
This one is pretty straightforward. They would return either the Unicode
8947
codepoint or the character at the given byte offset even if the byte offset
9048
is not on a UTF-8 or UTF-16 lead byte. These are intended to be symmetrical
9149
with `String.prototype.codePointAt()` and `String.prototype.charAt()`.
9250

93-
```
94-
const icu = require('icu');
51+
```js
52+
const unicode = require('unicode');
9553

9654
const myBuffer = Buffer.from('a€bc', 'utf8');
9755

98-
console.log(icu.codePointAt(myBuffer, 1, 'utf8'));
99-
// or
100-
console.log(myBuffer.codePointAt(1, 'utf8'));
56+
console.log(unicode.codePointAt(myBuffer, 1, 'utf8'));
10157

102-
console.log(icu.charAt(myBuffer, 1, 'utf8'));
103-
// or
104-
console.log(myBuffer.charAt(1, 'utf8'));
58+
console.log(unicode.charAt(myBuffer, 1, 'utf8'));
10559
```
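
To make the "byte offset that is not on a lead byte" behavior concrete, here is a minimal sketch for the UTF-8 case only. The helper names `codePointAtUtf8` and `charAtUtf8` are hypothetical and are not part of the proposal or of any Node.js API; the sketch also assumes the Buffer holds well-formed UTF-8.

```js
// Hypothetical helper: returns the code point of the UTF-8 sequence that
// contains the byte at `offset`, even if `offset` points at a continuation
// byte. Assumes well-formed UTF-8 input.
function codePointAtUtf8(buf, offset) {
  let start = offset;
  // Continuation bytes have the bit pattern 10xxxxxx; step back to the lead.
  while (start > 0 && (buf[start] & 0xc0) === 0x80) start--;
  const lead = buf[start];
  // The lead byte determines how many bytes the sequence occupies.
  const length = lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
  return buf.toString('utf8', start, start + length).codePointAt(0);
}

// Hypothetical helper: the character form of the lookup above.
function charAtUtf8(buf, offset) {
  return String.fromCodePoint(codePointAtUtf8(buf, offset));
}

const sample = Buffer.from('a€bc', 'utf8'); // bytes: 61 e2 82 ac 62 63
console.log(codePointAtUtf8(sample, 1).toString(16)); // lead byte of €
console.log(charAtUtf8(sample, 2)); // continuation byte of €, same character
```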
10660

107-
4. UTF-16 and UTF-8 aware `slice()` for `Buffer`. This is similar to the
61+
3. UTF-16 and UTF-8 aware `slice()` for `Buffer`. This is similar to the
10862
existing `Buffer.prototype.slice()` except that, rather than byte offsets,
10963
the `start` and `end` are codepoint/character offsets. This would make it
11064
symmetrical with `String.prototype.slice()` but for Buffers. The advantage
11165
is that this allows the Buffer to be sliced in a way that ensures proper
11266
alignment with UTF-8 or UTF-16 encodings.
11367

114-
```
115-
const icu = require('icu');
68+
```js
69+
const unicode = require('unicode');
11670

11771
const myBuffer = Buffer.from('a€bc', 'utf8');
11872

119-
icu.slice(myBuffer, 'utf8', 1, 2); // returns a Buffer with €
120-
// or
121-
myBuffer.slice(1, 2, 'utf8'); // returns a Buffer with €
73+
unicode.slice(myBuffer, 'utf8', 1, 3); // returns a Buffer with the utf8
74+
// encoding of €b
12275
```
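
One way to picture slicing by codepoint offsets is to first map each codepoint to the byte offset of its lead byte, then slice on those bytes. The sketch below does this for UTF-8 only. `sliceUtf8` is a hypothetical name with a simplified signature (no encoding argument, no negative offsets), and it assumes well-formed input; it is not the proposed `unicode.slice()`.

```js
// Hypothetical helper: slice a UTF-8 Buffer using codepoint offsets instead
// of byte offsets. Like Buffer.prototype.slice(), the result shares memory
// with the original Buffer.
function sliceUtf8(buf, start, end) {
  // Record the byte offset of every lead byte (anything that is not a
  // 10xxxxxx continuation byte), plus a sentinel for the end of the Buffer.
  const offsets = [];
  for (let i = 0; i < buf.length; i++) {
    if ((buf[i] & 0xc0) !== 0x80) offsets.push(i);
  }
  offsets.push(buf.length);

  const count = offsets.length - 1; // number of codepoints in the Buffer
  const from = Math.min(Math.max(start, 0), count);
  const to = Math.min(Math.max(end === undefined ? count : end, from), count);
  return buf.subarray(offsets[from], offsets[to]);
}

const sliced = sliceUtf8(Buffer.from('a€bc', 'utf8'), 1, 3);
console.log(sliced.toString('utf8')); // the two codepoints at offsets 1 and 2
```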
12376

12477
*Passing in either `ascii` or `binary` would fall back to the current
