|        |            |
|--------|------------|
| Author | @jasnell   |
| Status | DRAFT      |
| Date   | 2016-08-11 |

## Description

The ICU4C library that we use for internationalization contains a significant
array of additional functionality not currently exposed by the ECMAScript 402
standard. Some of this additional functionality would be useful to expose via
a new `'unicode'` module.

## Interface

Initially, the `'unicode'` module would provide methods for the following:

1. One-Shot and Streaming Buffer re-encoding. ICU includes code for converting
   from one encoding to another. This is similar to what is provided by `iconv`
   but it is built in to ICU4C. The `'unicode'` module would include converters
   for *only* the character encodings directly supported by core. Developers
   would continue to use `iconv` or `iconv-lite` (or similar) for more exotic
   things.

```js
const unicode = require('unicode');

// One-shot conversion. Converts the entire Buffer in one go.
// Assumes that the Buffer is properly aligned on UTF-8 boundaries
const myBuffer = Buffer.from(getUtf8DataSomehow(), 'utf8');
const newBuffer = unicode.transcode(myBuffer, 'utf8', 'ucs2');

// Streaming conversion
const transcodeStream = unicode.createTranscoder('utf8', 'ucs2');
transcodeStream.on('data', (chunk) => {
  // chunk is a UTF-16 (ucs2) encoded buffer
});
// Writing UTF-8 data
transcodeStream.write(getUtf8DataSomehow());
```
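
The one-shot behavior can be sketched with today's Buffer APIs: decode with the source encoding, then re-encode with the target. This is an illustration only; the `transcode` name and the String round-trip strategy are assumptions, and a real implementation would use ICU's converters directly.

```js
// Illustrative sketch only: emulates one-shot transcoding by decoding
// with the source encoding and re-encoding with the target encoding.
// A real implementation would use ICU converters, not a String round-trip.
function transcode(source, fromEnc, toEnc) {
  return Buffer.from(source.toString(fromEnc), toEnc);
}

const utf8Buf = Buffer.from('a€bc', 'utf8');        // 6 bytes in UTF-8
const ucs2Buf = transcode(utf8Buf, 'utf8', 'ucs2'); // 8 bytes in UCS-2
```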

Again, this would ONLY support the encodings for which we already have built-in
support in core (ascii, iso8859-1, utf8 and utf16le).

2. UTF-8 and UTF-16 aware `codePointAt()` and `charAt()` methods for `Buffer`.
   This one is pretty straightforward. They would return either the Unicode
   codepoint or the character at the given byte offset even if the byte offset
   is not on a UTF-8 or UTF-16 lead byte. These are intended to be symmetrical
   with `String.prototype.codePointAt()` and `String.prototype.charAt()`.

```js
const unicode = require('unicode');

const myBuffer = Buffer.from('a€bc', 'utf8');

console.log(unicode.codePointAt(myBuffer, 1, 'utf8'));

console.log(unicode.charAt(myBuffer, 1, 'utf8'));
```
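
The lead-byte handling can be sketched in plain JavaScript for the UTF-8 case (an illustration, not the proposed implementation): if the byte offset lands on a continuation byte, back up to the nearest lead byte, then decode that full sequence.

```js
// Illustrative sketch of a UTF-8 aware codePointAt (assumes well-formed
// input). Continuation bytes have the bit pattern 10xxxxxx, so back up
// until a lead byte is found, then accumulate 6 bits per trailing byte.
function codePointAt(buf, index) {
  let i = index;
  while (i > 0 && (buf[i] & 0xC0) === 0x80) i--;  // skip back over continuation bytes
  const lead = buf[i];
  if (lead < 0x80) return lead;                    // 1-byte sequence (ASCII)
  let extra;
  let cp;
  if ((lead & 0xE0) === 0xC0) { extra = 1; cp = lead & 0x1F; }       // 2-byte
  else if ((lead & 0xF0) === 0xE0) { extra = 2; cp = lead & 0x0F; }  // 3-byte
  else { extra = 3; cp = lead & 0x07; }                              // 4-byte
  for (let k = 1; k <= extra; k++) cp = (cp << 6) | (buf[i + k] & 0x3F);
  return cp;
}

const buf = Buffer.from('a€bc', 'utf8');
```

Offsets 1, 2, and 3 all resolve to U+20AC here, since '€' occupies bytes 1-3 of the buffer.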

3. UTF-16 and UTF-8 aware `slice()` for `Buffer`. This is similar to the
   existing `Buffer.prototype.slice()` except that, rather than byte offsets,
   the `start` and `end` are codepoint/character offsets. This would make it
   symmetrical with `String.prototype.slice()` but for Buffers. The advantage
   is that this allows the Buffer to be sliced in a way that ensures proper
   alignment with UTF-8 or UTF-16 encodings.

```js
const unicode = require('unicode');

const myBuffer = Buffer.from('a€bc', 'utf8');

unicode.slice(myBuffer, 'utf8', 1, 3); // returns a Buffer with the utf8
                                       // encoding of €b
```
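
The alignment guarantee can be illustrated by slicing on decoded code points rather than bytes (a sketch only, with an assumed helper name; a real implementation would walk the byte sequence without a String round-trip):

```js
// Illustrative sketch of a codepoint-aware slice: decode, slice by
// code points, re-encode. Array.from iterates a string by code points,
// so astral characters count as one unit rather than two.
function sliceByCodePoints(buf, encoding, start, end) {
  const codePoints = Array.from(buf.toString(encoding));
  return Buffer.from(codePoints.slice(start, end).join(''), encoding);
}

const buf = Buffer.from('a€bc', 'utf8');
const sliced = sliceByCodePoints(buf, 'utf8', 1, 3); // the bytes for '€b'
```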

*Passing in either `ascii` or `binary` would fallback to the current