Home
- Read Joel Spolsky's Guide to Unicode and Character Sets.
- Keep in mind that all external resources (files, HTTP pages, etc.) are byte sequences, and are naturally represented in a Node.js program as Buffers or streams of Buffers, not strings.
- When you read from an external source and would like to convert data to strings:
- Be sure to know the character encoding of the data. In general, it cannot be deduced automatically.
- Provide the original Buffers as the input to the decode() function, along with the correct encoding name.
- If you already have strings at some place in your program, then decoding has already happened, likely using the 'utf-8' encoding. You cannot convert them to another encoding at this stage. You need to get the original Buffers, concat() them if needed, and pass these to iconv-lite.
- It is tricky to convert encodings when you get data as a Node stream. In these cases, use the Streaming API (e.g. iconv.decodeStream()) to make sure that the boundary cases are handled.
- When you write to an external resource:
- Decide which encoding you would like to use. The most popular and safest choice is utf-8, which is also the default in Node.
- Use the Streaming API if you work with streams.
- If you don't encode strings yourself, then Node.js will do that for you, using the default encoding.
- FYI, JavaScript strings are stored in memory as UTF-16.
- If you work with Chinese ideographs or rare characters outside the Basic Multilingual Plane, be sure to familiarize yourself with surrogate pairs. They can be a pain to work with.
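Two points from the list above, keeping the original Buffers together before decoding and the effect of surrogate pairs, can be illustrated with Node.js built-ins only. This is a minimal sketch: the sample text and the split position are made up for illustration, and Node's own UTF-8 decoding (Buffer#toString) stands in for whichever decoder you actually use. The same boundary problem applies to any multi-byte encoding.

```javascript
'use strict';

// "é" (U+00E9) takes two bytes in UTF-8 (0xC3 0xA9), so this text is 5 bytes long.
const original = Buffer.from('caf\u00e9', 'utf8');

// Pretend the data arrived in two chunks, split in the middle of "é".
const chunk1 = original.subarray(0, 4); // "caf" plus the first byte of "é"
const chunk2 = original.subarray(4);    // the second byte of "é"

// Problematic: decoding each chunk separately corrupts the split character.
const perChunk = chunk1.toString('utf8') + chunk2.toString('utf8');

// Safer: concat() the Buffers first, then decode once.
const whole = Buffer.concat([chunk1, chunk2]).toString('utf8');

console.log(perChunk === 'caf\u00e9'); // false
console.log(whole === 'caf\u00e9');    // true

// Surrogate pairs: a character outside the BMP occupies two UTF-16 code units.
const rare = '\u{20BB7}'; // a CJK ideograph outside the BMP
console.log(rare.length);             // 2 (UTF-16 code units)
console.log(Array.from(rare).length); // 1 (code point)
```

When the data arrives as a stream, collecting every chunk before decoding is not always practical, which is the case the Streaming API is meant for.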
Q: How are encoding names matched?
A: 1) They are lowercased and all non-alphanumeric characters are removed; 2) the result is used as a key in the iconv.encodings object to retrieve the codec.
Q: How do I add aliases to encodings?
A: In your project, set iconv.encodings['newalias'] = 'encoding'. The alias must be lowercase and have all non-alphanumeric characters removed.
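The matching and alias rules from the two answers above can be sketched without the library itself. In this sketch the `encodings` object is a hypothetical stand-in for iconv.encodings, `canonicalize` and `lookup` are illustrative names, and the placeholder codec objects are invented. It assumes that a string value in the table acts as an alias for another key, and it is not the library's actual implementation.

```javascript
'use strict';

// Hypothetical stand-in for iconv.encodings: a string value is treated as an
// alias for another key, anything else as the codec definition itself.
const encodings = {
  utf8: { name: 'utf8 codec (placeholder)' },
  win1251: { name: 'win1251 codec (placeholder)' },
  cp1251: 'win1251', // alias
};

// Step 1: lowercase the name, then drop all non-alphanumeric characters.
function canonicalize(name) {
  return String(name).toLowerCase().replace(/[^0-9a-z]/g, '');
}

// Step 2: use the result as a key, following aliases until a codec is found.
function lookup(name) {
  let key = canonicalize(name);
  const seen = new Set();
  while (typeof encodings[key] === 'string') {
    if (seen.has(key)) throw new Error('Alias loop at: ' + key);
    seen.add(key);
    key = encodings[key];
  }
  if (!Object.prototype.hasOwnProperty.call(encodings, key)) {
    throw new Error('Encoding not recognized: ' + name);
  }
  return encodings[key];
}

// Adding an alias: the key is already lowercase and purely alphanumeric.
encodings['mycyrillic'] = 'win1251';

console.log(canonicalize('ISO_8859-1'));                    // iso88591
console.log(lookup('My_Cyrillic') === lookup('CP-1251'));   // true
```

Because the key is normalized before the lookup, an alias stored with uppercase letters or punctuation would never be matched, which is why the alias itself has to be stored in normalized form.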
Q: How do I add a new single-byte encoding?
A: See encodings/sbcs-data.js for an example of 'maccenteuro' encoding.
Q: How do I add a new multi-byte encoding?
A: See generation/gen-dbcs.js and encodings/dbcs-data.js for how it's done. Just add sources for your encoding there. The current multi-byte codec is very versatile and should be enough for most encodings.
Q: What is the format of tables (encodings/tables/*)?
A: It is a JSON array of chunks. Each chunk represents a contiguous mapping from a multibyte encoding to Unicode. The first element of a chunk is a hexadecimal 'address': the multibyte code that corresponds to the chunk start. After that comes a mix of strings and integers. A string holds the Unicode chars that correspond to sequential multibyte codes. An integer gives the length of a run of incrementing Unicode chars that continues from the last char of the preceding string, à la RLE (run-length encoding).
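To make the chunk format concrete, here is a simplified sketch of how one chunk could be expanded into individual mappings. The function name `expandChunk` and the sample chunk are hypothetical, not taken from a real table. The sketch also assumes that multibyte codes simply increment by one within a chunk and that an integer N adds N further chars after the last one. Any special cases the real codec handles are left out.

```javascript
'use strict';

// Expand one chunk into a Map from multibyte code (number) to Unicode character.
// Simplified sketch: the address just increments by one per character.
function expandChunk(chunk) {
  const mapping = new Map();
  let addr = parseInt(chunk[0], 16); // first element: hexadecimal start address
  let lastCodePoint = null;

  for (const part of chunk.slice(1)) {
    if (typeof part === 'string') {
      // A string lists the chars for sequential multibyte codes.
      for (const ch of part) {
        lastCodePoint = ch.codePointAt(0);
        mapping.set(addr++, ch);
      }
    } else if (typeof part === 'number') {
      // An integer is a run of incrementing chars, continuing from the last char.
      if (lastCodePoint === null) throw new Error('A run needs a preceding character');
      for (let i = 0; i < part; i++) {
        lastCodePoint += 1;
        mapping.set(addr++, String.fromCodePoint(lastCodePoint));
      }
    } else {
      throw new Error('Unexpected chunk element: ' + part);
    }
  }
  return mapping;
}

// Made-up chunk: start at 0xA1A1, then "A", "B", a run of 3 (C, D, E), then "x".
const example = expandChunk(['a1a1', 'AB', 3, 'x']);
for (const [code, ch] of example) {
  console.log(code.toString(16), ch);
}
```

With this sample chunk the six codes 0xA1A1 through 0xA1A6 map to A, B, C, D, E and x, so a long alphabetical or numeric range costs a single integer in the table.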
Q: Why was this format chosen?
A: It's visual. You can easily check that the table is correct. Also, it's quite compact and easy to work with, as it's just JSON.
Q: How do I add a completely new encoding that is not reducible to multi-byte (a stateful one, for example)?
A: You'll need to write a codec for it. Please look at the examples in encodings/internal.js, encodings/sbcs-codec.js and encodings/dbcs-codec.js. Don't forget to write tests.
Q: What directories are necessary for this module to work?
A: Please look at .npmignore for directories that can be ignored. All others are necessary.