fs: support different path name encodings #3519

bnoordhuis · 2015-10-25T22:04:14Z

Continuing from #3401, it's clear that the way node.js handles path name encodings is sub-optimal. What is not clear is how to fix it. This issue is for discussing possible solutions.

A quick recap of the current situation:

- node.js assumes UTF-8 in most - but not all - places.
- UTF-8 is fine on Windows. libuv converts UTF-8 to and from UTF-16, which is what the kernel expects.
- UTF-8 is common but not universal on UNIX systems. Most file systems are character set agnostic; encodings are normally a matter of convention. OS X's HFS+ is the most common exception to the rule.

Considerations:

- Conversions should be zero-byte safe because most C APIs operate on zero-terminated strings.
- JS strings are conceptually always UTF-16 but V8 accepts ISO-8859-1, UTF-8 and UTF-16 as input.
- Conversion (to JS string) from ISO-8859-1 is lossless but conversion from UTF-8 and UTF-16 is not: invalid byte sequences are replaced with U+FFFD.
- Inversely, conversion to UTF-8 and UTF-16 is lossless but conversion to ISO-8859-1 is not: out-of-range characters wrap around - which can be insecure, see the bullet point about C APIs.

bnoordhuis · 2015-10-25T22:27:05Z

Two possible solutions:

1. Assume UTF-8 unconditionally. May be reasonable but we probably still need special casing for HFS+. It normalizes file names to NFD. Without intervention the file name you read back may not be what you created.
2. Assume UTF-8 but let the user override the encoding. This is what we do elsewhere (crypto, http, streams, etc.) so there is precedent. Users can deal with more obscure encodings like CP-1252 by encoding to buffer and post-processing it with iconv or iconv-lite.

Fishrock123 · 2015-10-25T23:19:27Z

@bnoordhuis what are some downsides of the proposed solutions?

bnoordhuis · 2015-10-26T06:03:32Z

1. Assume UTF-8 unconditionally: does the wrong thing when the encoding isn't UTF-8.
2. Assume UTF-8 but let the user override the encoding: how would the user know what the correct encoding is? Not solvable in general, but at least it's flexible.

With either solution we could add an environment variable or command line flag to override the default encoding.

jorangreef · 2015-10-26T06:27:44Z

Regarding the two options:

Assuming UTF-8 should not necessitate a special case for HFS+ after decoding to UTF-8. HFS+ NFD form should come through fine when decoding as UTF-8 and should always be passed on to the user as is, never normalized by Node, for the reasons given here: https://github.com/nodejs/new.nodejs.org/blob/master/locale/en/docs/guides/working-with-different-filesystems.md
This would be ideal.

I don't think there should be a global environment variable or command line flag to override the encoding (it's tempting), because encodings are all relative to the filesystem in play at the mounted subtree the user happens to be working with. For example, a user may use a single Node process to work with multiple different filesystems mounted in /Volumes, all with different encodings. The encoding override should just be an optional function argument (e.g. as for fs.writeFile or fs.readFile etc.).

bnoordhuis · 2015-10-26T07:09:03Z

> NFD form should come through fine when decoding as UTF-8 and should always be passed on to the user as is

I'm ambivalent. The NFC/NFD dichotomy is confusing IMO in that the file you create with fs.open() can end up having a different name when you read it back with fs.readdir().

> https://github.com/nodejs/new.nodejs.org/blob/master/locale/en/docs/guides/working-with-different-filesystems.md

Nice write-up!

> I don't think there should be a global environment variable or command line flag to override the encoding (it's tempting), because encodings are all relative to the filesystem in play at the mounted subtree the user happens to be working with.

I don't exactly disagree but there is an (IMO reasonable) case to be made for common case convenience if there is going to be a default encoding anyway. It's something we can tackle later though.

jorangreef · 2015-10-26T07:36:41Z

Thanks Ben!

The NFC/NFD forms are surprising (hopefully the guide will help with that!) but I think it's not technically possible for Node to try and fix HFS+ now, and if Node tried it would be repeating the same mistake HFS+ made (implementing form-insensitivity by sacrificing form-preservation). It would also be equally confusing if Node normalized HFS+ NFD to NFC and users called ls via child_process.exec only to see different filenames. In any event, even if Node wanted to, the roundtrip from HFS+ NFD to NFC to NFD is not lossless because the standard has advanced (Node would have to use the same normalization table baked into HFS+). So I don't think we should bundle it here with encoding choice. Better for users to learn about different Unicode forms and how to compare them insensitively without sacrificing form preservation in the process.

seishun · 2015-10-26T17:30:40Z

> node.js assumes UTF-8 in most - but not all - places.

This is somewhat misleading. UTF-8 is assumed everywhere in fs except one place, where Latin-1 is erroneously assumed.

> Without intervention the file name you read back may not be what you created.

You might have already figured it out, but just in case anyone else is confused: it's not possible to make you sure read back the same filename you created. HFS+ normalizes file names when they are created, and it's a lossy conversion. For example, both "도시락" and "도시락" become "도시락", and one can't know afterwards which one was used originally. (The strings all look the same because normalization preserves the "look" of the string, duh. Try comparing their lengths in a JS console.)

> Users can deal with more obscure encodings like CP-1252 by encoding to buffer and post-processing it with iconv or iconv-lite.

Has an actual user requested this, or is this pure theory?

piscisaureus · 2015-10-28T20:37:26Z

Note that Windows (and JavaScript too, for that matter) technically uses UCS-2 and not UTF-16. That means that not all valid Windows filenames are expressible as UTF-8.

If we are going to fix this "properly", libuv should probably use WTF-8 on Windows.

rvagg · 2015-10-29T02:48:11Z

How does this sound for the "Known issues":

Unicode characters in filesystem paths are not handled consistently across platforms or Node.js APIs. See #2088, #3401 and #3519.

@bnoordhuis?

bnoordhuis · 2015-10-29T11:53:52Z

@rvagg Looks good.

Mithgol · 2015-11-19T08:24:41Z

@piscisaureus Even if libuv starts using WTF-8 on Windows (instead of UTF-8), most people won't notice. Unpaired surrogates are rarities.

This makes several changes:

1. Allow path/filename to be passed in as a Buffer on fs methods
2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink, fs.readlinkSync and fs.watch.
3. Documentation updates

For 1... it's now possible to do:

```js
fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { });
```

For 2...

```js
fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { });
fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { });
```

encoding can also be passed as a string

```js
fs.readdir('/fs/foo/bar', 'hex', (err,list) => { });
```

The default encoding is set to UTF8 so this addresses the discrepancy that existed previously between fs.readdir and fs.watch handling filenames differently.

Fixes: nodejs#2088
Refs: nodejs#3519
Alternate: nodejs#3401

This makes several changes:

1. Allow path/filename to be passed in as a Buffer on fs methods
2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink, fs.readlinkSync and fs.watch.
3. Documentation updates

For 1... it's now possible to do:

```js
fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { });
```

For 2...

```js
fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { });
fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { });
```

encoding can also be passed as a string

```js
fs.readdir('/fs/foo/bar', 'hex', (err,list) => { });
```

The default encoding is set to UTF8 so this addresses the discrepancy that existed previously between fs.readdir and fs.watch handling filenames differently.

Fixes: #2088
Refs: #3519
PR-URL: #5616
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Trevor Norris <trev.norris@gmail.com>

jasnell · 2016-04-03T20:53:24Z

Resolved now.

bnoordhuis added the fs Issues and PRs related to the fs subsystem / file system. label Oct 25, 2015

bnoordhuis mentioned this issue Oct 25, 2015

fs: decode filenames using UTF-8 in fs.watch #3401

Closed

mscdex added the discuss Issues opened for discussions and feedbacks. label Oct 25, 2015

srl295 mentioned this issue Oct 27, 2015

Investigate FS and other codepage dependencies nodejs/Intl#25

Closed

rvagg mentioned this issue Oct 29, 2015

Release proposal: v5.0.0 #3466

Merged

Fishrock123 mentioned this issue Nov 11, 2015

Propose v5.1.0 #3736

Merged

MylesBorins mentioned this issue Jan 11, 2016

V5.4.1 propose #4626

Merged

jasnell mentioned this issue Mar 9, 2016

fs: add Buffer support in fs methods #5616

Closed

jasnell closed this as completed Apr 3, 2016

bnoordhuis mentioned this issue Apr 12, 2016

fs.watch filename incorrect #6163

Closed
