Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: support different path name encodings #3519

Closed
bnoordhuis opened this issue Oct 25, 2015 · 12 comments
Closed

fs: support different path name encodings #3519

bnoordhuis opened this issue Oct 25, 2015 · 12 comments
Labels
discuss Issues opened for discussions and feedbacks. fs Issues and PRs related to the fs subsystem / file system.

Comments

@bnoordhuis
Copy link
Member

Continuing from #3401, it's clear that the way node.js handles path name encodings is sub-optimal. What is not clear is how to fix it. This issue is for discussing possible solutions.

A quick recap of the current situation:

  • node.js assumes UTF-8 in most - but not all - places.
  • UTF-8 is fine on Windows. Libuv converts UTF-8 to and from UTF-16, which is what the kernel expects.
  • UTF-8 is common but not universal on UNIX systems. Most file systems are character set agnostic, encodings are normally by convention. OS X's HFS+ is the most common exception to the rule.

Considerations:

  • Conversions should be zero-byte safe because most C APIs operate on zero-terminated strings.
  • JS strings are conceptually always UTF-16 but V8 accepts ISO-8859-1, UTF-8 and UTF-16 as input.
  • Conversion (to JS string) from ISO-8859-1 is lossless but conversion from UTF-8 and UTF-16 is not: invalid byte sequences are replaced with U+FFFD.
  • Inversely, conversion to UTF-8 and UTF-16 is lossless but conversion to ISO-8859-1 is not: out-of-range characters wrap around - which can be insecure, see the bullet point about C APIs.
@bnoordhuis bnoordhuis added the fs Issues and PRs related to the fs subsystem / file system. label Oct 25, 2015
@bnoordhuis
Copy link
Member Author

Two possible solutions:

  1. Assume UTF-8 unconditionally. May be reasonable but we probably still need special casing for HFS+. It normalizes file names to NFD. Without intervention the file name you read back may not be what you created.
  2. Assume UTF-8 but let the user override the encoding. This is what we do elsewhere (crypto, http, streams, etc.) so there is precedent. Users can deal with more obscure encodings like CP-1252 by encoding to buffer and post-processing it with iconv or iconv-lite.

@mscdex mscdex added the discuss Issues opened for discussions and feedbacks. label Oct 25, 2015
@Fishrock123
Copy link
Contributor

@bnoordhuis what are some downsides of the proposed solutions?

@bnoordhuis
Copy link
Member Author

  1. Assume UTF-8 unconditionally. Does the wrong thing when the encoding isn't UTF-8.
  2. Assume UTF-8 but let the user override the encoding. How would the user know what the correct encoding is? Not solvable in general but at least it's flexible.

With either solution we could add an environment variable or command line flag to override the default encoding.

@jorangreef
Copy link
Contributor

Regarding the two options:

  1. Assuming UTF-8 should not necessitate a special case for HFS+ after decoding to UTF-8. HFS+ NFD form should come through fine when decoding as UTF-8 and should always be passed on to the user as is, never normalized by Node, for the reasons given here: https://github.com/nodejs/new.nodejs.org/blob/master/locale/en/docs/guides/working-with-different-filesystems.md
  2. This would be ideal.

I don't think there should be a global environment variable or command line flag to override the encoding (it's tempting), because encodings are all relative to the filesystem in play at the mounted subtree the user happens to be working with. For example, a user may use a single Node process to work with multiple different filesystems mounted in /Volumes all with different encodings for example. The encoding override should just be an optional functional argument (e.g. as for fs.writeFile or fs.readFile etc.).

@bnoordhuis
Copy link
Member Author

NFD form should come through fine when decoding as UTF-8 and should always be passed on to the user as is

I'm ambivalent. The NFC/NFD dichotomy is confusing IMO in that the file you create with fs.open() can end up having a different name when you read it back with fs.readdir().

https://github.com/nodejs/new.nodejs.org/blob/master/locale/en/docs/guides/working-with-different-filesystems.md

Nice write-up!

I don't think there should be a global environment variable or command line flag to override the encoding (it's tempting), because encodings are all relative to the filesystem in play at the mounted subtree the user happens to be working with.

I don't exactly disagree but there is an (IMO reasonable) case to be made for common case convenience if there is going to be a default encoding anyway. It's something we can tackle later though.

@jorangreef
Copy link
Contributor

Thanks Ben!

The NFC/NFD forms are surprising (hopefully the guide will help with that!) but I think it's not technically possible for Node to try and fix HFS+ now, and if Node tried it would be repeating the same mistake HFS+ made (implementing form-insensitivity by sacrificing form-preservation). It would also be equally confusing if Node normalized HFS+ NFD to NFC and users called ls via child_process.exec only to see different filenames. In any event, even if Node wanted to, the roundtrip from HFS+ NFD to NFC to NFD is not lossless because the standard has advanced (Node would have to use the same normalization table baked into HFS+). So I don't think we should bundle it here with encoding choice. Better for users to learn about different Unicode forms and how to compare them insensitively without sacrificing form preservation in the process.

@seishun
Copy link
Contributor

seishun commented Oct 26, 2015

node.js assumes UTF-8 in most - but not all - places.

This is somewhat misleading. UTF-8 is assumed everywhere in fs except one place, where Latin-1 is erroneously assumed.

Without intervention the file name you read back may not be what you created.

You might have already figured it out, but just in case anyone else is confused: it's not possible to make you sure read back the same filename you created. HFS+ normalizes file names when they are created, and it's a lossy conversion. For example, both "도시락" and "도시락" become "도시락", and one can't know afterwards which one was used originally. (The strings all look the same because normalization preserves the "look" of the string, duh. Try comparing their lengths in a JS console.)

Users can deal with more obscure encodings like CP-1252 by encoding to buffer and post-processing it with iconv or iconv-lite.

Has an actual user requested this, or is this pure theory?

@piscisaureus
Copy link
Contributor

Note that both Windows (and Javascript too, for that matter) technically use UCS2 and not UTF16. That means that not all valid Windows filenames are expressible as UTF8.

If we are going to fix this "propertly" libuv should probably use WTF-8 on Windows.

@rvagg
Copy link
Member

rvagg commented Oct 29, 2015

How does this sound for the "Known issues":

  • Unicode characters in filesystem paths are not handled consistently across platforms or Node.js APIs. See #2088, #3401 and #3519.

@bnoordhuis?

@bnoordhuis
Copy link
Member Author

@rvagg Looks good.

@Mithgol
Copy link

Mithgol commented Nov 19, 2015

@piscisaureus Even if libuv starts using WTF-8 on Windows (instead of UTF-8), most people won't notice. Unpaired surrogates are rarities.

jasnell added a commit to jasnell/node that referenced this issue Mar 25, 2016
This makes several changes:

1. Allow path/filename to be passed in as a Buffer on fs methods
2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink,
   fs.readlinkSync and fs.watch.
3. Documentation updates

For 1... it's now possible to do:

```js
fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { });
```

For 2...
```js
fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { });

fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { });
```

encoding can also be passed as a string

```js
fs.readdir('/fs/foo/bar', 'hex', (err,list) => { });
```

The default encoding is set to UTF8 so this addresses the
discrepency that existed previously between fs.readdir and
fs.watch handling filenames differently.

Fixes: nodejs#2088
Refs: nodejs#3519
Alternate: nodejs#3401
jasnell added a commit that referenced this issue Mar 25, 2016
This makes several changes:

1. Allow path/filename to be passed in as a Buffer on fs methods
2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink,
   fs.readlinkSync and fs.watch.
3. Documentation updates

For 1... it's now possible to do:

```js
fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { });
```

For 2...
```js
fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { });

fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { });
```

encoding can also be passed as a string

```js
fs.readdir('/fs/foo/bar', 'hex', (err,list) => { });
```

The default encoding is set to UTF8 so this addresses the
discrepency that existed previously between fs.readdir and
fs.watch handling filenames differently.

Fixes: #2088
Refs: #3519
PR-URL: #5616
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
@jasnell
Copy link
Member

jasnell commented Apr 3, 2016

Resolved now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issues opened for discussions and feedbacks. fs Issues and PRs related to the fs subsystem / file system.
Projects
None yet
Development

No branches or pull requests

9 participants