Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OS X 10.11: system calls return non-normalized unicode strings #2165

Closed
silverwind opened this issue Jul 11, 2015 · 71 comments
Closed

OS X 10.11: system calls return non-normalized unicode strings #2165

silverwind opened this issue Jul 11, 2015 · 71 comments
Labels
fs Issues and PRs related to the fs subsystem / file system. i18n-api Issues and PRs related to the i18n implementation. macos Issues and PRs related to the macOS platform / OSX. test Issues and PRs related to the tests.

Comments

@silverwind
Copy link
Contributor

test/sequential/test-chdir.js persistently fails on OS X 10.11 because the output of process.cwd() doesn't match the path we're doing process.chdir() on before.

Here's a reduced test case (npm i mkdirp hexy):

const mkdirp = require('mkdirp');
const hex = require('hexy').hexy;
const dir = __dirname + '/weird \uc3a4\uc3ab\uc3af characters \u00e1\u00e2\u00e3';

mkdirp.sync(dir);
process.chdir(dir);

console.log(hex(new Buffer(dir)));
console.log(hex(new Buffer(process.cwd())));
console.log(dir.length, dir);
console.log(process.cwd().length, process.cwd());

The strings look identical on the terminal, but the bytes differ. Here's the output:

00000000: 2f55 7365 7273 2f73 696c 7665 7277 696e  /Users/silverwin
00000010: 642f 6769 742f 6368 6469 722f 7765 6972  d/git/chdir/weir
00000020: 6420 ec8e a4ec 8eab ec8e af20 6368 6172  d.l.$l.+l./.char
00000030: 6163 7465 7273 20c3 a1c3 a2c3 a3         acters.C!C"C#

00000000: 2f55 7365 7273 2f73 696c 7665 7277 696e  /Users/silverwin
00000010: 642f 6769 742f 6368 6469 722f 7765 6972  d/git/chdir/weir
00000020: 6420 e184 8ae1 85a7 e186 abe1 848a e185  d.a..a.'a.+a..a.
00000030: a7e1 86b2 e184 8ae1 85a7 e186 b620 6368  'a.2a..a.'a.6.ch
00000040: 6172 6163 7465 7273 2061 cc81 61cc 8261  aracters.aL.aL.a
00000050: cc83                                     L.

52 '/Users/silverwind/git/chdir/weird 쎤쎫쎯 characters áâã'
61 '/Users/silverwind/git/chdir/weird 쎤쎫쎯 characters áâã'

cc: @evanlucas

@silverwind silverwind added the process Issues and PRs related to the process subsystem. label Jul 11, 2015
@silverwind
Copy link
Contributor Author

Just for comparision, could someone post the output of above script on 10.10? Maybe @Fishrock123?

@targos
Copy link
Member

targos commented Jul 12, 2015

output on 10.10.4:

00000000: 2f55 7365 7273 2f6d 7a61 7373 6f2f 7465  /Users/mzasso/te
00000010: 7374 2f77 6569 7264 20ec 8ea4 ec8e abec  st/weird.l.$l.+l
00000020: 8eaf 2063 6861 7261 6374 6572 7320 c3a1  ./.characters.C!
00000030: c3a2 c3a3                                C"C#

00000000: 2f55 7365 7273 2f6d 7a61 7373 6f2f 7465  /Users/mzasso/te
00000010: 7374 2f77 6569 7264 20ec 8ea4 ec8e abec  st/weird.l.$l.+l
00000020: 8eaf 2063 6861 7261 6374 6572 7320 c3a1  ./.characters.C!
00000030: c3a2 c3a3                                C"C#

43 '/Users/mzasso/test/weird 쎤쎫쎯 characters áâã'
43 '/Users/mzasso/test/weird 쎤쎫쎯 characters áâã'

@Fishrock123
Copy link
Contributor

@silverwind Maybe we can check what is at the code points using String.prototype.codePointAt(n) or similar? There are different ways of denoting unicode via code points, and it looks like El Capitan is expanding them. If so, that may actually be an OS bug.

@targos
Copy link
Member

targos commented Jul 12, 2015

@silverwind note that your script has a bug: you should use mkdirp.sync or it fails the first time because the directory doesn't exist yet when process.chdir is executed

@targos
Copy link
Member

targos commented Jul 12, 2015

Perhaps this can help:

const mkdirp = require('mkdirp');
const dir = __dirname + '/weird \uc3a4\uc3ab\uc3af characters \u00e1\u00e2\u00e3';

mkdirp.sync(dir);
process.chdir(dir);

function getChars(str) {
  var chars = [];
  for (var c of str) chars.push(c);
  return chars;
}

var dirC = getChars(dir);
var cwdC = getChars(process.cwd());

console.log(dirC.length);
console.log(cwdC.length);

for (var i = 0; i < dirC.length; i++) {
  if (dirC[i].codePointAt(0) !== cwdC[i].codePointAt(0))
    throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
}

console.log('strings are identical');

On 10.10, strings are identical.

@silverwind
Copy link
Contributor Author

@targos fixed the mkdirp. Here's your script's output on 10.11:

52
61

/Users/silverwind/git/chdir/codepoints.js:21
    throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
                                                               ^
Different code point at 34: 쎤 - ᄊ

@silverwind
Copy link
Contributor Author

Either the bug is in uv_cwd or (more likely) in the OS. cc: @saghul

@Fishrock123
Copy link
Contributor

@silverwind could you modify it again to make the output run through String.fromCodePoint()?

Edit: Actually, it would be more helpful do get more out than just one of the conflicts... so log instead of throw also.

@silverwind
Copy link
Contributor Author

@Fishrock123 not sure if fromCodePoint is of much help. I don't think I can feed bytes into it. For simplicy, here's just a single character compared:

const mkdirp = require('mkdirp');
const hex = require('hexy').hexy;
const dir = __dirname + '/\uc3a4';

mkdirp.sync(dir);
process.chdir(dir);

const a = dir.substring(__dirname.length);
const b = process.cwd().substring(__dirname.length);

console.log(hex(new Buffer(a)));
console.log(hex(new Buffer(b)));
console.log(a.length, a);
console.log(b.length, b);

Output:

00000000: ec8e a4                                  l.$

00000000: e184 8ae1 85a7 e186 ab                   a..a.'a.+

1 '쎤'
3 '쎤'

It goes from 3 to 9 bytes. What kind of unicode encoding is that?

@Fishrock123
Copy link
Contributor

I meant in logging the points ala

for (var i = 0; i < dirC.length; i++) {
  if (dirC[i].codePointAt(0) !== cwdC[i].codePointAt(0))
    throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
}

@silverwind
Copy link
Contributor Author

@Fishrock123 just comparing the 6 unicode characters (\uc3a4\uc3ab\uc3af\u00e1\u00e2\u00e3):

00000000: ec8e a4ec 8eab ec8e afc3 a1c3 a2c3 a3    l.$l.+l./C!C"C#

00000000: e184 8ae1 85a7 e186 abe1 848a e185 a7e1  a..a.'a.+a..a.'a
00000010: 86b2 e184 8ae1 85a7 e186 b661 cc81 61cc  .2a..a.'a.6aL.aL
00000020: 8261 cc83                                .aL.

6 '쎤쎫쎯áâã'
15 '쎤쎫쎯áâã'
Different code point at 0: 쎤 - ᄊ
Different code point at 1: 쎫 - ᅧ
Different code point at 2: 쎯 - ᆫ
Different code point at 3: á - ᄊ
Different code point at 4: â - ᅧ
Different code point at 5: ã - ᆲ

@targos
Copy link
Member

targos commented Jul 12, 2015

I don't know how chinese works but it's like if OSX decomposed the character into smaller pieces.
Look how the 3 first on the right (ᄊ, ᅧ, ᆫ) can be used together to create the first one on the left (쎤).

What happens with a character like 🚀 '\uD83D\uDE80' ?

@silverwind
Copy link
Contributor Author

@targos lol, didn't notice it actually was the three parts combined. 🚀 looks fine:

00000000: f09f 9a80                                p...

00000000: f09f 9a80                                p...

2 '🚀'
2 '🚀'

@silverwind
Copy link
Contributor Author

Noticed the same "assembling" happens on upper ascii characters like umlauts öäü, where the dots are a separate character following the ascii letter. I've seen this before on bugged webfonts, not sure what this mechanism is called.

@silverwind silverwind added the macos Issues and PRs related to the macOS platform / OSX. label Jul 13, 2015
@silverwind silverwind changed the title process: unicode issue on process.cwd() on OS X 10.11 OS X 10.11: unicode issue on process.cwd() Jul 13, 2015
@alexlamsl
Copy link

I think it's Normalization taking place:
https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

@saghul
Copy link
Member

saghul commented Jul 13, 2015

@silverwind I think @alexlamsl is right. Here is a simple example in Python:

In [25]: u1 = u'mañana'

In [26]: print u1
mañana

In [28]: u1
Out[28]: u'ma\xf1ana'

In [29]: u2 = unicodedata.normalize('NFD', u1)

In [30]: print u2
mañana

In [31]: u1 == u2
Out[31]: False

In [32]: u2
Out[32]: u'man\u0303ana'

Note how u1 and u2 "look" the same, but u2 is the decomposed canonical representation: https://en.wikipedia.org/wiki/Unicode_equivalence

FWIW, libuv just returns whatever getcwd returns: https://github.com/libuv/libuv/blob/c8eebc93a9efcdcc2913723ccd86b35498cc271f/src/unix/core.c#L636-L650

Not sure if OSX 10.11 is returning the cwd with a different normalization.

@alexlamsl
Copy link

I am trying to see if String.prototype.normalize() would be a good solution to this problem. However 1) I am on Windows and 2) I couldn't get iojs 2.3.4 to give me different results to the strings even in this example:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

So unfortunately I can't provide any further assistance, sorry...

@targos
Copy link
Member

targos commented Jul 13, 2015

@alexlamsl in theory String.prototype.normalize() could be a solution but sadly it only works if iojs is built with Intl support. Otherwise it is just a no-op.

@srl295
Copy link
Member

srl295 commented Jul 13, 2015

this tool will show the normalization differences. The "expanded" ones (form D or KD) are hangul jamo, separating the Korean character into its Consonant-Vowel-Consonant forms.

On 10.10.4 the OS is already normalizing the strings:

$ mkdir '쎤쎫쎯' '쎤쎫쎯'
mkdir: 쎤쎫쎯: File exists

But apparently they are also going to form D (probably NFKD?) in cwd in the new version

@silverwind
Copy link
Contributor Author

I've sent some feedback regarding this to Apple. Might as well be an oversight on their side, considering there were unicode changes. The fact that normalization was performed on 10.10 and earlier leads me to believe it's a bug on their side. I hope to hear back from them.

@silverwind silverwind added the confirmed-bug Issues with confirmed bugs. label Jul 13, 2015
@srl295
Copy link
Member

srl295 commented Jul 13, 2015

@silverwind I'm also asking some contacts at Apple. So far I have that "HFS+ has always been quasi NFD". It's possible this is a bugfix. (update) if "NFD" was the right form, then the longer form (as returned by 10.11) is correct.

@silverwind
Copy link
Contributor Author

So that basically means we have to include Intl in our builds (if we want to normalize it in JS-land)?

@targos
Copy link
Member

targos commented Jul 14, 2015

Can we work around it by using the canonical form in the test ?

@silverwind
Copy link
Contributor Author

@targos yes, in fact using the NFD (or NFKD) form '\u110a\u1167\u11ab\u110a\u1167\u11b2\u110a\u1167\u11b6\u0061\u0301\u0061\u0302\u0061\u0303' does work and the strings are equal.

I don't consider this a real solution as it's not consistent across platforms, and I think the right way would be to bundle Intl in releases and use string.prototype.normalize on all values returned by system calls on OS X >= 10.11.

@silverwind silverwind changed the title OS X 10.11: unicode issue on process.cwd() OS X 10.11: system calls return non-normalized unicode strings Jul 14, 2015
@bnoordhuis
Copy link
Member

Quick note, as of v3.1.0, small-icu is enabled in release binaries. If you build from source, you still need to follow the steps that @jorangreef outlined.

@bpasero
Copy link
Contributor

bpasero commented Aug 23, 2015

@jorangreef thanks

@bnoordhuis I just downloaded v3.1.0 and tried out normalize() on my testcase with readdir() and it just seems to work without changing anything (assertion passes). Can you clarify what you mean with the steps I still need to follow?

I would also like to see the sourcecode of the implementation of String.normalize() that is now being used in io.js, if someone could point me to it.

@bnoordhuis
Copy link
Member

If you downloaded the binary from iojs.org, you don't need to do anything.

I would also like to see the sourcecode of the implementation of String.normalize() that is now being used in io.js, if someone could point me to it.

It's... complicated. Start with i18n.js, runtime/runtime-i18n.cc and i18n.cc in deps/v8, in that order. It all ends in V8 calling icu::Normalizer::normalize() from ICU, though.

@bpasero
Copy link
Contributor

bpasero commented Aug 24, 2015

When using String.normalize(), can I always assume that the result I get is readable by io.js when I do a fs operation with it? We already see that on Mac I can pass a NFC or NFD form of a path to the fs APIs and both variants work. But lets assume I always convert to NFC (using String.normalize()) any path I get from the OS. Is there a chance I run into fs issues? I would assume that a string remains untouched when it cannot be normalized to NFC and thus this would not be an issue.

@bpasero
Copy link
Contributor

bpasero commented Aug 24, 2015

Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form. That also means, if any String.normalize() is being used, it must only be used on Mac because the other OS allow for both forms.

I am still a bit puzzled why Mac behaves this way. If you compare this behavior to how case-sensitive file names are handled it would mean that Mac only allows 1 casing and it would always lowercase your filename.

Am I missing something?

@bnoordhuis
Copy link
Member

Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form.

You've probably figured this out already, but to recap:

  1. Windows (or rather, NTFS) stores file names as UTF-16 without further canonicalizing them.
  2. Most Unices, including Linux, treat file names as simple byte arrays without any particular encoding.

OS X is the odd one out, both in that it normalizes file names and in that it uses NFD instead of the more common NFC (although NFD has some properties that make it attractive from a technical perspective, compared to NFC.)

@silverwind
Copy link
Contributor Author

I am still a bit puzzled why Mac behaves this way

Mostly historic reasons it seems. I'll link this Q&A again as it still seems the best resource I could find. In summary, it's a big mess and the returned form can vary based on the filesystem and even on code points in the HFS+ case.

@Fishrock123
Copy link
Contributor

Is there any disadvantage to using string.normalize() on mac? To my knowledge it should to what's correct per system, and will just return the string unmodified if node is built without Intl?

@silverwind
Copy link
Contributor Author

Yes, string.normalize should be a noop without Intl.

Remember thought, that for comparison to work, both strings must be in the same form. If we decide to normalize all file names returned by the system to NFC, there's still the possibilty that the user tries to compare it to NFD obtained through other means, which would fail.

Maybe the best course of action is to advice to do str1.normalize() === str2.normalize() when comparing any unicode strings, regardless of origin.

@jorangreef
Copy link
Contributor

If we decide to normalize all file names returned by the system to NFC

This would not be correct, i.e. the same as making fs.readFile normalize CRLF line endings to LF because people are not aware of the difference. Filenames should be treated as data.

Maybe the best course of action is to advice to do str1.normalize() === str2.normalize() when comparing any unicode strings, regardless of origin.

Spot on. Although, when comparing file system strings, one must take into account whether the file system preserves form or not. e.g. When comparing filenames on Windows or Linux (or on any form-preserving filesystem), do not use normalize to compare, use ===. And on HFS+ or any non-form-preserving filesystem use normalize('NFD') to compare. Windows or Linux will preserve and return NFC or NFD as per Ben's comment above so using normalize on those filesystems would be wrong as two similar looking files could exist in different Unicode forms.

@jorangreef
Copy link
Contributor

Is there any disadvantage to using string.normalize() on mac? To my knowledge it should to what's correct per system, and will just return the string unmodified if node is built without Intl?

@Fishrock123 this would fix the test where normalize works. But to fix the test to work where there is no ICU, it would be better to just change the test to create a NFD form of the directory to start with. That will work on all OS X versions.

@jorangreef
Copy link
Contributor

Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form. That also means, if any String.normalize() is being used, it must only be used on Mac because the other OS allow for both forms.

I am still a bit puzzled why Mac behaves this way. If you compare this behavior to how case-sensitive file names are handled it would mean that Mac only allows 1 casing and it would always lowercase your filename.

Am I missing something?

That's exactly right. Apple may have thought that some users would be confused by different normalization forms, and they also wanted to be form-insensitive, so instead of preserving the form, they just convert abc to ABC and lose the form in the process. Meanwhile, the other file-systems preserve form and are sensitive to form.

One can think of it exactly as dealing with a non-case-preserving file-system.

@silverwind
Copy link
Contributor Author

One can think of it exactly as dealing with a non-case-preserving file-system.

I think a better comparision would be when you think of your data as case-insensitve (you wouldn't use uppercase for files in your project, would you?), and the OS starts to return uppercase for certain letters based on the characters used. It's a mess in my eyes.

@silverwind
Copy link
Contributor Author

WIP fix for this test: #3007
Documentation suggestion: nodejs/docs#42

@silverwind
Copy link
Contributor Author

Fixed the test in 81e98e9. Docs regarding this will follow - see nodejs/docs#42.

silverwind added a commit that referenced this issue Oct 8, 2015
OS X 10.11 changed the unicode normalization form of certain code points
returned by system calls like getcwd() from NFC to NFD which made
results in this test failing.

The consensus of #2165 is to delegate
the task of unicode normalization to the user, and work will continue to
document how to handle unicode in a form-sensitive file system.

PR-URL: #3007
Fixes: #2165
Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
@jorangreef
Copy link
Contributor

A guide is now up that explains how to work with filesystems that do Unicode normalization: https://nodejs.org/en/docs/guides/working-with-different-filesystems/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fs Issues and PRs related to the fs subsystem / file system. i18n-api Issues and PRs related to the i18n implementation. macos Issues and PRs related to the macOS platform / OSX. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

9 participants