Skip to content
This repository has been archived by the owner on Aug 17, 2019. It is now read-only.

fix: strip BOM when reading JSON files #8

Merged
merged 4 commits into from
Sep 6, 2017
Merged

fix: strip BOM when reading JSON files #8

merged 4 commits into from
Sep 6, 2017

Conversation

rmg
Copy link
Contributor

@rmg rmg commented Sep 2, 2017

fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.

@rmg
Copy link
Contributor Author

rmg commented Sep 2, 2017

Rebased after #6 was merged.

@rmg
Copy link
Contributor Author

rmg commented Sep 2, 2017

I don't have a large sample size, but it looks like among my local cache, the problem file is a singular occurrence:

$ find ~/.npm/_cacache/content-v2/sha1 -type file | while read f; do \
   tar -C ~/deleteme/bom --strip-components 1 -xUzf $f package/package.json \
    && file ~/deleteme/bom/package.json; \
  done | cut -d: -f2 | sort | uniq -c
2810  ASCII text
  56  ASCII text, with CRLF line terminators
   7  ASCII text, with very long lines
   1  ASCII text, with very long lines, with no line terminators
   1  HTML document text, ASCII text
   1  UTF-8 Unicode (with BOM) text
 170  UTF-8 Unicode text
   1  UTF-8 Unicode text, with CRLF line terminators
   2  UTF-8 Unicode text, with very long lines

@rmg
Copy link
Contributor Author

rmg commented Sep 3, 2017

Went hunting to see how npm does this, in case it has a clever way of handling it.. Turns out it is basically doing the same thing: https://github.com/npm/npm/blob/latest/lib/utils/parse-json.js

if (content.charCodeAt(0) === 0xFEFF) {
  content = content.slice(1)
}

@rmg
Copy link
Contributor Author

rmg commented Sep 3, 2017

@zkat if this feels like deja vu.. npm/npm#8724

@rmg rmg force-pushed the stripBOM branch 2 times, most recently from 66a3363 to 6fb3db0 Compare September 3, 2017 02:59
@zkat
Copy link
Owner

zkat commented Sep 5, 2017

This LGTM.

Can we get a test for it, too?

Copy link
Owner

@zkat zkat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but needs a test (for all of npm-shrinkwrap.json, package-lock.json, and package.json, imo)

@rmg
Copy link
Contributor Author

rmg commented Sep 5, 2017

Can we get a test for it, too?

Incoming. I skipped it on the first pass because every time I looked at failing file with vim it "helpfully" stripped the BOM for me.

Looks good, but needs a test (for all of npm-shrinkwrap.json, package-lock.json, and package.json, imo)

Is it really necessary to test 3 different filenames in a unit test for readJson?

fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
@rmg
Copy link
Contributor Author

rmg commented Sep 5, 2017

@zkat I've rebased and added a unit test for for readJson.

@rmg rmg mentioned this pull request Sep 5, 2017
@zkat zkat merged commit 2529149 into zkat:latest Sep 6, 2017
@rmg rmg deleted the stripBOM branch September 6, 2017 15:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants