Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip BOM when reading UTF-8 files #21

Merged
merged 1 commit into from
May 25, 2014
Merged

Strip BOM when reading UTF-8 files #21

merged 1 commit into from
May 25, 2014

Conversation

mathiasbynens
Copy link
Contributor

When using gulp-concat, I noticed that the BOM wasn’t stripped from files. When concatenating files together there would be multiple BOMs in the resulting file, which only leads to increased file size for no good reason. (Of course, people shouldn’t use UTF-8 with BOM in the first place, but that’s a separate issue…)

Let’s do what Grunt does and strip BOM by default. Grunt has a setting to override it but IMHO this is not needed. https://github.com/gruntjs/grunt/blob/057eb1c3a7ffdb26b6e2e1a5eb67b262cdd3a71d/lib/grunt/file.js#L238-L241

Thanks to @phated for the pointer to vinyl-fs on IRC.

@mathiasbynens
Copy link
Contributor Author

The point of this PR is to get the discussion going. I’ll add tests etc. if you agree this is a useful addition.

@sindresorhus
Copy link

👍

Could also use https://github.com/sindresorhus/strip-bom (it also supports buffers), but no worries if not.

@mathiasbynens
Copy link
Contributor Author

@sindresorhus Should’ve known you had a package for that! Updated the PR.

I’ve now added a test, too.

Note to self: easy way to create a “UTF-8 with BOM” file for testing:

$ echo 'This file is saved as UTF-8 with BOM.' > test/fixtures/bom.txt
$ vim -e -s -c ':set bomb' -c ':wq' test/fixtures/bom.txt

@sindresorhus
Copy link

@mathiasbynens or just $ uconv --add-signature test.md >! test.md

@mathiasbynens
Copy link
Contributor Author

We should probably only strip any BOM if the input if it’s in UTF-8 – else it becomes impossible for people to use any encodings that rely on the BOM (not that they should…). Or, we could add that preserveBOM setting after all.

@sindresorhus
Copy link

It definitely should only strip BOM from utf8. As far as I know Node only supports reading in files as buffer, utf8 or ascii, not utf16, so not sure how it would ever be a problem?

No, to preserveBOM though. It should just strip BOM from utf8 as it's moot and preserve others.

@mathiasbynens
Copy link
Contributor Author

As far as I know Node only supports reading in files as buffer, utf8 or ascii, not utf16, so not sure how it would ever be a problem?

If a Big-Endian UTF-16-encoded file starts with U+EFBB followed by U+BF00 (or any other code point in the range from U+BF00 to U+BFFF) then the first three bytes in the buffer would be 0xEF 0xBB 0xBF and strip-bom would remove them, not only causing causing data loss but effectively mangling the rest of the buffer.

Since the code point U+EFBB has not been assigned a symbol this specific example is definitely an edge case. But my point is that I shouldn’t just assume that 0xEF 0xBB 0xBF can be stripped in every possible encoding, as it’s not necessarily a BOM.

@sindresorhus
Copy link

Is there anything strip-bom can do to protect against that?

@mathiasbynens
Copy link
Contributor Author

strip-bom is intended to be used for UTF-8 encoded files (right?) so it’s fine. If we want to make strip-bom work for any file, we’d need to detect if if it’s a UTF-8-encoded file somehow, and only strip the BOM in that case.

@sindresorhus
Copy link

Hmm, all the peoples on the internet says to do it like that though...

@sindresorhus
Copy link

Well, yes, but would be nice if consumers didn't have to make sure it's utf8 themselves.

@mathiasbynens
Copy link
Contributor Author

I just meant that you’ll probably want to create a separate is-utf8 package and then make strip-bom depend on it. It seems like we agree here. 👍

sindresorhus added a commit to sindresorhus/strip-bom that referenced this pull request May 12, 2014
@mathiasbynens
Copy link
Contributor Author

Updated this PR to use the new version of strip-bom, which only affects UTF-8 strings and buffers. 👍

For the test files, I took a valid UTF-16-BE/LE files and prepended 0xEF 0xBB 0xBF (i.e. a UTF-8-encoded BOM) to them using a hex editor. This is what they look like:

$ hexdump -C test/fixtures/bom-utf16be.txt
00000000  ef bb bf 00 00 54 00 68  00 69 00 73 00 20 00 66  |.....T.h.i.s. .f|
00000010  00 69 00 6c 00 65 00 20  00 69 00 73 00 20 00 73  |.i.l.e. .i.s. .s|
00000020  00 61 00 76 00 65 00 64  00 20 00 61 00 73 00 20  |.a.v.e.d. .a.s. |
00000030  00 55 00 54 00 46 00 2d  00 31 00 36 00 2d 00 42  |.U.T.F.-.1.6.-.B|
00000040  00 45 00 2e 00 20 00 49  00 74 00 20 00 63 00 6f  |.E... .I.t. .c.o|
00000050  00 6e 00 74 00 61 00 69  00 6e 00 73 00 20 00 73  |.n.t.a.i.n.s. .s|
00000060  00 6f 00 6d 00 65 00 20  00 67 00 61 00 72 00 62  |.o.m.e. .g.a.r.b|
00000070  00 61 00 67 00 65 00 20  00 61 00 74 00 20 00 74  |.a.g.e. .a.t. .t|
00000080  00 68 00 65 00 20 00 73  00 74 00 61 00 72 00 74  |.h.e. .s.t.a.r.t|
00000090  00 20 00 74 00 68 00 61  00 74 00 20 00 6c 00 6f  |. .t.h.a.t. .l.o|
000000a0  00 6f 00 6b 00 73 00 20  00 6c 00 69 00 6b 00 65  |.o.k.s. .l.i.k.e|
000000b0  00 20 00 61 00 20 00 55  00 54 00 46 00 2d 00 38  |. .a. .U.T.F.-.8|
000000c0  00 2d 00 65 00 6e 00 63  00 6f 00 64 00 65 00 64  |.-.e.n.c.o.d.e.d|
000000d0  00 20 00 42 00 4f 00 4d  00 20 00 28 00 62 00 75  |. .B.O.M. .(.b.u|
000000e0  00 74 00 20 00 69 00 73  00 6e 20 19 00 74 00 29  |.t. .i.s.n ..t.)|
000000f0  00 2e 00 0a                                       |....|
000000f4

$ hexdump -C test/fixtures/bom-utf16le.txt
00000000  ef bb bf 00 54 00 68 00  69 00 73 00 20 00 66 00  |....T.h.i.s. .f.|
00000010  69 00 6c 00 65 00 20 00  69 00 73 00 20 00 73 00  |i.l.e. .i.s. .s.|
00000020  61 00 76 00 65 00 64 00  20 00 61 00 73 00 20 00  |a.v.e.d. .a.s. .|
00000030  55 00 54 00 46 00 2d 00  31 00 36 00 2d 00 4c 00  |U.T.F.-.1.6.-.L.|
00000040  45 00 2e 00 20 00 49 00  74 00 20 00 63 00 6f 00  |E... .I.t. .c.o.|
00000050  6e 00 74 00 61 00 69 00  6e 00 73 00 20 00 73 00  |n.t.a.i.n.s. .s.|
00000060  6f 00 6d 00 65 00 20 00  67 00 61 00 72 00 62 00  |o.m.e. .g.a.r.b.|
00000070  61 00 67 00 65 00 20 00  61 00 74 00 20 00 74 00  |a.g.e. .a.t. .t.|
00000080  68 00 65 00 20 00 73 00  74 00 61 00 72 00 74 00  |h.e. .s.t.a.r.t.|
00000090  20 00 74 00 68 00 61 00  74 00 20 00 6c 00 6f 00  | .t.h.a.t. .l.o.|
000000a0  6f 00 6b 00 73 00 20 00  6c 00 69 00 6b 00 65 00  |o.k.s. .l.i.k.e.|
000000b0  20 00 61 00 20 00 55 00  54 00 46 00 2d 00 38 00  | .a. .U.T.F.-.8.|
000000c0  2d 00 65 00 6e 00 63 00  6f 00 64 00 65 00 64 00  |-.e.n.c.o.d.e.d.|
000000d0  20 00 42 00 4f 00 4d 00  20 00 28 00 62 00 75 00  | .B.O.M. .(.b.u.|
000000e0  74 00 20 00 69 00 73 00  6e 00 19 20 74 00 29 00  |t. .i.s.n.. t.).|
000000f0  2e 00 0a 00                                       |....|
000000f4

@mathiasbynens mathiasbynens changed the title Strip BOM when reading files Strip BOM when reading UTF-8 files May 12, 2014
@yocontra
Copy link
Member

This looks great. The only problem is consistency with streaming mode, so this also needs to be applied here https://github.com/wearefractal/vinyl-fs/blob/master/lib/src/streamFile.js#L4 via a streaming BOM remover which may or may not exist already.

@sindresorhus
Copy link

@mathiasbynens mind updating so we can land this?

strip-bom has stream support: https://github.com/sindresorhus/strip-bom#usage

@mathiasbynens
Copy link
Contributor Author

Done (without a test for streaming mode though).

@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) when pulling 4c1fc02 on mathiasbynens:patch-1 into abd2fe4 on wearefractal:master.

@yocontra
Copy link
Member

One small problem with the BOM removal streaming - it's possible that a chunk won't contain the entire BOM, you need to buffer the first 3 bytes then check for the BOM. The first chunk could just be one or two bytes and then it will be marked as "no bom found" and slip through. I can't think of a case where bytes would be emitted one by one but it's possible and I don't think it's safe to make assumptions

@yocontra
Copy link
Member

Also, we can go with UTF8 support for now but I would like to see support for others afterwards. I can help out with that if I have time.

https://github.com/jedmao/bombom#pre-registered-bom-types

@sindresorhus
Copy link

@contra i assumed it would never be as low, but as i can't find anywhere mentioning a minimum size i guess i have to do that...

Also, we can go with UTF8 support for now but I would like to see support for others afterwards. I can help out with that if I have time.

There's no reason to strip bom from anything other than UTF8. In UTF16 and others it actually has a use.

@yocontra
Copy link
Member

@sindresorhus Ah okay, I have no idea what I'm saying then - disregard 🌵

@yocontra
Copy link
Member

Once the strip-bom module has support for lower watermarks I'll merge this and add the stream test before publishing, then include in gulp 3.7 (releasing later tonight)

sindresorhus added a commit to sindresorhus/strip-bom that referenced this pull request May 25, 2014
@sindresorhus
Copy link

@contra done

@yocontra
Copy link
Member

@sindresorhus Nice job on first-chunk-stream

yocontra added a commit that referenced this pull request May 25, 2014
Strip BOM when reading UTF-8 files
@yocontra yocontra merged commit 15c0f9a into gulpjs:master May 25, 2014
@phated phated mentioned this pull request Aug 12, 2015
phated pushed a commit that referenced this pull request Dec 5, 2017
Strip BOM when reading UTF-8 files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants