-
-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strip BOM when reading UTF-8 files #21
Conversation
The point of this PR is to get the discussion going. I’ll add tests etc. if you agree this is a useful addition. |
👍 Could also use https://github.com/sindresorhus/strip-bom (it also supports buffers), but no worries if not. |
@sindresorhus Should’ve known you had a package for that! Updated the PR. I’ve now added a test, too. Note to self: easy way to create a “UTF-8 with BOM” file for testing: $ echo 'This file is saved as UTF-8 with BOM.' > test/fixtures/bom.txt
$ vim -e -s -c ':set bomb' -c ':wq' test/fixtures/bom.txt |
@mathiasbynens or just |
We should probably only strip any BOM if the input if it’s in UTF-8 – else it becomes impossible for people to use any encodings that rely on the BOM (not that they should…). Or, we could add that |
It definitely should only strip BOM from utf8. As far as I know Node only supports reading in files as buffer, utf8 or ascii, not utf16, so not sure how it would ever be a problem? No, to |
If a Big-Endian UTF-16-encoded file starts with U+EFBB followed by U+BF00 (or any other code point in the range from U+BF00 to U+BFFF) then the first three bytes in the buffer would be Since the code point U+EFBB has not been assigned a symbol this specific example is definitely an edge case. But my point is that I shouldn’t just assume that |
Is there anything strip-bom can do to protect against that? |
strip-bom is intended to be used for UTF-8 encoded files (right?) so it’s fine. If we want to make strip-bom work for any file, we’d need to detect if if it’s a UTF-8-encoded file somehow, and only strip the BOM in that case. |
Hmm, all the peoples on the internet says to do it like that though... |
Well, yes, but would be nice if consumers didn't have to make sure it's utf8 themselves. |
I just meant that you’ll probably want to create a separate |
Updated this PR to use the new version of strip-bom, which only affects UTF-8 strings and buffers. 👍 For the test files, I took a valid UTF-16-BE/LE files and prepended $ hexdump -C test/fixtures/bom-utf16be.txt
00000000 ef bb bf 00 00 54 00 68 00 69 00 73 00 20 00 66 |.....T.h.i.s. .f|
00000010 00 69 00 6c 00 65 00 20 00 69 00 73 00 20 00 73 |.i.l.e. .i.s. .s|
00000020 00 61 00 76 00 65 00 64 00 20 00 61 00 73 00 20 |.a.v.e.d. .a.s. |
00000030 00 55 00 54 00 46 00 2d 00 31 00 36 00 2d 00 42 |.U.T.F.-.1.6.-.B|
00000040 00 45 00 2e 00 20 00 49 00 74 00 20 00 63 00 6f |.E... .I.t. .c.o|
00000050 00 6e 00 74 00 61 00 69 00 6e 00 73 00 20 00 73 |.n.t.a.i.n.s. .s|
00000060 00 6f 00 6d 00 65 00 20 00 67 00 61 00 72 00 62 |.o.m.e. .g.a.r.b|
00000070 00 61 00 67 00 65 00 20 00 61 00 74 00 20 00 74 |.a.g.e. .a.t. .t|
00000080 00 68 00 65 00 20 00 73 00 74 00 61 00 72 00 74 |.h.e. .s.t.a.r.t|
00000090 00 20 00 74 00 68 00 61 00 74 00 20 00 6c 00 6f |. .t.h.a.t. .l.o|
000000a0 00 6f 00 6b 00 73 00 20 00 6c 00 69 00 6b 00 65 |.o.k.s. .l.i.k.e|
000000b0 00 20 00 61 00 20 00 55 00 54 00 46 00 2d 00 38 |. .a. .U.T.F.-.8|
000000c0 00 2d 00 65 00 6e 00 63 00 6f 00 64 00 65 00 64 |.-.e.n.c.o.d.e.d|
000000d0 00 20 00 42 00 4f 00 4d 00 20 00 28 00 62 00 75 |. .B.O.M. .(.b.u|
000000e0 00 74 00 20 00 69 00 73 00 6e 20 19 00 74 00 29 |.t. .i.s.n ..t.)|
000000f0 00 2e 00 0a |....|
000000f4
$ hexdump -C test/fixtures/bom-utf16le.txt
00000000 ef bb bf 00 54 00 68 00 69 00 73 00 20 00 66 00 |....T.h.i.s. .f.|
00000010 69 00 6c 00 65 00 20 00 69 00 73 00 20 00 73 00 |i.l.e. .i.s. .s.|
00000020 61 00 76 00 65 00 64 00 20 00 61 00 73 00 20 00 |a.v.e.d. .a.s. .|
00000030 55 00 54 00 46 00 2d 00 31 00 36 00 2d 00 4c 00 |U.T.F.-.1.6.-.L.|
00000040 45 00 2e 00 20 00 49 00 74 00 20 00 63 00 6f 00 |E... .I.t. .c.o.|
00000050 6e 00 74 00 61 00 69 00 6e 00 73 00 20 00 73 00 |n.t.a.i.n.s. .s.|
00000060 6f 00 6d 00 65 00 20 00 67 00 61 00 72 00 62 00 |o.m.e. .g.a.r.b.|
00000070 61 00 67 00 65 00 20 00 61 00 74 00 20 00 74 00 |a.g.e. .a.t. .t.|
00000080 68 00 65 00 20 00 73 00 74 00 61 00 72 00 74 00 |h.e. .s.t.a.r.t.|
00000090 20 00 74 00 68 00 61 00 74 00 20 00 6c 00 6f 00 | .t.h.a.t. .l.o.|
000000a0 6f 00 6b 00 73 00 20 00 6c 00 69 00 6b 00 65 00 |o.k.s. .l.i.k.e.|
000000b0 20 00 61 00 20 00 55 00 54 00 46 00 2d 00 38 00 | .a. .U.T.F.-.8.|
000000c0 2d 00 65 00 6e 00 63 00 6f 00 64 00 65 00 64 00 |-.e.n.c.o.d.e.d.|
000000d0 20 00 42 00 4f 00 4d 00 20 00 28 00 62 00 75 00 | .B.O.M. .(.b.u.|
000000e0 74 00 20 00 69 00 73 00 6e 00 19 20 74 00 29 00 |t. .i.s.n.. t.).|
000000f0 2e 00 0a 00 |....|
000000f4 |
This looks great. The only problem is consistency with streaming mode, so this also needs to be applied here https://github.com/wearefractal/vinyl-fs/blob/master/lib/src/streamFile.js#L4 via a streaming BOM remover which may or may not exist already. |
@mathiasbynens mind updating so we can land this?
|
Done (without a test for streaming mode though). |
One small problem with the BOM removal streaming - it's possible that a chunk won't contain the entire BOM, you need to buffer the first 3 bytes then check for the BOM. The first chunk could just be one or two bytes and then it will be marked as "no bom found" and slip through. I can't think of a case where bytes would be emitted one by one but it's possible and I don't think it's safe to make assumptions |
Also, we can go with UTF8 support for now but I would like to see support for others afterwards. I can help out with that if I have time. https://github.com/jedmao/bombom#pre-registered-bom-types |
@contra i assumed it would never be as low, but as i can't find anywhere mentioning a minimum size i guess i have to do that...
There's no reason to strip bom from anything other than UTF8. In UTF16 and others it actually has a use. |
@sindresorhus Ah okay, I have no idea what I'm saying then - disregard 🌵 |
Once the strip-bom module has support for lower watermarks I'll merge this and add the stream test before publishing, then include in gulp 3.7 (releasing later tonight) |
@contra done |
@sindresorhus Nice job on first-chunk-stream |
Strip BOM when reading UTF-8 files
Strip BOM when reading UTF-8 files
When using gulp-concat, I noticed that the BOM wasn’t stripped from files. When concatenating files together there would be multiple BOMs in the resulting file, which only leads to increased file size for no good reason. (Of course, people shouldn’t use UTF-8 with BOM in the first place, but that’s a separate issue…)
Let’s do what Grunt does and strip BOM by default. Grunt has a setting to override it but IMHO this is not needed. https://github.com/gruntjs/grunt/blob/057eb1c3a7ffdb26b6e2e1a5eb67b262cdd3a71d/lib/grunt/file.js#L238-L241
Thanks to @phated for the pointer to vinyl-fs on IRC.