-
-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF-8 BOM stripping Redux #83
Comments
Pinging @mathiasbynens who wrote the PR for this code originally |
Still 👎 on not stripping this. PR for reference #21 and we've created https://github.com/sindresorhus/gulp-bom for the few that want BOM. All of the gulp libraries are written for the many, not the few, but the few have ways to work around it (hence the plugin). |
@phated vinyl and vinyl-fs are outside of gulp. gulp can wrap vinyl-fs .src to default to BOM stripping while vinyl-fs defaults to not stripping BOM |
@contra I understand what you are saying, but any library (not just gulp) should be written for the 90% case. The 90% (hell, even 99% case here) is to strip the BOM. The 1% can just use a plugin |
@contra it's not like this is an undocumented feature of vinyl-fs (I just searched BOM on the readme) |
do you thinkit would be acceptable, then, for the linux kernel to strip BOM characters? what if your filesystem was trying to be super nice for the 99% of users who, you know, don't care about BOM?
Or what if git did this?
Or netcat?
Libraries are about abstractions of use. filesystem tools do not mutate files based on some prejudice of byte distributions. They honor the fundamental abstraction of a filesystem: the user should be able to put a bitstring in, and get the same bitstring out. |
@jbenet your analogies are egregious, this is a build tool for the web, not curl, not netcat, not the linux kernel. This is a documented feature that takes opinionated stance that takes care of the 99%. |
@jbenet Your comparison is not very fair, IMHO. I don’t think it’s very common for devs to use gulp tasks that mutate source files like that. In all of the (That said, stripping BOM from existing source files could be useful as a Git precommit hook, for example, and sure, gulp could be used for that. But this just adds to the point.) IMHO, by default, concatenating two files together through gulp or any other build tool should not result in a file with a BOM somewhere in the middle. This is anecdotal evidence, but every time I’ve seen a file with a BOM in it the developer was unaware their editor was configured to add it. It was never an intentional choice. BOM is a bug. |
I’d love to hear more details about @jbenet’s use case. What kind of files are you dealing with that make the BOM so important?
I don’t have a strong opinion on this. |
No. vinyl-fs is used by many tools that transfer arbitrary binary files. including ours! (see the end of this post for an example much like nc and git that today uses vinyl-fs.
I think vinyl-fs is a very useful module, well outside of the confines of gulp. https://www.npmjs.com/browse/depended/vinyl-fs shows lots of other uses. This is https://github.com/wearefractal/vinyl-fs, an adapter for https://github.com/wearefractal/vinyl and the filesystem. It is inspired by -- and used for -- gulp, but is a separate module, as far as github, npm, and the abstractions presented by the module readme are concerned. If you intend it not to be used by non-gulp things, i understand -- it's your module, not mine -- i'll go away. But my impression was you wanted this to be like any other module in npm: a general tool with utility outside of the scope of its original motivation (a good thing).
Not all files piped through with vinyl and vinyl-fs are text (ASCII/Unicode) files. Take a look at https://www.npmjs.com/browse/depended/vinyl-fs -- I wonder what % of the module authors actually want some bytes to be stripped, or are even aware. Examples include a tool to download files from the internet and save them to the local file system. I.e. it uses vinyl-fs to save arbitrary binary files. (much like the examples I showed, @phated). I don't think they would want the particular sequence of bytes that happens to coincide with BOM to be stripped in a jpg, video, or a tar file... (that would wreak havoc to alignment of the format frames). Luckily, it seems to only use Btw, yes, I am aware it is trying to do some detection with https://www.npmjs.com/package/is-utf8 -- but that looks not safe for arbitrary byte sequences (e.g. archives with utf files within them.)
Thanks for asking! We are using vinyl-fs to move arbitrary binary files (not unicode!), where the infamous BOM just happens to be a particular sequence of bytes. Moreover, we follow string filesystem semantics-- and files have to be --bit for bit-- exactly the same in and out. We're using vinyl-fs in a tool that effectively does this: # add files in one computer
> hash=$(ipfs add <path>)
# get them in another computer
> ipfs get $hash And it works both in the cli and in the web browser. If the bitstrings change, both fs guarantees and cryptographic checks fail. to remind everyone, what the original PR asked for is this: var vfs = require('vinyl-fs')
var files = vfs.srcs('./**/*', {stripBOM: false}) though i do think @contra is right about this:
I'm done here. I've spent way more than the energy required to solve my own problem. Do what you will. |
A file that starts with a BOM cannot possibly be a valid JPG, video (in any format), or Like I said, I wouldn’t mind if vinyl-fs exposed a |
All that said, I don't see problem to be option which by default to be |
@jbenet Can you provide a more concrete use case (or sample repo) of where our BOM stripping is incorrectly stripping on a non-UTF8 file? |
Just noob here, but why doesn't vinyl-fs strip the BOM from the content, and have a flag that indicates that it was present? dest can then write the BOM out if the file has this BOM flag was set (or not)? Most plug-ins, etc could then just worry about what is in content, blissfully ignoring the fact if the BOM was present or not. I'm probably missing something basic as I'm just picking all this up. You could always make an option then to treat BOMs as raw data for any edge cases in which you want to treat the BOM as content. |
The best, the only logical, non-breaking and reasonable thing would be to just expose option that can control that behavior, which would be enabled by default to support backward compatibility. There's absolutely no problem to have such option, and there's no need to have any so stable reason behind to introduce it. I don't see problem if some user want it and it not breaks the things and don't add complexity - it will only help and allow more control. |
Again: I'm not against this, I'm just trying to understand it entirely before making a judgement |
After looking further into this, and actually looking at how it's currently implemented, I'm going to have to side with jbenet on this one. Vinyl-fs likely shouldn't be BOM stripping, at least by default. It's quite possible to create files that fail all the safety tests, and it adds a non-trivial amount of processing to try and detect utf8 by scanning up to the entire file's contents making sure each byte conforms to uft8 encoding. I'd recommend either moving this logic up into gulp's methods, or having gulp pass a flag it's calls to vinyl-fs. Similar issues have arisen in most languages (C#, python, pascal, VB), and they solve it via a flag during file open or having two separate types of opens (binary vs text). You could even wrap vinyl-fs with a more complete file abstraction for text files, converting line endings in and out as well as stripping BOM and have gulp depend on this instead of a raw vinyl-fs. Last thing you want is a situation where vinyl-fs is accidentally corrupting files. Tracking that down (if you detect it) would be very time consuming. isUtf8 is only a safe check IF you know that the file is text to begin with. It's definitely not safe on arbitrary binary files. For example, imaging a program that has a single option, if written to disk as binary, and that option just happens to be a 16-bit signed int, and the value is -2, then you use vinyl-fs to copy the file, you've just corrupted it. Or a writing a C structure like: struct { short option1, char[2] state } to disk, if option1 is -2, then it'll also get corrupted. I can come up with more examples, but the likelyhood of a user running across this issue and getting bitten by it is non-trivial, and tracking it down would be non-trivial. Silently corrupting files is about the worst thing for a tool chain can do. |
I'm leaning towards this solution:
Does this satisfy all parties? |
👎 on gulp having to override defaults. Just default it to true in vinyl-fs and document it 2 times (once in options, once in the note that is already there) |
I think that is a fair an equitable solution contra. That would allow vinyl-fs to remain by default as seeing content as "sacred", while allowing gulp which often works on text files to default it sane values for it as well. |
@contra a guiding ethos I have taken with gulp 4 is not to override defaults of the underlying libraries because it causes complex hunting through the dependency tree to find where it is being overridden |
@contra also, please re-read #83 (comment) which I think has the most well-informed information |
What's going to happen:
Reasoning: Multiple people made conjecture that "this could be triggered on non-utf8 files" but not one real example was provided. This seems totally fine as default behavior and no real drawbacks other than philosophical differences over what a true fs library should do were provided. |
@contra 👍 |
@contra i'm 👍 for this resolution, as it breaks no code (important!) even if i think the default is incorrect. btw, i did not spend time looking for a non-utf8 file because:
You can test both of these at https://github.com/jbenet/bom-diff-for-vinyl > make test
./vmv.js
moved src/ to dst/
shasum src/test.num
34e399cb21690bfb4dbcc6308f96cd984461160b src/test.num
shasum dst/test.num
adc83b19e793491b1c6ea0fd8b46cd9f32e592fc dst/test.num
diff -r src dst
diff -r src/test.num dst/test.num
1c1
<
---
>
make: *** [test] Error 1 |
After scanning my entire drive, I found no files that would have their BOM stripped that weren't actual text files (or XML/HTML, etc). The only files I found that could cause problems would be the DRM type files that no doubt have signatures that would now fail since the contents of the file changed. |
Actually, in one or in another way, there would have breaking changes cuz it is new major release (if this change is intended to be for v4). So one more breaking change is not an issue, so it can defaults to false for v4+ and users will be warned. But im for |
add option to strip BOM, default to true - closes #83
add option to strip BOM, default to true - closes #83
@sindresorhus @phated @contra @IsaacSanch
#72 this is awful. I understand your interest in making things generally better, and I like making tooling that fixes legacy problems as much as everyone else.
But the contents of files are the user's domain. Not the tool's.
We wasted hours trying to find what in our big pipeline was changing our files. Changing content of files is not something that filesystem tools should do. To a filesystem, the contents of a file are sacred. Imagine if S3 or Dropbox or HTTP or even TCP subtly "tried to fix files for you". It would be a nightmare. All cryptographically secure protocols break!
I urge you to reconsider this. We were using vinyl and vinyl-fs for a bunch of ipfs pipelines, and would like to continue using it. We very much like the viewpoints described here: https://medium.com/@contrahacks/gulp-3828e8126466 But if the vinyl tooling does not hold it as a strong guarantee that contents of files can move from A to B without magically changing underneath, we'll have to abandon it for something else. :'(
I think @contra said it best:
And I agree. It was disappointing to find this design bug.
The text was updated successfully, but these errors were encountered: