Proposal: add option to automatically add BOM to write methods #11767

mcdado · 2017-03-09T10:46:50Z

Today I found out that you need to manually add a unicode representation of the Byte Order Mark in unicode files/streams.

Having to prepend it manually leads to confusion, IMHO. I think it would be better to add an addBom option (or something like that) to the different write methods, which would take the manual step out of the process.


vsemozhetbyt · 2017-03-09T11:52:11Z

This matter emerges from time to time. See, for example, these issues with comments:

#3040
#6924

sam-github · 2017-03-09T18:08:38Z

The BOM is part of the file. Removing or adding it would modify the data to/from disk and trigger another set of bug reports along the lines of "why is the data I read in node smaller than the file on disk?" and "why does this file I wrote have these strange bytes in the front?". It's a bit awkward, but I think explicit is better than guessing the user's intentions here. It might be possible, though, to introduce new encoding names with +bom or something that would add/remove the BOM implicitly (but under the explicit control of the user).

mcdado · 2017-03-09T18:35:40Z

If you write as ucs2le (utf16le) and don't know that you have to manually add the BOM (even though the BOM is mandatory in that encoding), you create invalid data. I had to find a discussion (linked before) where somebody had to figure out which Unicode character to use for the BOM, meaning it is not obvious. I think that Core should have an extra affordance that takes care of this.

…

-- *David Gasperoni*

sam-github · 2017-03-09T19:02:56Z

What is your specific suggestion? What "different write methods" would you like addBom: to be an option for?

/cc @srl295

mcdado · 2017-03-09T19:28:34Z

I imagined it as an option like addBom for writing and stripBom for reading, but it could also be "bom: auto|on|off" so that the stream can be left as it is, either way. With "auto" it would be on for encodings like utf16le but off for UTF8, for example.
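As a rough illustration of the "bom: auto|on|off" idea, the decision could be isolated in a small policy function. This is only a sketch: `shouldAddBom`, the two encoding sets, and the default of "off" are assumptions made for illustration, not an existing Node.js API.

```javascript
// Hypothetical policy function; the option values follow the proposal above.
// Encodings that would get a BOM in "auto" mode (an assumption of this sketch).
const BOM_BY_DEFAULT = new Set(['utf16le', 'utf-16le', 'ucs2', 'ucs-2']);
// Encodings for which a BOM is meaningful at all.
const BOM_CAPABLE = new Set([...BOM_BY_DEFAULT, 'utf8', 'utf-8']);

function shouldAddBom(encoding, bom = 'off') {
  const name = String(encoding).toLowerCase();
  if (bom === 'on') return BOM_CAPABLE.has(name);      // forced, where it makes sense
  if (bom === 'auto') return BOM_BY_DEFAULT.has(name); // per-encoding default
  return false;                                        // "off": leave data untouched
}

console.log(shouldAddBom('utf16le', 'auto')); // true
console.log(shouldAddBom('utf8', 'auto'));    // false
console.log(shouldAddBom('utf8', 'on'));      // true
```

Which encodings belong in the "auto" set is exactly the kind of default that would need discussion.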


bnoordhuis · 2017-03-10T08:50:39Z

> With "auto" it would be on for encodings like utf16le but off for UTF8

The UTF-8 byte order mark (EF BB BF), while not common and discouraged, is in use.

It's trivial to strip from incoming data but I predict endless discussions on whether it should be added or not to outgoing data.
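For the reading side, stripping can be sketched in a couple of lines. `stripBom` is a hypothetical helper name, not a Node.js API; the sketch relies on the fact that `Buffer#toString('utf8')` keeps a leading BOM as U+FEFF in the decoded string.

```javascript
// Hypothetical helper, not part of Node.js core: drops one leading U+FEFF
// from an already-decoded string and leaves everything else untouched.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// The UTF-8 BOM bytes (EF BB BF) followed by "hi".
const decoded = Buffer.from([0xEF, 0xBB, 0xBF, 0x68, 0x69]).toString('utf8');
console.log(decoded.length);           // 3
console.log(stripBom(decoded).length); // 2
```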

mcdado · 2017-03-10T08:58:38Z

> > With "auto" it would be on for encodings like utf16le but off for UTF8
>
> The UTF-8 byte order mark (EF BB BF), while not common and discouraged, is in use.

Exactly, at that point we're just talking defaults. To me, auto in write methods would not add a BOM if the encoding is utf8, but would add it if the encoding is utf16le. See here:

> The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Many applications ignore the BOM code at the start of any Unicode encoding. Web browsers often use a BOM as a hint in determining the character encoding.

In my experience, if you encode in utf16le and don't include a BOM, many readers won't be able to interpret the file. I tried BBEdit 11.6 and VS Code 1.10.1, both on a Mac: the former misinterprets the file and the latter can't open it.

bnoordhuis · 2017-03-10T09:08:58Z

I think you miss my point. You should file a pull request if you feel it's a worthwhile addition but you should be prepared for lots of discussion when it's a convenience thing like an 'auto' mode (or even an on/off mode; stripping and inserting BOMs is after all just a convenience.)

mcdado · 2017-03-10T09:22:06Z

Indeed, I see that many of those previous Issues come from a convenience point of view. The usability of a language/environment comes down to these things too. I'll sketch up the PR to get the conversation going.


zcorpan · 2017-03-14T07:45:55Z

What is the use case for writing UTF-16 at all?

mcdado · 2017-03-14T08:21:26Z

Unfortunately there is business software (HP SmartStream Designer) that doesn't distinguish between UTF8 and Ansi because the former is back-compatible, and whose understanding of Unicode is UCS-2 (utf16le). This was my use case; I'm sure there are many more.


silverwind · 2017-03-14T18:10:06Z

I think you might be better off wrapping or monkey-patching fs to suit your needs. BTW, are there any other programming languages with such a "feature"?

mcdado · 2017-03-14T20:11:19Z

Personally, as a user of the language, I expect that if I choose the utf16le encoding, the runtime knows it needs to add the BOM to make a legal file in that encoding.
If it doesn't automatically take care of it, I would expect it to show clearly how to add such a character. Doing it by hand can be confusing: U+FEFF is the Unicode character, but you have to know that you always write it as it is instead of flipping it according to the encoding you're using. My point is that the situation is opaque right now, and it would be better if it were clearer and more transparent. The new affordance should probably be off by default, but adding a new option like "useBom: true" wouldn't cause any drawbacks IMHO.
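The point about not flipping the character by hand can be checked directly: the string always contains U+FEFF, and the encoder decides the byte order. A short sketch using only `Buffer.from`:

```javascript
// The BOM is always the single character U+FEFF in the JavaScript string;
// the chosen encoding determines which bytes end up in the output.
const BOM = '\uFEFF';

console.log(Buffer.from(BOM, 'utf16le'));       // <Buffer ff fe>
console.log(Buffer.from(BOM, 'utf8'));          // <Buffer ef bb bf>
console.log(Buffer.from(BOM + 'a', 'utf16le')); // <Buffer ff fe 61 00>
```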

zcorpan · 2017-03-15T11:40:41Z

> Unfortunately there is business software (HP SmartStream Designer) that doesn't distinguish between UTF8 and Ansi because the former is back-compatible,

OK, so you need the BOM for utf-8. It seems reasonable to me to have a convenient way to do that.

> and their understanding of Unicode is UCS-2 (utf16le).

But you don't need to use utf-16, correct?

A bit of trivia about the BOM for utf-16:
In the era before https://encoding.spec.whatwg.org/ , if the encoding label is "utf-16", the BOM is mandatory; if the encoding label is "utf-16le" or "utf-16be", the BOM is forbidden. (Encoding label is what you'd put in the charset parameter for Content-Type response header in HTTP.)

Today per the Encoding Standard, the utf-16 decoder is more robust wrt the BOM and encoding label, but it does not specify an encoder because browsers do not need an encoder and everyone should be using only utf-8.
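On the decoding side, a reader that lets the BOM decide can be sketched as a simple byte comparison. `sniffBom` is a hypothetical helper written for illustration; it only checks the three well-known BOM byte sequences and does not attempt any other detection.

```javascript
// Hypothetical helper: returns the encoding implied by a leading BOM,
// or null when the buffer does not start with one.
function sniffBom(buf) {
  if (buf.length >= 3 && buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) {
    return 'utf-8';
  }
  if (buf.length >= 2 && buf[0] === 0xFE && buf[1] === 0xFF) return 'utf-16be';
  if (buf.length >= 2 && buf[0] === 0xFF && buf[1] === 0xFE) return 'utf-16le';
  return null;
}

console.log(sniffBom(Buffer.from('\uFEFFhi', 'utf16le'))); // utf-16le
console.log(sniffBom(Buffer.from('hi', 'utf16le')));       // null
```

Without a BOM the function returns null, which is the situation where a reader has to fall back on a label or a guess.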

mcdado · 2017-03-15T12:16:39Z

I admit that I'm not an expert in encodings or the standards about them. I can only speak from my experience using them, and this is how I ran into the issue:

> var fs = require('fs');
> fs.writeFileSync('/Users/David/Desktop/test.txt', 'aåäeèéëiïœoøöuü', {encoding: 'utf16le'});

I'm on a Mac, I have several text editors to try to open the test file. In order: TextEdit, BBEdit, Visual Studio Code, Safari, Hex Fiend.

Then I did the following:

fs.writeFileSync('/Users/David/Desktop/test-bom.txt', '\ufeffaåäeèéëiïœoøöuü', {encoding: 'utf16le'});

Without making more screenshots, just trust me that the same apps interpreted this file just fine.

mcdado · 2017-03-15T12:27:21Z

> But you don't need to use utf-16, correct?

No, I'll try to explain better: I need to feed this software text files, which it uses as records. These files often contain characters like the ones in my last comment. If I save the output as utf8, the software does not interpret the file correctly; it assumes it is Ansi. Sure, it's a buggy interpreter, but that's not my problem to solve. This software has support for what it calls Unicode, but I found out that it means either UCS-2 or UTF-16. Through experimentation, I found that if I use Notepad++ on Windows to convert the file to UCS-2 LE (which adds a BOM), its interpreter works correctly. That's why I started using utf16le as the encoding, but I was surprised that I have to include the BOM by hand when, again in my experience, files without it simply don't work anywhere!

This is just anecdotal for the rest of the world I guess, but I thought it showed a "hole" in the assumption of Node trying to write to utf16le streams. If I'm wrong, then we should just keep adding BOM by hand, like, all the time.

zcorpan · 2017-03-15T12:49:48Z

I think you would probably be better off using utf-8 with a BOM than using utf-16 (any variant).

Alternatively use utf-8 without a BOM and configure your editors to default to utf-8 instead of "Ansi".

mcdado · 2017-03-15T14:51:27Z

The editors are not the problem… it's the specific software that expects either UCS-2 or UTF-16.

My example with the various editors was to show that they can neither read nor detect UTF-16 without a BOM, so it's not just me 😄

jasnell · 2017-03-15T19:50:06Z

The key challenge to writing the BOM automatically is that the stream interface is quite agnostic to the encoding right now. It would be fairly straightforward, however, to create a light weight wrapper interface in userland that does this... something like...

const BomStream = require('...');
const fs = require('fs');
const out = fs.createWriteStream('data');
const bomout = new BomStream.Utf16LeStream(out);
bomout.write('some data');

While I am quite sympathetic to the problem, I don't believe we should be adding support for this in core.
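A userland wrapper along those lines could be built on `stream.Transform`. The following is only a sketch under assumptions: `Utf16LeBomStream` is a hypothetical class, not a published module, and unlike the snippet above it is piped into the file stream rather than constructed around it.

```javascript
const { Transform } = require('stream');

// Hypothetical userland class. It encodes incoming strings as UTF-16LE and
// emits the BOM bytes (FF FE) once, before the first chunk.
class Utf16LeBomStream extends Transform {
  constructor(options) {
    // decodeStrings: false keeps strings as strings so they can be
    // re-encoded here instead of arriving as UTF-8 bytes.
    super({ ...options, decodeStrings: false });
    this.bomWritten = false;
  }

  _transform(chunk, encoding, callback) {
    if (!this.bomWritten) {
      this.push(Buffer.from('\uFEFF', 'utf16le'));
      this.bomWritten = true;
    }
    // Buffers are passed through and assumed to be UTF-16LE already.
    this.push(typeof chunk === 'string' ? Buffer.from(chunk, 'utf16le') : chunk);
    callback();
  }
}

// Possible usage (writes a file named 'data' in the current directory):
//   const fs = require('fs');
//   const bomout = new Utf16LeBomStream();
//   bomout.pipe(fs.createWriteStream('data'));
//   bomout.write('some data');
//   bomout.end();
```

Keeping the BOM logic in its own stream leaves the underlying write stream encoding-agnostic, which is the property described above.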

mcdado · 2017-03-15T20:00:21Z

Hmm, maybe I misnamed this Issue, and created confusion along the way.

What I'm proposing is a way to say addBom: true, which to me seems to be a fairly minor challenge. I think it solves the discoverability issue, while it wouldn't do anything by default. When it is used, users wouldn't need to figure out in userland how to specify the BOM Unicode character.

I guess it should also be smart enough to not do anything if the encoding being passed is one which does not use BOMs.
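Until something like this exists in core, the behaviour can be approximated with a thin wrapper around `fs.writeFileSync`. This is a sketch, not a proposed implementation: `writeFileSyncWithBom` and the `BOM_ENCODINGS` set are hypothetical, and `addBom` is consumed by the wrapper itself because `fs.writeFileSync` does not know that option.

```javascript
const fs = require('fs');

// Encodings for which this sketch will prepend U+FEFF (an assumption).
const BOM_ENCODINGS = new Set(['utf8', 'utf-8', 'utf16le', 'utf-16le', 'ucs2', 'ucs-2']);

// Hypothetical wrapper: off by default, a no-op for non-string data and for
// encodings outside BOM_ENCODINGS, and it never adds a second BOM.
function writeFileSyncWithBom(file, data, { addBom = false, ...options } = {}) {
  const encoding = String(options.encoding || 'utf8').toLowerCase();
  const wantsBom = addBom && typeof data === 'string' &&
    BOM_ENCODINGS.has(encoding) && !data.startsWith('\uFEFF');
  fs.writeFileSync(file, wantsBom ? '\uFEFF' + data : data, options);
}

// Possible usage:
//   writeFileSyncWithBom('test-bom.txt', 'aåäeèéë', { encoding: 'utf16le', addBom: true });
```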

hsivonen · 2017-03-16T10:44:36Z

What's the use case for writing UTF-16 without a BOM? Shouldn't a BOM be added automatically when UTF-16 output is requested?

mcdado · 2017-03-16T11:49:40Z

> What's the use case for writing UTF-16 without a BOM? Shouldn't a BOM be added automatically when UTF-16 output is requested?

That's exactly what brought me here to discuss this.

Trott · 2017-07-30T03:33:59Z

This seems stalled (and seems to me like something that should be solved as a published module before consideration for adding to core, but reasonable people can disagree on that). I'm going to close this, but if that's misguided because there's active work going on or for some other reason, by all means, comment to that effect (or re-open if GitHub allows you to).

mikaelfs · 2019-02-15T10:01:02Z

Considering @mcdado's proposal, I looked again at the BOM definition in RFC 2781. An excerpt from Section 3.2, Byte Order Mark (BOM):

It is important to understand that the character 0xFEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a byte-order mark. The contrapositive of that
statement is not always true: the character 0xFEFF in the first
position of a stream MAY be interpreted as a zero-width non-breaking
space, and is not always a byte-order mark. For example, if a process
splits a UTF-16 string into many parts, a part might begin with
0xFEFF because there was a zero-width non-breaking space at the
beginning of that substring.

According to the Node.js documentation for the fs.writeFile method, data can be a string, Buffer, TypedArray, or DataView. When the user passes a string, the data length is fixed; in other words, it represents a single stream of known length. Since the first position of a single stream that contains multibyte characters will represent the BOM, it would be handy to have a built-in option (e.g. addBom) that adds the BOM and is activated when the user supplies a string.

I stumbled into this same issue when trying to get East Asian characters in a CSV file generated with Node.js to display properly in Excel on macOS. It is not very intuitive for users to have to add the BOM to the data manually. With the option in place, users would get better insight into how to write multibyte characters to a file properly.

I wrote an article about this issue, including trial and error to properly write multibyte characters into a CSV file that can be later read across different applications.

mcdado · 2019-02-15T18:45:36Z

@mikaelfs thank you for your input. I hope that at least this Issue has the SEO juice to bubble up in search results and help confused people searching for a solution. 🙂

mscdex added the "feature request" (Issues that request new features to be added to Node.js.) and "fs" (Issues and PRs related to the fs subsystem / file system.) labels on Mar 9, 2017

Trott closed this as completed Jul 30, 2017
