Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified? #6924

webarchymeta · 2016-05-22T20:59:26Z

Version:: all version so far
Platform:: all platform
Subsystem: ...

When loading text content from filepath that are utf-8 encode and having a BOM header

fs.readFile(filepath, 'utf8', (err, content) => {

});

the content returned still contains the BOM header, in addition to the text content. It causes a lot of wasted efforts to remove such a header in custom codes and is not the kind of behavior in other languages ...

Removing such a header in node shouldn't affect most of the existing codes that depend on it since such a header is most likely not used by custom codes anyway

The text was updated successfully, but these errors were encountered:

vsemozhetbyt · 2016-05-22T21:20:29Z

It seems developers leave this to userland: #3040

webarchymeta · 2016-05-22T21:43:07Z

I don't understand the rational behind such a decision, it's seems to be semantically incorrect ... It does not take a lot to do it in the core at all

vsemozhetbyt · 2016-05-22T22:02:51Z

Well, I also am tired of writing time after time:

  const rli = rl.createInterface({input: fs.createReadStream(path, 'utf8')});
  let lineNumber = 0;
  rli.on('line', line => {
    if (++lineNumber === 1) line = line.replace(bomRE, '');
    //...
  })

May be something will change in the distant future.

imyller · 2016-05-22T23:48:05Z

There are at multiple (15+) different Byte Order Mark (BOM) values that can be received from a stream. Actual value of BOM is very relevant to someone implementing a generic parser capable of handling anything thrown at it - or at least fail gracefully.

If core receives the BOM and stops passing it through then core should also handle all the nuances of every encoding and endianness accordingly and somehow present them in a uniform way for userland stream. That seems unlikely to happen and no-one wants the core to make dumbed down assumptions as a shortcut just because majority of users are expecting to receive UTF-8 on little-endian CPU.

So, it is true that requesting stream in UTF-8 has fixed BOM of EF BB BF (hex), but that does not apply to other possible encoding/endianess combinations.

Maybe optional parameter for dropping BOMs from read streams could be offered?

webarchymeta · 2016-05-23T02:20:44Z

Yeh, what you suggested could be an option for the time being ...

But still ... I am not an expert in text encoding so I do not know a lot intricacies about it. But to me it seems that byte ordering is platform dependent only, after having decoded a block of text successfully from the raw bytes the node core had already made the right conversion (using the BOM) and send the resulting text into the userland. What is the use of the BOM information after that, inside the userland, which most like doing only platform independent programming?

For those who make generic parsers (the real men :-)), they can always read the raw bytes into a Buffer object and start from there. Am I missing something in my statements?

imyller · 2016-05-23T04:14:42Z

@webarchymeta

What is the use of the BOM information after that, inside the userland, which most like doing only platform independent programming?

Exactly. For example: embedded system code that gets cross-compiled to multiple platforms, code interfacing with legacy systems/hardware or mission-critical code with need to gracefully handle every possible input scenario; no matter how implausible.

For those who make generic parsers (the real men :-)), they can always read the raw bytes into a Buffer object and start from there. Am I missing something in my statements?

You might be right about that, but I'm still arguing that BOMs are not by any means irrelevant nuisance and exposing them makes Node.js stronger in capability than it would be if hiding them. Streams starting with BOMs are defined in unicode standard.

That said, it I still think that it would be good idea to offer option for BOM stripped streams - just for developer comfort. Also, BOMs could be used to implement ReadStream with true unicode conversion between source content encoding (defined by its BOM) and output encoding requested by the user. Especially in that case outputting BOM bytes could very well be optional if conversion is reliable.

jasnell · 2016-06-07T19:19:01Z

It would potentially be possible to add a omitBOM: false|true option to the fs.readFile API and have it attempt to strip the BOM for known encodings (utf8 and utf16le). A PR would be welcomed :-) That's no guarantee that it would land because others would need to weigh in on it but I'd welcome it for sure.

vsemozhetbyt · 2016-06-07T19:33:08Z

Maybe for streams it would be also very helpful.

Trott · 2017-07-07T23:29:58Z

This issue has been inactive for sufficiently long that it seems like perhaps it should be closed. Feel free to re-open (or leave a comment requesting that it be re-opened) if you disagree. I'm just tidying up and not acting on a super-strong opinion or anything like that.

fs.readFile* don't strip the BOM header, even when specifying 'utf8' as the encoding, and JSON.parse() doesn't handle it either. There are technically a bunch of BOM indicators, but this is the one seen most commonly and actually appears in a number of package.json files in the wild. See nodejs/node#6924, nodejs/node#3040 for background.

vkurchatkin added the fs Issues and PRs related to the fs subsystem / file system. label May 22, 2016

mscdex added the question Issues that look for answers. label May 22, 2016

jasnell added the feature request Issues that request new features to be added to Node.js. label Jun 7, 2016

vsemozhetbyt mentioned this issue Mar 9, 2017

Proposal: add option to automatically add BOM to write methods #11767

Closed

Trott closed this as completed Jul 7, 2017

rmg mentioned this issue Sep 2, 2017

fix: strip BOM when reading JSON files zkat/cipm#8

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified? #6924

Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified? #6924

webarchymeta commented May 22, 2016 •

edited

Loading

vsemozhetbyt commented May 22, 2016

webarchymeta commented May 22, 2016 •

edited

Loading

vsemozhetbyt commented May 22, 2016 •

edited

Loading

imyller commented May 22, 2016 •

edited

Loading

webarchymeta commented May 23, 2016

imyller commented May 23, 2016

jasnell commented Jun 7, 2016

vsemozhetbyt commented Jun 7, 2016

Trott commented Jul 7, 2017

Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified? #6924

Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified? #6924

Comments

webarchymeta commented May 22, 2016 • edited Loading

vsemozhetbyt commented May 22, 2016

webarchymeta commented May 22, 2016 • edited Loading

vsemozhetbyt commented May 22, 2016 • edited Loading

imyller commented May 22, 2016 • edited Loading

webarchymeta commented May 23, 2016

imyller commented May 23, 2016

jasnell commented Jun 7, 2016

vsemozhetbyt commented Jun 7, 2016

Trott commented Jul 7, 2017

webarchymeta commented May 22, 2016 •

edited

Loading

webarchymeta commented May 22, 2016 •

edited

Loading

vsemozhetbyt commented May 22, 2016 •

edited

Loading

imyller commented May 22, 2016 •

edited

Loading