Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified? #6924

Closed
webarchymeta opened this issue May 22, 2016 · 9 comments
Labels
feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. question Issues that look for answers.

Comments

@webarchymeta
Copy link

webarchymeta commented May 22, 2016

  • Version:: all version so far
  • Platform:: all platform
  • Subsystem: ...

When loading text content from filepath that are utf-8 encode and having a BOM header

fs.readFile(filepath, 'utf8', (err, content) => {

});

the content returned still contains the BOM header, in addition to the text content. It causes a lot of wasted efforts to remove such a header in custom codes and is not the kind of behavior in other languages ...

Removing such a header in node shouldn't affect most of the existing codes that depend on it since such a header is most likely not used by custom codes anyway

@vkurchatkin vkurchatkin added the fs Issues and PRs related to the fs subsystem / file system. label May 22, 2016
@vsemozhetbyt
Copy link
Contributor

It seems developers leave this to userland: #3040

@webarchymeta
Copy link
Author

webarchymeta commented May 22, 2016

I don't understand the rational behind such a decision, it's seems to be semantically incorrect ... It does not take a lot to do it in the core at all

@vsemozhetbyt
Copy link
Contributor

vsemozhetbyt commented May 22, 2016

Well, I also am tired of writing time after time:

  const rli = rl.createInterface({input: fs.createReadStream(path, 'utf8')});
  let lineNumber = 0;
  rli.on('line', line => {
    if (++lineNumber === 1) line = line.replace(bomRE, '');
    //...
  })

May be something will change in the distant future.

@mscdex mscdex added the question Issues that look for answers. label May 22, 2016
@imyller
Copy link
Member

imyller commented May 22, 2016

There are at multiple (15+) different Byte Order Mark (BOM) values that can be received from a stream. Actual value of BOM is very relevant to someone implementing a generic parser capable of handling anything thrown at it - or at least fail gracefully.

If core receives the BOM and stops passing it through then core should also handle all the nuances of every encoding and endianness accordingly and somehow present them in a uniform way for userland stream. That seems unlikely to happen and no-one wants the core to make dumbed down assumptions as a shortcut just because majority of users are expecting to receive UTF-8 on little-endian CPU.

So, it is true that requesting stream in UTF-8 has fixed BOM of EF BB BF (hex), but that does not apply to other possible encoding/endianess combinations.

Maybe optional parameter for dropping BOMs from read streams could be offered?

@webarchymeta
Copy link
Author

Yeh, what you suggested could be an option for the time being ...

But still ... I am not an expert in text encoding so I do not know a lot intricacies about it. But to me it seems that byte ordering is platform dependent only, after having decoded a block of text successfully from the raw bytes the node core had already made the right conversion (using the BOM) and send the resulting text into the userland. What is the use of the BOM information after that, inside the userland, which most like doing only platform independent programming?

For those who make generic parsers (the real men :-)), they can always read the raw bytes into a Buffer object and start from there. Am I missing something in my statements?

@imyller
Copy link
Member

imyller commented May 23, 2016

@webarchymeta

What is the use of the BOM information after that, inside the userland, which most like doing only platform independent programming?

Exactly. For example: embedded system code that gets cross-compiled to multiple platforms, code interfacing with legacy systems/hardware or mission-critical code with need to gracefully handle every possible input scenario; no matter how implausible.

For those who make generic parsers (the real men :-)), they can always read the raw bytes into a Buffer object and start from there. Am I missing something in my statements?

You might be right about that, but I'm still arguing that BOMs are not by any means irrelevant nuisance and exposing them makes Node.js stronger in capability than it would be if hiding them. Streams starting with BOMs are defined in unicode standard.

That said, it I still think that it would be good idea to offer option for BOM stripped streams - just for developer comfort. Also, BOMs could be used to implement ReadStream with true unicode conversion between source content encoding (defined by its BOM) and output encoding requested by the user. Especially in that case outputting BOM bytes could very well be optional if conversion is reliable.

@jasnell
Copy link
Member

jasnell commented Jun 7, 2016

It would potentially be possible to add a omitBOM: false|true option to the fs.readFile API and have it attempt to strip the BOM for known encodings (utf8 and utf16le). A PR would be welcomed :-) That's no guarantee that it would land because others would need to weigh in on it but I'd welcome it for sure.

@jasnell jasnell added the feature request Issues that request new features to be added to Node.js. label Jun 7, 2016
@vsemozhetbyt
Copy link
Contributor

Maybe for streams it would be also very helpful.

@Trott
Copy link
Member

Trott commented Jul 7, 2017

This issue has been inactive for sufficiently long that it seems like perhaps it should be closed. Feel free to re-open (or leave a comment requesting that it be re-opened) if you disagree. I'm just tidying up and not acting on a super-strong opinion or anything like that.

@Trott Trott closed this as completed Jul 7, 2017
rmg added a commit to rmg/cipm that referenced this issue Sep 2, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
rmg added a commit to rmg/cipm that referenced this issue Sep 2, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
rmg added a commit to rmg/cipm that referenced this issue Sep 2, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
rmg added a commit to rmg/cipm that referenced this issue Sep 3, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
rmg added a commit to rmg/cipm that referenced this issue Sep 3, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
rmg added a commit to rmg/cipm that referenced this issue Sep 5, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
zkat pushed a commit to zkat/cipm that referenced this issue Sep 6, 2017
fs.readFile* don't strip the BOM header, even when specifying 'utf8' as
the encoding, and JSON.parse() doesn't handle it either. There are
technically a bunch of BOM indicators, but this is the one seen most
commonly and actually appears in a number of package.json files in the
wild.

See nodejs/node#6924, nodejs/node#3040 for background.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. question Issues that look for answers.
Projects
None yet
Development

No branches or pull requests

7 participants