Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: support UTF-8 in compiled-in JS source files #11129

Merged
merged 2 commits into from
Feb 13, 2017

Conversation

bnoordhuis
Copy link
Member

@bnoordhuis bnoordhuis commented Feb 2, 2017

Detect it when source files in lib/ are not ASCII. Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

CI: https://ci.nodejs.org/job/node-test-pull-request/6176/

@nodejs-github-bot nodejs-github-bot added build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. labels Feb 2, 2017
@bnoordhuis
Copy link
Member Author

I rather feared this approach may be a bit too forward-looking for some compilers... I'll see if I can dumb it down.

tools/js2c.py Outdated
# Treat non-ASCII as UTF-8 and convert it to UTF-16.
if any(ord(c) > 127 for c in lines):
ctype = 'uint16_t'
data = map(ord, lines.decode('utf-8').encode('utf-16be'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe 'utf-16be' if sys.byteorder == 'little' else 'utf-16le'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big-endian encoding is intentional, the line below stitches the individual octets together again into uint16s. (.encode() returns a byte string in case you're wondering.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Right.

@jasnell
Copy link
Member

jasnell commented Feb 2, 2017

Just curious, have there been specific issues or is this purely preventative?

@aqrln
Copy link
Contributor

aqrln commented Feb 2, 2017

@jasnell yes, reported in #10673

@gibfahn
Copy link
Member

gibfahn commented Feb 2, 2017

What's the reason for using UTF-16 rather than UTF-8 (out of interest) to store the files in the binary?

@jasnell
Copy link
Member

jasnell commented Feb 2, 2017

I would have to assume it is because the external string API does not directly support UTF8. That is, if you look at ExternalStringResource, it assumes a utf16_t* buffer while ExternalOneByteStringResource assumes const char*. If the strings were stored as UTF8 we would incur an additional cost at startup that is not necessary.

@aqrln
Copy link
Contributor

aqrln commented Feb 2, 2017

@gibfahn so that they can be used as external strings. UTF-8 strings must be copied to V8's managed memory. See #5458 by @bnoordhuis.

using v8::NewStringType;
using v8::Object;
using v8::String;

template <typename T, size_t N, T P>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny suggestion: Could we be a bit more explicit and rename P to Ptr or Pointer? 😄

@aqrln
Copy link
Contributor

aqrln commented Feb 3, 2017

@bnoordhuis ping. Did you have time to dumb it down? 😉 This patch seems to work on GCC 4.8+.

Diff

diff --git a/src/node_javascript.cc b/src/node_javascript.cc
index 8c665381db..07a0daaa83 100644
--- a/src/node_javascript.cc
+++ b/src/node_javascript.cc
@@ -16,16 +16,16 @@ using v8::String;
 template <typename T, size_t N, T P>
 struct ExternalStringResource;
 
-template <size_t N, const char (&P)[N]>
-struct ExternalStringResource<const char[N], N, P>
+template <size_t N, const char* P>
+struct ExternalStringResource<const char*, N, P>
     : public String::ExternalOneByteStringResource {
   const char* data() const override { return P; }
   size_t length() const override { return N; }
   void Dispose() override { /* Default calls `delete this`. */ }
 };
 
-template <size_t N, const uint16_t (&P)[N]>
-struct ExternalStringResource<const uint16_t[N], N, P>
+template <size_t N, const uint16_t* P>
+struct ExternalStringResource<const uint16_t*, N, P>
     : public String::ExternalStringResource {
   const uint16_t* data() const override { return P; }
   size_t length() const override { return N; }
@@ -34,7 +34,7 @@ struct ExternalStringResource<const uint16_t[N], N, P>
 
 // id##_data is defined in node_natives.h.
 #define V(id) \
-  static ExternalStringResource<decltype(id##_data),                          \
+  static ExternalStringResource<decltype(&id##_data[0]),                      \
                                 arraysize(id##_data),                         \
                                 id##_data> id##_external_data;
 NODE_NATIVES_MAP(V)
diff --git a/tools/js2c.py b/tools/js2c.py
index e6c56cf075..8104e50e5a 100755
--- a/tools/js2c.py
+++ b/tools/js2c.py
@@ -189,7 +189,7 @@ NODE_NATIVES_MAP = """\
 SOURCES = """\
 static const uint8_t {escaped_id}_name[] = {{
 {name}}};
-static const {ctype} {escaped_id}_data[] = {{
+static constexpr const {ctype} {escaped_id}_data[] = {{
 {data}}};
 """
 

@bnoordhuis
Copy link
Member Author

bnoordhuis commented Feb 6, 2017

@aqrln Thanks. I tried a couple of different approaches but I ended up settling for something that's pretty similar to your patch. :-)

I added a second commit that replaces a few U+2019 quotes with ASCII quotes so that their files can be stored as one-byte strings. Can I get a quick LGTM or two?

EDIT: The CI scoreboard is currently linking to a previous CI build. The correct one is https://ci.nodejs.org/job/node-test-commit/7713/.

@richardlau
Copy link
Member

I added a second commit that replaces a few U+2019 quotes with ASCII quotes so that their files can be stored as one-byte strings.

Probably one for another PR, but this sounds like a good candidate for a lint rule.

@bnoordhuis
Copy link
Member Author

bnoordhuis commented Feb 6, 2017

Interesting VS build failure...

an expression involving objects with internal linkage cannot be used as a non-type argument

I suppose the compiler is technically correct, and wasn't that the best kind of correct?

EDIT: New attempt: https://ci.nodejs.org/job/node-test-pull-request/6243/

@bnoordhuis
Copy link
Member Author

bnoordhuis commented Feb 11, 2017

Different take: moved all the logic to js2c.py.

CI: https://ci.nodejs.org/job/node-test-pull-request/6357/

EDIT: Sigh...

node_javascript.cc(55276): error C2466: cannot allocate an array of constant size 0

@bnoordhuis
Copy link
Member Author

CI is finally green except for that silly issue where the ARM buildbots don't report their status properly.

@addaleax The PR changed quite a bit, perhaps you want to take another look?

'inputs': [
'<@(library_files)',
'./config.gypi',
],
'outputs': [
'<(SHARED_INTERMEDIATE_DIR)/node_natives.h',
'<(SHARED_INTERMEDIATE_DIR)/node_javascript.cc',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd expect INTERMEDIATE_DIR to work here but it doesn't: js2c generates the file but the build doesn't pick it up (and I updated the 'sources' list in case you're wondering.) I decided to leave well enough alone for now.

Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: nodejs#10673
PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
The previous commit stores baked-in files with non-ASCII characters
as UTF-16.  Replace the \u2019 with a regular quote character so that
the files they're in can be stored as one-byte strings.  The UTF-16
functionality is still tested by the Unicode diagram in lib/timers.js.

PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
@bnoordhuis bnoordhuis closed this Feb 13, 2017
@bnoordhuis bnoordhuis deleted the fix10673 branch February 13, 2017 18:24
@bnoordhuis bnoordhuis merged commit fd8587e into nodejs:master Feb 13, 2017
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 14, 2017
Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: nodejs#10673
PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 14, 2017
The previous commit stores baked-in files with non-ASCII characters
as UTF-16.  Replace the \u2019 with a regular quote character so that
the files they're in can be stored as one-byte strings.  The UTF-16
functionality is still tested by the Unicode diagram in lib/timers.js.

PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
aqrln added a commit to aqrln/node that referenced this pull request Feb 16, 2017
In order to allow using Unicode characters inside comments of built-in
JavaScript libraries without forcing them to be stored as UTF-16 data in
Node's binary, update the tooling to strip comments during build
process. All line breaks are preserved so that line numbers in stack
traces aren't broken.

Refs: nodejs#11129
Refs: nodejs#11371 (comment)
aqrln added a commit to aqrln/node that referenced this pull request Feb 16, 2017
This test ensures that UTF-8 characters can be used in core JavaScript
modules built into Node's binary.

Refs: nodejs#11129
italoacasas added a commit that referenced this pull request Feb 16, 2017
Notable changes:

* deps:
    * update V8 to 5.5 (Michaël Zasso) [#11029](#11029)
    * upgrade libuv to 1.11.0 (cjihrig) [#11094](#11094)
    * add node-inspect 1.10.2 (Jan Krems) [#10187](#10187)
* lib: build `node inspect` into `node` (Anna Henningsen) [#10187](#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](#9469)
* inspector: add --inspect-brk (Josh Gavant) [#11149](#11149)
* fs: allow WHATWG URL and file: URLs as paths (James M Snell) [#10739](#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [#10857](#10857)

PR-URL: #11185
italoacasas added a commit to italoacasas/node that referenced this pull request Feb 21, 2017
Notable changes:

* deps:
    * update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029)
    * upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094)
    * add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187)
    * upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469)
* inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857)

PR-URL: nodejs#11185
italoacasas added a commit to italoacasas/node that referenced this pull request Feb 22, 2017
Notable changes:

* deps:
    * update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029)
    * upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094)
    * add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187)
    * upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469)
* inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857)

PR-URL: nodejs#11185
imyller added a commit to imyller/meta-nodejs that referenced this pull request Mar 2, 2017
    Notable changes:

    * deps:
        * update V8 to 5.5 (Michaël Zasso) [#11029](nodejs/node#11029)
        * upgrade libuv to 1.11.0 (cjihrig) [#11094](nodejs/node#11094)
        * add node-inspect 1.10.4 (Jan Krems) [#10187](nodejs/node#10187)
        * upgrade zlib to 1.2.11 (Sam Roberts) [#10980](nodejs/node#10980)
    * lib: build `node inspect` into `node` (Anna Henningsen) [#10187](nodejs/node#10187)
    * crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](nodejs/node#9469)
    * inspector: add --inspect-brk (Josh Gavant) [#11149](nodejs/node#11149)
    * fs: allow WHATWG URL objects as paths (James M Snell) [#10739](nodejs/node#10739)
    * src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](nodejs/node#11129)
    * url: extend url.format to support WHATWG URL (James M Snell) [#10857](nodejs/node#10857)

    PR-URL: nodejs/node#11185

Signed-off-by: Ilkka Myller <ilkka.myller@nodefield.com>
@jasnell
Copy link
Member

jasnell commented Mar 7, 2017

Needs a backport PR to land in v4.
Only one of the two commits was cherry-picked into v6 (the second one wasn't needed)

jasnell pushed a commit that referenced this pull request Mar 7, 2017
Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: #10673
PR-URL: #11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
MylesBorins pushed a commit that referenced this pull request Mar 9, 2017
Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: #10673
PR-URL: #11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Mar 9, 2017
jasnell pushed a commit that referenced this pull request Apr 4, 2017
This test ensures that UTF-8 characters can be used in core JavaScript
modules built into Node's binary.

PR-URL: #11423
Ref: #11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
italoacasas pushed a commit to italoacasas/node that referenced this pull request Apr 10, 2017
This test ensures that UTF-8 characters can be used in core JavaScript
modules built into Node's binary.

PR-URL: nodejs#11423
Ref: nodejs#11129
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants