src: support UTF-8 in compiled-in JS source files #11129
I rather feared this approach may be a bit too forward-looking for some compilers... I'll see if I can dumb it down.
tools/js2c.py
Outdated
# Treat non-ASCII as UTF-8 and convert it to UTF-16.
if any(ord(c) > 127 for c in lines):
  ctype = 'uint16_t'
  data = map(ord, lines.decode('utf-8').encode('utf-16be'))
Maybe 'utf-16be' if sys.byteorder == 'little' else 'utf-16le'?
The big-endian encoding is intentional: the line below stitches the individual octets back together into uint16s. (.encode() returns a byte string, in case you're wondering.)
Ah, I see. Right.
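To make the byte-order point concrete, here is a minimal Python 3 sketch of the stitching step described in the reply above. The function name `utf16_code_units` is hypothetical and does not come from tools/js2c.py (which was Python 2 at the time); only `bytes.decode` and `str.encode` are real library calls.

```python
from typing import List


def utf16_code_units(source: bytes) -> List[int]:
    """Decode UTF-8 input and return its UTF-16 code units as integers."""
    octets = source.decode('utf-8').encode('utf-16be')
    # 'utf-16be' puts the high octet first and emits no BOM, so each pair
    # can be recombined into one 16-bit value regardless of host byte order.
    return [(octets[i] << 8) | octets[i + 1] for i in range(0, len(octets), 2)]


units = utf16_code_units('it\u2019s'.encode('utf-8'))
print([hex(u) for u in units])
```

Because the values end up as integer literals in a uint16_t array, the C++ compiler lays them out in the target's native byte order, so the Python side has no need to consult sys.byteorder.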
Just curious, have there been specific issues or is this purely preventative?
What's the reason for using UTF-16 rather than UTF-8 (out of interest) to store the files in the binary?
I would have to assume it is because the external string API does not directly support UTF8. That is, if you look at
@gibfahn so that they can be used as external strings. UTF-8 strings must be copied to V8's managed memory. See #5458 by @bnoordhuis.
src/node_javascript.cc
Outdated
using v8::NewStringType;
using v8::Object;
using v8::String;

template <typename T, size_t N, T P>
Tiny suggestion: Could we be a bit more explicit and rename P to Ptr or Pointer? 😄
@bnoordhuis ping. Did you have time to dumb it down? 😉 This patch seems to work on GCC 4.8+:
diff --git a/src/node_javascript.cc b/src/node_javascript.cc
index 8c665381db..07a0daaa83 100644
--- a/src/node_javascript.cc
+++ b/src/node_javascript.cc
@@ -16,16 +16,16 @@ using v8::String;
template <typename T, size_t N, T P>
struct ExternalStringResource;
-template <size_t N, const char (&P)[N]>
-struct ExternalStringResource<const char[N], N, P>
+template <size_t N, const char* P>
+struct ExternalStringResource<const char*, N, P>
: public String::ExternalOneByteStringResource {
const char* data() const override { return P; }
size_t length() const override { return N; }
void Dispose() override { /* Default calls `delete this`. */ }
};
-template <size_t N, const uint16_t (&P)[N]>
-struct ExternalStringResource<const uint16_t[N], N, P>
+template <size_t N, const uint16_t* P>
+struct ExternalStringResource<const uint16_t*, N, P>
: public String::ExternalStringResource {
const uint16_t* data() const override { return P; }
size_t length() const override { return N; }
@@ -34,7 +34,7 @@ struct ExternalStringResource<const uint16_t[N], N, P>
// id##_data is defined in node_natives.h.
#define V(id) \
- static ExternalStringResource<decltype(id##_data), \
+ static ExternalStringResource<decltype(&id##_data[0]), \
arraysize(id##_data), \
id##_data> id##_external_data;
NODE_NATIVES_MAP(V)
diff --git a/tools/js2c.py b/tools/js2c.py
index e6c56cf075..8104e50e5a 100755
--- a/tools/js2c.py
+++ b/tools/js2c.py
@@ -189,7 +189,7 @@ NODE_NATIVES_MAP = """\
SOURCES = """\
static const uint8_t {escaped_id}_name[] = {{
{name}}};
-static const {ctype} {escaped_id}_data[] = {{
+static constexpr const {ctype} {escaped_id}_data[] = {{
{data}}};
"""
@aqrln Thanks. I tried a couple of different approaches but I ended up settling for something that's pretty similar to your patch. :-) I added a second commit that replaces a few U+2019 quotes with ASCII quotes so that their files can be stored as one-byte strings. Can I get a quick LGTM or two? EDIT: The CI scoreboard is currently linking to a previous CI build. The correct one is https://ci.nodejs.org/job/node-test-commit/7713/.
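The motivation for the second commit can be illustrated with a small Python 3 sketch: a single non-ASCII character forces the whole file into two-byte storage, roughly doubling its size in the binary. The helper names `choose_storage` and `stored_bytes` are hypothetical, and `'uint8_t'` as the one-byte element type is an assumption; the PR itself only shows `'uint16_t'` for the two-byte case.

```python
from typing import List, Tuple


def choose_storage(source: bytes) -> Tuple[str, List[int]]:
    """Return a (C element type, array values) pair for one source file."""
    if any(b > 127 for b in source):
        # Non-ASCII present: treat the input as UTF-8, store UTF-16 units.
        octets = source.decode('utf-8').encode('utf-16be')
        units = [(octets[i] << 8) | octets[i + 1]
                 for i in range(0, len(octets), 2)]
        return 'uint16_t', units
    # Pure ASCII: one array element per input byte (assumed element type).
    return 'uint8_t', list(source)


def stored_bytes(source: bytes) -> int:
    """Size of the emitted array in bytes."""
    ctype, values = choose_storage(source)
    return len(values) * (2 if ctype == 'uint16_t' else 1)


ascii_src = b"// don't panic\n" * 100
# Same text, but the first apostrophe is a U+2019 curly quote.
fancy_src = ascii_src.replace(b"'", '\u2019'.encode('utf-8'), 1)

print(stored_bytes(ascii_src), stored_bytes(fancy_src))
```

In this sketch, swapping one apostrophe for U+2019 doubles the stored size of the sample, which is the cost the ASCII-quote replacement avoids.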
Probably one for another PR, but this sounds like a good candidate for a lint rule.
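As a rough idea of what such a check could look like, here is a standalone Python 3 sketch that reports the position of every non-ASCII character in a piece of source text. It is not an actual lint rule from the Node.js tooling; the function name `find_non_ascii` and the sample input are made up for illustration.

```python
from typing import Iterator, Tuple


def find_non_ascii(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (line, column, character), 1-based, for non-ASCII characters."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col_no, ch


sample = "// it\u2019s fine\nconst x = 1;\n"
for line_no, col_no, ch in find_non_ascii(sample):
    print(f"{line_no}:{col_no}: non-ASCII character U+{ord(ch):04X}")
```

A real rule would presumably need an allowlist for files that contain non-ASCII characters on purpose, such as the Unicode diagram in lib/timers.js mentioned in the commit message.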
Interesting VS build failure...
I suppose the compiler is technically correct, and wasn't that the best kind of correct? EDIT: New attempt: https://ci.nodejs.org/job/node-test-pull-request/6243/
Different take: moved all the logic to js2c.py. CI: https://ci.nodejs.org/job/node-test-pull-request/6357/ EDIT: Sigh...
CI is finally green except for that silly issue where the ARM buildbots don't report their status properly. @addaleax The PR changed quite a bit, perhaps you want to take another look?
'inputs': [
  '<@(library_files)',
  './config.gypi',
],
'outputs': [
  '<(SHARED_INTERMEDIATE_DIR)/node_natives.h',
  '<(SHARED_INTERMEDIATE_DIR)/node_javascript.cc',
You'd expect INTERMEDIATE_DIR to work here but it doesn't: js2c generates the file but the build doesn't pick it up (and I updated the 'sources' list in case you're wondering.) I decided to leave well enough alone for now.
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8 and store them as UTF-16 in the binary so they can be used as external string resources without non-ASCII characters getting mangled. Fixes: nodejs#10673 PR-URL: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
The previous commit stores baked-in files with non-ASCII characters as UTF-16. Replace the \u2019 with a regular quote character so that the files they're in can be stored as one-byte strings. The UTF-16 functionality is still tested by the Unicode diagram in lib/timers.js. PR-URL: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
In order to allow using Unicode characters inside comments of built-in JavaScript libraries without forcing them to be stored as UTF-16 data in Node's binary, update the tooling to strip comments during build process. All line breaks are preserved so that line numbers in stack traces aren't broken. Refs: nodejs#11129 Refs: nodejs#11371 (comment)
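The comment-stripping idea in the commit above can be sketched in Python 3 as follows. This is a deliberately simplified illustration, not the code from that change: the name `strip_comments` is hypothetical, and the scanner only understands `//` comments, `/* */` comments, and single- or double-quoted string literals. Regular-expression and template literals are not handled.

```python
def strip_comments(source: str) -> str:
    """Drop // and /* */ comments but keep every newline in place."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in '\'"':
            # Copy a string literal verbatim, honoring backslash escapes.
            j = i + 1
            while j < n and source[j] != ch and source[j] != '\n':
                j += 2 if source[j] == '\\' else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif source.startswith('//', i):
            # Line comment: skip up to, but not including, the newline.
            while i < n and source[i] != '\n':
                i += 1
        elif source.startswith('/*', i):
            end = source.find('*/', i + 2)
            end = n if end == -1 else end + 2
            # Keep only the newlines so later line numbers are unchanged.
            out.append('\n' * source.count('\n', i, end))
            i = end
        else:
            out.append(ch)
            i += 1
    return ''.join(out)


src = "/* caf\u00e9\n * note */\nvar s = '// not a comment'; // tail\n"
stripped = strip_comments(src)
print(repr(stripped))
```

Since the output has the same number of newlines as the input, a stack trace pointing at line 3 of the stripped text still refers to line 3 of the original file.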
This test ensures that UTF-8 characters can be used in core JavaScript modules built into Node's binary. Refs: nodejs#11129
Notable changes: * deps: * update V8 to 5.5 (Michaël Zasso) [#11029](#11029) * upgrade libuv to 1.11.0 (cjihrig) [#11094](#11094) * add node-inspect 1.10.2 (Jan Krems) [#10187](#10187) * lib: build `node inspect` into `node` (Anna Henningsen) [#10187](#10187) * crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](#9469) * inspector: add --inspect-brk (Josh Gavant) [#11149](#11149) * fs: allow WHATWG URL and file: URLs as paths (James M Snell) [#10739](#10739) * src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](#11129) * url: extend url.format to support WHATWG URL (James M Snell) [#10857](#10857) PR-URL: #11185
Notable changes: * deps: * update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029) * upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094) * add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187) * upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980) * lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187) * crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469) * inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149) * fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739) * src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129) * url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857) PR-URL: nodejs#11185
Notable changes: * deps: * update V8 to 5.5 (Michaël Zasso) [#11029](nodejs/node#11029) * upgrade libuv to 1.11.0 (cjihrig) [#11094](nodejs/node#11094) * add node-inspect 1.10.4 (Jan Krems) [#10187](nodejs/node#10187) * upgrade zlib to 1.2.11 (Sam Roberts) [#10980](nodejs/node#10980) * lib: build `node inspect` into `node` (Anna Henningsen) [#10187](nodejs/node#10187) * crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](nodejs/node#9469) * inspector: add --inspect-brk (Josh Gavant) [#11149](nodejs/node#11149) * fs: allow WHATWG URL objects as paths (James M Snell) [#10739](nodejs/node#10739) * src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](nodejs/node#11129) * url: extend url.format to support WHATWG URL (James M Snell) [#10857](nodejs/node#10857) PR-URL: nodejs/node#11185 Signed-off-by: Ilkka Myller <ilkka.myller@nodefield.com>
Needs a backport PR to land in v4.
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8 and store them as UTF-16 in the binary so they can be used as external string resources without non-ASCII characters getting mangled. Fixes: #10673 PR-URL: #11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
This test ensures that UTF-8 characters can be used in core JavaScript modules built into Node's binary. PR-URL: #11423 Ref: #11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>
This test ensures that UTF-8 characters can be used in core JavaScript modules built into Node's binary. PR-URL: nodejs#11423 Ref: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.
CI: https://ci.nodejs.org/job/node-test-pull-request/6176/