-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
url: fast path ascii domains, do not run ToASCII #13030
Conversation
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/node_url.cc
Outdated
static inline bool IsAllASCII(std::string* input) { | ||
for (size_t n = 0; n < input->size(); n++) { | ||
const char ch = (*input)[n]; | ||
if (ch & 0x80) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please declare a new CHAR_TEST in the dedicated section above (grep CHAR_TEST to find where it is):
// https://infra.spec.whatwg.org/#ascii-code-point
CHAR_TEST(8, IsASCIICodePoint, (ch >= '\0' && ch <= '\x7f'))
And use IsASCIICodePoint(ch)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyGu : Done all changes suggested.
src/node_url.cc
Outdated
@@ -829,6 +829,16 @@ static url_host_type ParseOpaqueHost(url_host* host, | |||
return type; | |||
} | |||
|
|||
static inline bool IsAllASCII(std::string* input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a const std::string&
here instead of std::string*
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can use contains_non_ascii()
from string_bytes.cc instead as an optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex : That function is inside anon namespace and is static. Do you off hand know any place where we can refactor it in some utility?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex : I checked the function, it optimizes for length >=16 . Most domain names will be smaller, so will it help?. We can try and benchmark it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well if nothing else it's good to reuse existing functionality where possible. That way if further optimizations are ever made to that function, everyone benefits automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it may be even better to just fold the checking and lowercasing into the same function and loop (reading and lowercasing 4 bytes at a time as long as possible). That way you're only iterating over the string once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex : I made changes to use existing contains_non_ascii.
Will check about folding checking and lowercasing into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex : folding both checking and lowercasing will result
either
- Code duplication as ContainsNonAscii is used at one more place. or
- Or have a function pointer as argument, its a problem with separation of concern. IIRC since function pointers are of statically know function it will get inlined.
Whats your opinion on this?.
I made changes to re-use ContainsNonAscii. Have a look at that when you get time.
src/node_url.cc
Outdated
// Match browser behavior for ASCII only domains | ||
// and do not run them through ToASCII algorithm. | ||
if (IsAllASCII(&decoded)) { | ||
// Lowercase aschii domains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASCII
src/node_url.cc
Outdated
decoded[n] = std::tolower(decoded[n]); | ||
} | ||
} else { | ||
// Then we have to punycode toASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/punycode/Unicode IDNA/
src/node_url.cc
Outdated
if (IsAllASCII(&decoded)) { | ||
// Lowercase aschii domains | ||
for (size_t n = 0; n < decoded.size(); n++) { | ||
decoded[n] = std::tolower(decoded[n]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASCIILowercase
src/node_url.cc
Outdated
// and do not run them through ToASCII algorithm. | ||
if (IsAllASCII(&decoded)) { | ||
// Lowercase aschii domains | ||
for (size_t n = 0; n < decoded.size(); n++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For converting to lowercase, consider converting multiple bytes at a time instead of 1 by 1. Here is one good example of how to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is that performance-sensitive, and even if it is C-level SIMD optimization still sounds like overkill to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically it's not "proper" SIMD (e.g. SSE) unless the compiler is somehow smart enough to automatically convert it to that. Anyway, I still think it would be worth benchmarking...
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
@@ -35,6 +36,13 @@ const tests = require('../fixtures/url-idna.js'); | |||
} | |||
|
|||
{ | |||
for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) { | |||
assert.strictEqual(ascii, domainToASCII(unicode), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's only for testing that those domains won't get converted, maybe just make the test cases an array of string and assert.strictEqual(domain, domainToASCII(domain))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh..never mind, it is supposed to check uppercase characters will be converted to lowercase ones...(is there any in the test cases?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change, reverting it. Thanks for catching.
src/node_url.cc
Outdated
if (IsAllASCII(decoded)) { | ||
// Lowercase ASCII domains | ||
for (size_t n = 0; n < decoded.size(); n++) { | ||
decoded[n] = ASCIILowercase(decoded[n]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another optimization would be testing if the decoded domain contains only lowercase characters first. If it does, just use the original string and avoid assignments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -35,6 +36,13 @@ const tests = require('../fixtures/url-idna.js'); | |||
} | |||
|
|||
{ | |||
for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) { | |||
assert.strictEqual(ascii, domainToASCII(unicode), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh..never mind, it is supposed to check uppercase characters will be converted to lowercase ones...(is there any in the test cases?)
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
src/node_url.cc
Outdated
@@ -829,6 +834,16 @@ static url_host_type ParseOpaqueHost(url_host* host, | |||
return type; | |||
} | |||
|
|||
static inline bool IsAllASCII(const std::string& input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen #13030 (comment)?
src/node_url.cc
Outdated
if (IsAllASCII(decoded)) { | ||
// Lowercase ASCII domains | ||
for (size_t n = 0; n < decoded.size(); n++) { | ||
if (!IsLowerCaseASCII(decoded[n])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this check; ASCIILowercase will do nothing if the code unit is already lower case or not a letter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this prevent idempotent writes to lowercase chars?. @joyeecheung that was the intent right?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was acutually talking about something like
bool needs_lowercase = false;
for (size_t n = 0; n < decoded.size(); n++) {
if (!IsLowerCaseASCII(decoded[n])) {
needs_lowercase = true;
break;
}
}
if (needs_lowercase) {
// convert decode
}
// otherwise no need to mutate decode, which should be the common case
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
src/string_utils.cc
Outdated
return false; | ||
} | ||
|
||
bool ContainsNonAscii(const char* src, size_t len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be inlined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, along with other changes.
src/node_url.cc
Outdated
if (!stringutils::ContainsNonAscii(buf, strlen(buf))) { | ||
// Lowercase ASCII domains | ||
for (size_t n = 0; n < decoded.size(); n++) { | ||
if (!IsLowerCaseASCII(decoded[n])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the IsLowerCaseASCII
is not needed.
src/string_utils.h
Outdated
|
||
#include "env.h" | ||
#include "env-inl.h" | ||
#include "util.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that some of these includes may not be needed.
0cdbd66
to
e0a9a33
Compare
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
e0a9a33
to
861604f
Compare
ping @TimothyGu, have your comments been addressed? |
@addaleax : I have another PR some problem which is as per very latest changes in spec #12966 . I'll close current one. //cc @TimothyGu |
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.
Fixes: #12965
Refs: #12966
Refs: whatwg/url#309
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passesAffected core subsystem(s)