Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: fast path ascii domains, do not run ToASCII #13030

Closed
wants to merge 6 commits into from

Conversation

zimbabao
Copy link
Contributor

To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: #12965
Refs: #12966
Refs: whatwg/url#309

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels May 15, 2017
Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the nits below:

/cc @domenic, @annevk

src/node_url.cc Outdated
static inline bool IsAllASCII(std::string* input) {
for (size_t n = 0; n < input->size(); n++) {
const char ch = (*input)[n];
if (ch & 0x80) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please declare a new CHAR_TEST in the dedicated section above (grep CHAR_TEST to find where it is):

// https://infra.spec.whatwg.org/#ascii-code-point
CHAR_TEST(8, IsASCIICodePoint, (ch >= '\0' && ch <= '\x7f'))

And use IsASCIICodePoint(ch) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyGu : Done all changes suggested.

src/node_url.cc Outdated
@@ -829,6 +829,16 @@ static url_host_type ParseOpaqueHost(url_host* host,
return type;
}

static inline bool IsAllASCII(std::string* input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a const std::string& here instead of std::string* too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can use contains_non_ascii() from string_bytes.cc instead as an optimization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex : That function is inside anon namespace and is static. Do you off hand know any place where we can refactor it in some utility?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex : I checked the function, it optimizes for length >=16 . Most domain names will be smaller, so will it help?. We can try and benchmark it.

Copy link
Contributor

@mscdex mscdex May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if nothing else it's good to reuse existing functionality where possible. That way if further optimizations are ever made to that function, everyone benefits automatically.

Copy link
Contributor

@mscdex mscdex May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it may be even better to just fold the checking and lowercasing into the same function and loop (reading and lowercasing 4 bytes at a time as long as possible). That way you're only iterating over the string once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex : I made changes to use existing contains_non_ascii.

Will check about folding checking and lowercasing into one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex : folding both checking and lowercasing will result
either

  1. Code duplication as ContainsNonAscii is used at one more place. or
  2. Or have a function pointer as argument, its a problem with separation of concern. IIRC since function pointers are of statically know function it will get inlined.

Whats your opinion on this?.

I made changes to re-use ContainsNonAscii. Have a look at that when you get time.

src/node_url.cc Outdated
// Match browser behavior for ASCII only domains
// and do not run them through ToASCII algorithm.
if (IsAllASCII(&decoded)) {
// Lowercase aschii domains
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASCII

src/node_url.cc Outdated
decoded[n] = std::tolower(decoded[n]);
}
} else {
// Then we have to punycode toASCII
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/punycode/Unicode IDNA/

src/node_url.cc Outdated
if (IsAllASCII(&decoded)) {
// Lowercase aschii domains
for (size_t n = 0; n < decoded.size(); n++) {
decoded[n] = std::tolower(decoded[n]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASCIILowercase

src/node_url.cc Outdated
// and do not run them through ToASCII algorithm.
if (IsAllASCII(&decoded)) {
// Lowercase aschii domains
for (size_t n = 0; n < decoded.size(); n++) {
Copy link
Contributor

@mscdex mscdex May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For converting to lowercase, consider converting multiple bytes at a time instead of 1 by 1. Here is one good example of how to do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is that performance-sensitive, and even if it is C-level SIMD optimization still sounds like overkill to me.

Copy link
Contributor

@mscdex mscdex May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it's not "proper" SIMD (e.g. SSE) unless the compiler is somehow smart enough to automatically convert it to that. Anyway, I still think it would be worth benchmarking...

To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@joyeecheung
Copy link
Member

@@ -35,6 +36,13 @@ const tests = require('../fixtures/url-idna.js');
}

{
for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) {
assert.strictEqual(ascii, domainToASCII(unicode),
Copy link
Member

@joyeecheung joyeecheung May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only for testing that those domains won't get converted, maybe just make the test cases an array of string and assert.strictEqual(domain, domainToASCII(domain))?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh..never mind, it is supposed to check uppercase characters will be converted to lowercase ones...(is there any in the test cases?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change, reverting it. Thanks for catching.

src/node_url.cc Outdated
if (IsAllASCII(decoded)) {
// Lowercase ASCII domains
for (size_t n = 0; n < decoded.size(); n++) {
decoded[n] = ASCIILowercase(decoded[n]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another optimization would be testing if the decoded domain contains only lowercase characters first. If it does, just use the original string and avoid assignments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -35,6 +36,13 @@ const tests = require('../fixtures/url-idna.js');
}

{
for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) {
assert.strictEqual(ascii, domainToASCII(unicode),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh..never mind, it is supposed to check uppercase characters will be converted to lowercase ones...(is there any in the test cases?)

To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
src/node_url.cc Outdated
@@ -829,6 +834,16 @@ static url_host_type ParseOpaqueHost(url_host* host,
return type;
}

static inline bool IsAllASCII(const std::string& input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen #13030 (comment)?

src/node_url.cc Outdated
if (IsAllASCII(decoded)) {
// Lowercase ASCII domains
for (size_t n = 0; n < decoded.size(); n++) {
if (!IsLowerCaseASCII(decoded[n])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this check; ASCIILowercase will do nothing if the code unit is already lower case or not a letter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this prevent idempotent writes to lowercase chars?. @joyeecheung that was the intent right?.

Copy link
Member

@joyeecheung joyeecheung May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was acutually talking about something like

bool needs_lowercase = false;
for (size_t n = 0; n < decoded.size(); n++) {
  if (!IsLowerCaseASCII(decoded[n])) {
      needs_lowercase = true;
      break;
  }
}

if (needs_lowercase) {
  // convert decode
}
// otherwise no need to mutate decode, which should be the common case

To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
return false;
}

bool ContainsNonAscii(const char* src, size_t len) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be inlined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, along with other changes.

src/node_url.cc Outdated
if (!stringutils::ContainsNonAscii(buf, strlen(buf))) {
// Lowercase ASCII domains
for (size_t n = 0; n < decoded.size(); n++) {
if (!IsLowerCaseASCII(decoded[n])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the IsLowerCaseASCII is not needed.


#include "env.h"
#include "env-inl.h"
#include "util.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that some of these includes may not be needed.

@zimbabao zimbabao force-pushed the fastpath-ascii-domains-2 branch 2 times, most recently from 0cdbd66 to e0a9a33 Compare May 16, 2017 15:12
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@mscdex mscdex added the performance Issues and PRs related to the performance of Node.js. label May 16, 2017
@addaleax
Copy link
Member

ping @TimothyGu, have your comments been addressed?

@zimbabao
Copy link
Contributor Author

@addaleax : I have another PR some problem which is as per very latest changes in spec #12966 . I'll close current one. //cc @TimothyGu

@zimbabao zimbabao closed this May 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. performance Issues and PRs related to the performance of Node.js. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WHATWG URL throw parse error for valid domains
6 participants