Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: fast path ascii domains, do not run ToASCII #13030

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 25 additions & 3 deletions src/node_url.cc
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,9 @@ enum url_error_cb_args {
return str.length() >= 2 && name(str[0], str[1]); \
}

// https://infra.spec.whatwg.org/#ascii-code-point
CHAR_TEST(8, IsASCIICodePoint, (ch >= '\0' && ch <= '\x7f'))

// https://infra.spec.whatwg.org/#ascii-tab-or-newline
CHAR_TEST(8, IsASCIITabOrNewline, (ch == '\t' || ch == '\n' || ch == '\r'))

Expand Down Expand Up @@ -829,6 +832,16 @@ static url_host_type ParseOpaqueHost(url_host* host,
return type;
}

static inline bool IsAllASCII(const std::string& input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen #13030 (comment)?

for (size_t n = 0; n < input.size(); n++) {
const char ch = input[n];
if (!IsASCIICodePoint(ch)) {
return false;
}
}
return true;
}

static url_host_type ParseHost(url_host* host,
const char* input,
size_t length,
Expand All @@ -853,9 +866,18 @@ static url_host_type ParseHost(url_host* host,
// First, we have to percent decode
PercentDecode(input, length, &decoded);

// Then we have to punycode toASCII
if (!ToASCII(&decoded, &decoded))
goto end;
// Match browser behavior for ASCII only domains
// and do not run them through ToASCII algorithm.
if (IsAllASCII(decoded)) {
// Lowercase ASCII domains
for (size_t n = 0; n < decoded.size(); n++) {
Copy link
Contributor

@mscdex mscdex May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For converting to lowercase, consider converting multiple bytes at a time instead of 1 by 1. Here is one good example of how to do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is that performance-sensitive, and even if it is C-level SIMD optimization still sounds like overkill to me.

Copy link
Contributor

@mscdex mscdex May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it's not "proper" SIMD (e.g. SSE) unless the compiler is somehow smart enough to automatically convert it to that. Anyway, I still think it would be worth benchmarking...

decoded[n] = ASCIILowercase(decoded[n]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another optimization would be testing if the decoded domain contains only lowercase characters first. If it does, just use the original string and avoid assignments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
} else {
// Then we have to Unicode IDNA toASCII
if (!ToASCII(&decoded, &decoded))
goto end;
}

// If any of the following characters are still present, we have to fail
for (size_t n = 0; n < decoded.size(); n++) {
Expand Down
27 changes: 27 additions & 0 deletions test/fixtures/url-domains-with-hyphens.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
'use strict';

module.exports = {
valid: [
// URLs with hyphen
{
ascii: 'r4---sn-a5mlrn7s.gevideo.com',
unicode: 'r4---sn-a5mlrn7s.gevideo.com'
},
{
ascii: '-sn-a5mlrn7s.gevideo.com',
unicode: '-sn-a5mlrn7s.gevideo.com'
},
{
ascii: 'sn-a5mlrn7s-.gevideo.com',
unicode: 'sn-a5mlrn7s-.gevideo.com'
},
{
ascii: '-sn-a5mlrn7s-.gevideo.com',
unicode: '-sn-a5mlrn7s-.gevideo.com'
},
{
ascii: '-sn--a5mlrn7s-.gevideo.com',
unicode: '-sn--a5mlrn7s-.gevideo.com'
}
]
}
8 changes: 8 additions & 0 deletions test/parallel/test-whatwg-url-domainto.js
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ const { domainToASCII, domainToUnicode } = require('url');

// Tests below are not from WPT.
const tests = require('../fixtures/url-idna.js');
const testsHyphenDomains = require('../fixtures/url-domains-with-hyphens.js');

{
const expectedError = common.expectsError(
Expand All @@ -34,6 +35,13 @@ const tests = require('../fixtures/url-idna.js');
}
}

{
for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) {
assert.strictEqual(ascii, domainToASCII(unicode),
Copy link
Member

@joyeecheung joyeecheung May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only for testing that those domains won't get converted, maybe just make the test cases an array of string and assert.strictEqual(domain, domainToASCII(domain))?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh..never mind, it is supposed to check uppercase characters will be converted to lowercase ones...(is there any in the test cases?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change, reverting it. Thanks for catching.

`domainToASCII(${i + 1})`);
}
}

{
const convertFunc = {
ascii: domainToASCII,
Expand Down