-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
url: fast path ascii domains, do not run ToASCII #13030
Changes from 2 commits
8233c34
35e901d
a13c377
d816988
e86297c
861604f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -130,6 +130,9 @@ enum url_error_cb_args { | |
return str.length() >= 2 && name(str[0], str[1]); \ | ||
} | ||
|
||
// https://infra.spec.whatwg.org/#ascii-code-point | ||
CHAR_TEST(8, IsASCIICodePoint, (ch >= '\0' && ch <= '\x7f')) | ||
|
||
// https://infra.spec.whatwg.org/#ascii-tab-or-newline | ||
CHAR_TEST(8, IsASCIITabOrNewline, (ch == '\t' || ch == '\n' || ch == '\r')) | ||
|
||
|
@@ -829,6 +832,16 @@ static url_host_type ParseOpaqueHost(url_host* host, | |
return type; | ||
} | ||
|
||
static inline bool IsAllASCII(const std::string& input) { | ||
for (size_t n = 0; n < input.size(); n++) { | ||
const char ch = input[n]; | ||
if (!IsASCIICodePoint(ch)) { | ||
return false; | ||
} | ||
} | ||
return true; | ||
} | ||
|
||
static url_host_type ParseHost(url_host* host, | ||
const char* input, | ||
size_t length, | ||
|
@@ -853,9 +866,18 @@ static url_host_type ParseHost(url_host* host, | |
// First, we have to percent decode | ||
PercentDecode(input, length, &decoded); | ||
|
||
// Then we have to punycode toASCII | ||
if (!ToASCII(&decoded, &decoded)) | ||
goto end; | ||
// Match browser behavior for ASCII only domains | ||
// and do not run them through ToASCII algorithm. | ||
if (IsAllASCII(decoded)) { | ||
// Lowercase ASCII domains | ||
for (size_t n = 0; n < decoded.size(); n++) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For converting to lowercase, consider converting multiple bytes at a time instead of 1 by 1. Here is one good example of how to do that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if this is that performance-sensitive, and even if it is C-level SIMD optimization still sounds like overkill to me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Technically it's not "proper" SIMD (e.g. SSE) unless the compiler is somehow smart enough to automatically convert it to that. Anyway, I still think it would be worth benchmarking... |
||
decoded[n] = ASCIILowercase(decoded[n]); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another optimization would be testing if the decoded domain contains only lowercase characters first. If it does, just use the original string and avoid assignments. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||
} | ||
} else { | ||
// Then we have to Unicode IDNA toASCII | ||
if (!ToASCII(&decoded, &decoded)) | ||
goto end; | ||
} | ||
|
||
// If any of the following characters are still present, we have to fail | ||
for (size_t n = 0; n < decoded.size(); n++) { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
'use strict'; | ||
|
||
module.exports = { | ||
valid: [ | ||
// URLs with hyphen | ||
{ | ||
ascii: 'r4---sn-a5mlrn7s.gevideo.com', | ||
unicode: 'r4---sn-a5mlrn7s.gevideo.com' | ||
}, | ||
{ | ||
ascii: '-sn-a5mlrn7s.gevideo.com', | ||
unicode: '-sn-a5mlrn7s.gevideo.com' | ||
}, | ||
{ | ||
ascii: 'sn-a5mlrn7s-.gevideo.com', | ||
unicode: 'sn-a5mlrn7s-.gevideo.com' | ||
}, | ||
{ | ||
ascii: '-sn-a5mlrn7s-.gevideo.com', | ||
unicode: '-sn-a5mlrn7s-.gevideo.com' | ||
}, | ||
{ | ||
ascii: '-sn--a5mlrn7s-.gevideo.com', | ||
unicode: '-sn--a5mlrn7s-.gevideo.com' | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,7 @@ const { domainToASCII, domainToUnicode } = require('url'); | |
|
||
// Tests below are not from WPT. | ||
const tests = require('../fixtures/url-idna.js'); | ||
const testsHyphenDomains = require('../fixtures/url-domains-with-hyphens.js'); | ||
|
||
{ | ||
const expectedError = common.expectsError( | ||
|
@@ -34,6 +35,13 @@ const tests = require('../fixtures/url-idna.js'); | |
} | ||
} | ||
|
||
{ | ||
for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) { | ||
assert.strictEqual(ascii, domainToASCII(unicode), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If it's only for testing that those domains won't get converted, maybe just make the test cases an array of string and There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh..never mind, it is supposed to check uppercase characters will be converted to lowercase ones...(is there any in the test cases?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Made the change, reverting it. Thanks for catching. |
||
`domainToASCII(${i + 1})`); | ||
} | ||
} | ||
|
||
{ | ||
const convertFunc = { | ||
ascii: domainToASCII, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen #13030 (comment)?