url: ignore IDN errors when domainname have two hyphens #12966

zimbabao · 2017-05-11T07:44:03Z

There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.

Fixes: #12965

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
documentation is changed or added
commit message follows commit guidelines

Affected core subsystem(s)

watilde

LGTM. This seems to be how the other URL parsers work.

TimothyGu · 2017-05-11T07:52:08Z

test/parallel/test-url-parse-format.js

@@ -487,6 +487,16 @@ const parseTests = {
    path: '/bar'
  },

+  'https://r4---sn-a5mlrn7s.gevideo.com/bar': {


Instead of adding a test here, can you add an entry to test/fixtures/url-idna.js instead? Those cases are tested through test-whatwg-url-domainto.js. Setting mode to 'ascii' should be enough in this case.

Also, please add test cases with leading and trailing hyphens (and both) as well.

TimothyGu · 2017-05-11T08:02:17Z

src/node_i18n.cc

@@ -516,6 +516,7 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf,
  info.errors &= ~UIDNA_ERROR_EMPTY_LABEL;
  info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG;
  info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG;
+  info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;


I suggest the following code, which filters out several more conditions mandated by the UTS#46 draft, and adds an explicit comment explaining why this is:

// These error conditions are mandated unconditionally by UTS #46 version // 9.0.0 (rev. 17), but were found to be incompatible with many actual domain // names in the wild. As such, in the current UTS #46 draft (rev. 18) these // checks are made optional depending on the CheckHyphens flag, which will be // disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46 // draft becomes standard. // Refs: // - https://github.com/whatwg/url/issues/53 // - http://www.unicode.org/review/pri317/ // - http://www.unicode.org/reports/tr46/tr46-18.html info.errors &= ~UIDNA_ERROR_HYPHEN_3_4; info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN; info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;

I also suspect that these lines of code need to be duplicated for the ToUnicode method in the file, since there is a set of identical changes to ToUnicode in the UTS#46 draft.

Yes. Will do.
I checked those too. But was not able to find any real world example of those.

@TimothyGu : Made changes as suggested. Filtering errors for trailing and leading hyphens.

domenic · 2017-05-11T16:31:58Z

What is the policy for landing changes that aren't spec-compliant? Or is the intention to move the spec first?

TimothyGu · 2017-05-11T18:48:44Z

src/node_i18n.cc

@@ -516,6 +531,22 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf,
  info.errors &= ~UIDNA_ERROR_EMPTY_LABEL;
  info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG;
  info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG;
+  info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;


You can remove this line now that you've added my suggestion :)

TimothyGu · 2017-05-11T18:48:59Z

src/node_i18n.cc

+  // 9.0.0 (rev. 17), but were found to be incompatible with many actual domain
+  // names in the wild. As such, in the current UTS #46 draft (rev. 18) these
+  // checks are made optional depending on the CheckHyphens flag, which will be
+  // disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46


domain to Unicode

Thanks. done all changes.

TimothyGu · 2017-05-11T19:07:25Z

@domenic I don't think we have encountered such situations before, but generally yes I think the intention is to change spec first. In this case, however, there is a disparity between the spec and what browsers do, so I think it is justified to make this change, given that upstream is aware of the problem.

domenic · 2017-05-11T19:10:21Z

Yeah, I guess that makes sense, although with luck we can use this to prod @annevk into action on a spec fix to at least match what Node.js intends to do here.

Note that you're not matching browsers with this patch, as per whatwg/url#267 they seem to do something quite different.

jasnell · 2017-05-11T22:38:03Z

Hmm... I would definitely like to see this fixed but we need to make sure that (a) the behavior matches the browsers and (b) the spec issues gets resolved. I think I'd rather hold off on landing until we get confirmation on (a) and acknowledgement that (b) will happen.

zimbabao · 2017-05-12T02:31:31Z

@jasnell : I just created a jsbin with all above style urls and https://jsbin.com/negusoweyu/edit?html,js,console,output . Browser (I tested on chrome, Safari, Firefox). They all parse these domains and resolves them and fires request. I added entries in /etc/hosts to get them resolved.

Do you thing thats enough or any other test would be needed?. Let me know.

domenic · 2017-05-12T03:30:16Z

To be clear, from reading browser source code and running more exhaustive tests than those simple ones, you'll find that browsers do not apply ToASCII at all for domains with only-ASCII characters. That is not what this PR does. (Nor is it what the spec does.)

See more detail at the issue whatwg/url#267 I linked to above.

annevk · 2017-05-12T03:59:36Z

It's unfortunate that what browsers do doesn't make any sense. It's the kind Latin-alphabet-favoritism we try to get rid of, although maybe it's minor enough not to matter...

But yeah, if you want to match what browsers do, only run ToASCII if you found a code point > U+007F. (I think that's a per domain determination, not per domain label.)

I don't really want to change the URL standard until I heard from implementers what we can actually do here, although I should probably add a warning.

zimbabao · 2017-05-12T15:57:05Z

@domenic , @annevk , @TimothyGu : Will look at the browser code, tests and submit a change.
From @annevk's blog https://annevankesteren.nl/2017/02/idna-toascii-differences , all tested browsers seems to agree when domain label is ascii, so not running toASCII on ASCII only string is justified.

Shall I file tickets to browser vendors for other cases?, or you are already having that discussion?.

domenic · 2017-05-12T16:19:49Z

We're already having that discussion... again, please look at the issue I've pointed to twice now :)

annevk · 2017-05-12T16:25:16Z

whatwg/url#309 has some additional information.

joyeecheung · 2017-05-12T16:50:34Z

Looks like we will need to wait for the spec for the time being. We have a "Waiting on Spec" column in the Github project for WHATWG URL...for now I'll move this one there.

annevk · 2017-05-13T04:39:25Z

FWIW, if it's possible for you to iterate quickly I would recommend implementing what browsers do now ("fast path" for ASCII) and later align with any standard changes if they end up being different. It doesn't seem worth it to inconvenience end users over this.

zimbabao · 2017-05-13T04:45:38Z

@annevk : Thats what I was planning to do. Will do it over weekend.

zimbabao · 2017-05-15T00:39:52Z

@annevk : I tried to get the fast toASCII for all ascii domain. But corresponding ToUnicode will still fail for domain names with hyphens at 3&4 places or with trailing or leading hyphens.

We can special case this by not running ToUnicode on domain labels not having "xn--" prefix.

Will send a different PR

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

annevk · 2017-05-15T06:15:47Z

ToUnicode can never fail. It can only signal errors, but it never fails. See http://www.unicode.org/reports/tr46/#ToUnicode for details.

TimothyGu · 2017-05-15T06:19:54Z

@annevk, hmm I missed that line when reading (step 2). Will note in the future.

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

zimbabao · 2017-05-18T16:31:19Z

Reopening this based on this comment.

TimothyGu · 2017-05-18T23:20:27Z

src/node_i18n.cc

+  info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
+  info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN;
+  info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;
+


Unfortunately I was mistaken in my understanding of ToUnicode, which means that the entire block (ll. 464-485) shouldn't be necessary. While at it, can you delete this block, the erroneous || (!lenient && info.errors != 0) condition below, as well as the now-unnecessary lenient parameter?

TimothyGu

This PR LGTM except the ToUnicode issue above, but we can fix that at a later date if this PR should get in ASAP.

There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Fixes: nodejs#12965

There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Also filters IDN errors when domain label start or end with hyphen. Fixes: nodejs#12965 Refs: https://www.icann.org/news/announcement-2000-01-07-en

There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Also filters IDN errors when domain label start or end with hyphen. Also fix error in ToUnicode Fixes: nodejs#12965 Refs: https://www.icann.org/news/announcement-2000-01-07-en Refs: whatwg/url#309 (comment)

zimbabao · 2017-05-19T00:55:18Z

We had to do hacks to get around this problems with try/catch and using older url module.

zimbabao · 2017-05-19T16:54:50Z

@TimothyGu : I made the above changes.

Also based on commit to #309 shall we add checkHyphen param, will do that as separate PR.

TimothyGu

Even better! LGTM.

zimbabao · 2017-05-20T04:02:47Z

@TimothyGu, @watilde : Can one you start a CI run for this please?.

TimothyGu · 2017-05-20T04:27:20Z

CI: https://ci.nodejs.org/job/node-test-pull-request/8202/

zimbabao · 2017-05-20T05:50:00Z

@TimothyGu : Anything more to get this landed?.

TimothyGu · 2017-05-20T07:37:58Z

Nope, landed in 06a617a.

Note that for future contributions, the first line of the commit message should be under 50 characters per our commit message guidelines. I've amended the message this time, however.

Thanks again for staying through on this PR!

This commit contains three separate changes: - Always return a string from ToUnicode no matter if an error occurred. - Disable CheckHyphens boolean flag. This flag will soon be enabled in the URL Standard, but is implemented manually by selectively ignoring certain errors. - Enable CheckBidi boolean flag per URL Standard update. This allows domain names with hyphens at 3 and 4th position, as well as those with leading and trailing hyphens. They are technically invalid, but seen in the wild. Tests are updated and simplified accordingly. PR-URL: #12966 Fixes: #12965 Refs: whatwg/url#309 Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Timothy Gu <timothygu99@gmail.com>

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. labels May 11, 2017

TimothyGu mentioned this pull request May 11, 2017

WHATWG URL throw parse error for valid domains #12965

Closed

watilde approved these changes May 11, 2017

View reviewed changes

TimothyGu suggested changes May 11, 2017

View reviewed changes

TimothyGu added dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels May 11, 2017

domenic mentioned this pull request May 11, 2017

IDNA whatwg/url#53

Closed

zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from c8622a2 to e97bc58 Compare May 11, 2017 17:34

TimothyGu reviewed May 11, 2017

View reviewed changes

zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from e97bc58 to 928c796 Compare May 11, 2017 19:15

zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from fdf73f6 to 69283aa Compare May 12, 2017 02:45

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017

url: fast path ascii domains, do not run ToASCII

8233c34

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017

url: fast path ascii domains, do not run ToASCII

a13c377

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017

url: fast path ascii domains, do not run ToASCII

d816988

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017

url: fast path ascii domains, do not run ToASCII

e86297c

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

zimbabao added a commit to zimbabao/node that referenced this pull request May 16, 2017

url: fast path ascii domains, do not run ToASCII

861604f

To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309

zimbabao reopened this May 18, 2017

TimothyGu mentioned this pull request May 18, 2017

Perhaps do not apply ToASCII for ASCII-only input whatwg/url#267

Closed

TimothyGu reviewed May 18, 2017

View reviewed changes

TimothyGu approved these changes May 18, 2017

View reviewed changes

zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from 8e11070 to 5187a90 Compare May 19, 2017 00:19

zimbabao added 4 commits May 18, 2017 17:25

url: ignore IDN errors when domainname have two hyphens

2ee246f

There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Fixes: nodejs#12965

zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from 5187a90 to 5e0f58c Compare May 19, 2017 00:31

TimothyGu approved these changes May 20, 2017

View reviewed changes

TimothyGu closed this May 20, 2017

jasnell mentioned this pull request May 28, 2017

8.0.0 Release Proposal #12220

Closed

Uh oh!

url: ignore IDN errors when domainname have two hyphens #12966

url: ignore IDN errors when domainname have two hyphens #12966

Uh oh!

Conversation

zimbabao commented May 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Affected core subsystem(s)

Uh oh!

watilde left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

domenic commented May 11, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TimothyGu commented May 11, 2017

Uh oh!

domenic commented May 11, 2017

Uh oh!

jasnell commented May 11, 2017

Uh oh!

zimbabao commented May 12, 2017

Uh oh!

domenic commented May 12, 2017

Uh oh!

annevk commented May 12, 2017

Uh oh!

zimbabao commented May 12, 2017

Uh oh!

domenic commented May 12, 2017

Uh oh!

annevk commented May 12, 2017

Uh oh!

joyeecheung commented May 12, 2017

Uh oh!

annevk commented May 13, 2017

Uh oh!

zimbabao commented May 13, 2017

Uh oh!

zimbabao commented May 15, 2017

Uh oh!

annevk commented May 15, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TimothyGu commented May 15, 2017

Uh oh!

zimbabao commented May 18, 2017 • edited by TimothyGu Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TimothyGu left a comment

Choose a reason for hiding this comment

Uh oh!

zimbabao commented May 19, 2017

Uh oh!

zimbabao commented May 19, 2017

Uh oh!

TimothyGu left a comment

Choose a reason for hiding this comment

Uh oh!

zimbabao commented May 20, 2017

Uh oh!

TimothyGu commented May 20, 2017

zimbabao commented May 11, 2017 •

edited

Loading

annevk commented May 15, 2017 •

edited

Loading

zimbabao commented May 18, 2017 •

edited by TimothyGu

Loading