Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: ignore IDN errors when domainname have two hyphens #12966

Closed

Conversation

zimbabao
Copy link
Contributor

@zimbabao zimbabao commented May 11, 2017

There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.

Fixes: #12965

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. labels May 11, 2017
Copy link
Contributor

@watilde watilde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This seems to be how the other URL parsers work.

@@ -487,6 +487,16 @@ const parseTests = {
path: '/bar'
},

'https://r4---sn-a5mlrn7s.gevideo.com/bar': {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding a test here, can you add an entry to test/fixtures/url-idna.js instead? Those cases are tested through test-whatwg-url-domainto.js. Setting mode to 'ascii' should be enough in this case.

Also, please add test cases with leading and trailing hyphens (and both) as well.

src/node_i18n.cc Outdated
@@ -516,6 +516,7 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf,
info.errors &= ~UIDNA_ERROR_EMPTY_LABEL;
info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG;
info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG;
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the following code, which filters out several more conditions mandated by the UTS#46 draft, and adds an explicit comment explaining why this is:

// These error conditions are mandated unconditionally by UTS #46 version
// 9.0.0 (rev. 17), but were found to be incompatible with many actual domain
// names in the wild. As such, in the current UTS #46 draft (rev. 18) these
// checks are made optional depending on the CheckHyphens flag, which will be
// disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46
// draft becomes standard.
// Refs:
// - https://github.com/whatwg/url/issues/53
// - http://www.unicode.org/review/pri317/
// - http://www.unicode.org/reports/tr46/tr46-18.html
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN;
info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;

I also suspect that these lines of code need to be duplicated for the ToUnicode method in the file, since there is a set of identical changes to ToUnicode in the UTS#46 draft.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Will do.
I checked those too. But was not able to find any real world example of those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyGu : Made changes as suggested. Filtering errors for trailing and leading hyphens.

@TimothyGu TimothyGu added dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels May 11, 2017
@domenic
Copy link
Contributor

domenic commented May 11, 2017

What is the policy for landing changes that aren't spec-compliant? Or is the intention to move the spec first?

@domenic domenic mentioned this pull request May 11, 2017
@zimbabao zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from c8622a2 to e97bc58 Compare May 11, 2017 17:34
src/node_i18n.cc Outdated
@@ -516,6 +531,22 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf,
info.errors &= ~UIDNA_ERROR_EMPTY_LABEL;
info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG;
info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG;
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this line now that you've added my suggestion :)

// 9.0.0 (rev. 17), but were found to be incompatible with many actual domain
// names in the wild. As such, in the current UTS #46 draft (rev. 18) these
// checks are made optional depending on the CheckHyphens flag, which will be
// disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

domain to Unicode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. done all changes.

@TimothyGu
Copy link
Member

@domenic I don't think we have encountered such situations before, but generally yes I think the intention is to change spec first. In this case, however, there is a disparity between the spec and what browsers do, so I think it is justified to make this change, given that upstream is aware of the problem.

@domenic
Copy link
Contributor

domenic commented May 11, 2017

Yeah, I guess that makes sense, although with luck we can use this to prod @annevk into action on a spec fix to at least match what Node.js intends to do here.

Note that you're not matching browsers with this patch, as per whatwg/url#267 they seem to do something quite different.

@zimbabao zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from e97bc58 to 928c796 Compare May 11, 2017 19:15
@jasnell
Copy link
Member

jasnell commented May 11, 2017

Hmm... I would definitely like to see this fixed but we need to make sure that (a) the behavior matches the browsers and (b) the spec issues gets resolved. I think I'd rather hold off on landing until we get confirmation on (a) and acknowledgement that (b) will happen.

@zimbabao
Copy link
Contributor Author

@jasnell : I just created a jsbin with all above style urls and https://jsbin.com/negusoweyu/edit?html,js,console,output . Browser (I tested on chrome, Safari, Firefox). They all parse these domains and resolves them and fires request. I added entries in /etc/hosts to get them resolved.

Do you thing thats enough or any other test would be needed?. Let me know.

@zimbabao zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from fdf73f6 to 69283aa Compare May 12, 2017 02:45
@domenic
Copy link
Contributor

domenic commented May 12, 2017

To be clear, from reading browser source code and running more exhaustive tests than those simple ones, you'll find that browsers do not apply ToASCII at all for domains with only-ASCII characters. That is not what this PR does. (Nor is it what the spec does.)

See more detail at the issue whatwg/url#267 I linked to above.

@annevk
Copy link

annevk commented May 12, 2017

It's unfortunate that what browsers do doesn't make any sense. It's the kind Latin-alphabet-favoritism we try to get rid of, although maybe it's minor enough not to matter...

But yeah, if you want to match what browsers do, only run ToASCII if you found a code point > U+007F. (I think that's a per domain determination, not per domain label.)

I don't really want to change the URL standard until I heard from implementers what we can actually do here, although I should probably add a warning.

@zimbabao
Copy link
Contributor Author

@domenic , @annevk , @TimothyGu : Will look at the browser code, tests and submit a change.
From @annevk's blog https://annevankesteren.nl/2017/02/idna-toascii-differences , all tested browsers seems to agree when domain label is ascii, so not running toASCII on ASCII only string is justified.

Shall I file tickets to browser vendors for other cases?, or you are already having that discussion?.

@domenic
Copy link
Contributor

domenic commented May 12, 2017

We're already having that discussion... again, please look at the issue I've pointed to twice now :)

@annevk
Copy link

annevk commented May 12, 2017

whatwg/url#309 has some additional information.

@joyeecheung
Copy link
Member

Looks like we will need to wait for the spec for the time being. We have a "Waiting on Spec" column in the Github project for WHATWG URL...for now I'll move this one there.

@annevk
Copy link

annevk commented May 13, 2017

FWIW, if it's possible for you to iterate quickly I would recommend implementing what browsers do now ("fast path" for ASCII) and later align with any standard changes if they end up being different. It doesn't seem worth it to inconvenience end users over this.

@zimbabao
Copy link
Contributor Author

@annevk : Thats what I was planning to do. Will do it over weekend.

@zimbabao
Copy link
Contributor Author

@annevk : I tried to get the fast toASCII for all ascii domain. But corresponding ToUnicode will still fail for domain names with hyphens at 3&4 places or with trailing or leading hyphens.

We can special case this by not running ToUnicode on domain labels not having "xn--" prefix.

Will send a different PR

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@annevk
Copy link

annevk commented May 15, 2017

ToUnicode can never fail. It can only signal errors, but it never fails. See http://www.unicode.org/reports/tr46/#ToUnicode for details.

@TimothyGu
Copy link
Member

@annevk, hmm I missed that line when reading (step 2). Will note in the future.

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
zimbabao added a commit to zimbabao/node that referenced this pull request May 16, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@zimbabao zimbabao reopened this May 18, 2017
@zimbabao
Copy link
Contributor Author

zimbabao commented May 18, 2017

Reopening this based on this comment.

info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN;
info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I was mistaken in my understanding of ToUnicode, which means that the entire block (ll. 464-485) shouldn't be necessary. While at it, can you delete this block, the erroneous || (!lenient && info.errors != 0) condition below, as well as the now-unnecessary lenient parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR LGTM except the ToUnicode issue above, but we can fix that at a later date if this PR should get in ASAP.

@zimbabao zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from 8e11070 to 5187a90 Compare May 19, 2017 00:19
zimbabao added 4 commits May 18, 2017 17:25
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.

Fixes: nodejs#12965
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Fixes: nodejs#12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Fixes: nodejs#12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Also fix error in ToUnicode

Fixes: nodejs#12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en
Refs: whatwg/url#309 (comment)
@zimbabao zimbabao force-pushed the fix-urls-with-hyphens-at-3-4-pos branch from 5187a90 to 5e0f58c Compare May 19, 2017 00:31
@zimbabao
Copy link
Contributor Author

We had to do hacks to get around this problems with try/catch and using older url module.

@zimbabao
Copy link
Contributor Author

@TimothyGu : I made the above changes.

Also based on commit to #309 shall we add checkHyphen param, will do that as separate PR.

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better! LGTM.

@zimbabao
Copy link
Contributor Author

@TimothyGu, @watilde : Can one you start a CI run for this please?.

@TimothyGu
Copy link
Member

@zimbabao
Copy link
Contributor Author

@TimothyGu : Anything more to get this landed?.

@TimothyGu
Copy link
Member

Nope, landed in 06a617a.

Note that for future contributions, the first line of the commit message should be under 50 characters per our commit message guidelines. I've amended the message this time, however.

Thanks again for staying through on this PR!

@TimothyGu TimothyGu closed this May 20, 2017
TimothyGu pushed a commit that referenced this pull request May 20, 2017
This commit contains three separate changes:

- Always return a string from ToUnicode no matter if an error occurred.
- Disable CheckHyphens boolean flag. This flag will soon be enabled in
  the URL Standard, but is implemented manually by selectively ignoring
  certain errors.
- Enable CheckBidi boolean flag per URL Standard update.

This allows domain names with hyphens at 3 and 4th position, as well as
those with leading and trailing hyphens. They are technically invalid,
but seen in the wild.

Tests are updated and simplified accordingly.

PR-URL: #12966
Fixes: #12965
Refs: whatwg/url#309
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>
@jasnell jasnell mentioned this pull request May 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WHATWG URL throw parse error for valid domains
8 participants