Skip to content

Conversation

@TimothyGu
Copy link
Member

Currently, the ICU-based IDNA conversion methods only return errors on those passed along through a UErrorCode. However, according to ICU's documentation for uidna_nameToASCII(),

If any processing step fails, then pInfo->errors will be non-zero and the result might not be an ASCII string. The domain name might be modified according to the types of errors. Labels with severe errors will be left in (or turned into) their Unicode form.

The UErrorCode indicates an error only in exceptional cases, such as a U_MEMORY_ALLOCATION_ERROR.

In other words, when non-catastrophically invalid domains are passed, ToASCII() and ToUnicode() (and their downstream url.domainToASCII() and url.domainToUnicode()) currently return garbled domain names instead of errors.

This PR makes the C++ binding methods report errors in pInfo->errors in addition to UErrorCode, thereby fixing those aforementioned problems.

Also included in this PR are additional tests for invalid situations as well as documentation clarifications for the user-facing url.domainToASCII() and url.domainToUnicode().

Before vs. after
> url.domainToASCII('\ufffd.com')
'�.com'
> url.domainToUnicode('xn---\x03.com')
'xn---\u0003.com'
> process.binding('icu').toASCII('\ufffd.com')
'�.com'
> process.binding('icu').toUnicode('xn---\x03.com')
'xn---\u0003.com'
> process.binding('icu').toUnicode('xn--- .com')
'xn--- .com'
> url.domainToASCII('\ufffd.com')
''
> url.domainToUnicode('xn---\x03.com')
''
> process.binding('icu').toASCII('\ufffd.com')
Error: Cannot convert name to ASCII
    at repl:1:24
> process.binding('icu').toUnicode('xn---\x03.com')
Error: Cannot convert name to Unicode
    at repl:1:24
> process.binding('icu').toUnicode('xn--- .com')
Error: Cannot convert name to Unicode
    at repl:1:24
Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. labels Feb 25, 2017
@TimothyGu TimothyGu added the whatwg-url Issues and PRs related to the WHATWG URL implementation. label Feb 25, 2017
@TimothyGu
Copy link
Member Author

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 25, 2017

Hopefully the issue with legacy url parser is fixed.

/cc @nodejs/intl @nodejs/url

New CI: https://ci.nodejs.org/job/node-test-pull-request/6586/

@TimothyGu TimothyGu added the url Issues and PRs related to the legacy built-in url module. label Feb 25, 2017
@TimothyGu TimothyGu changed the title src: bail on IDNA conversion error src: do not ignore IDNA conversion error Feb 25, 2017
doc/api/url.md Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should this be deserialization, and mention that it is the inverse of domainToASCII?

Copy link
Member Author

@TimothyGu TimothyGu Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is serialization, since the domain is fully parsed and subsequently serialized from the parsed form. It's just that it uses a different algorithm for deserialization.

src/node_i18n.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the args.Length() check above to use 2? Also, you probably want to add a CHECK(args[1]->IsBoolean()); or do args[1]->BooleanValue() instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't update the check for argument length, since (as the comment is trying to say) it is an optional argument, so that existing usage of toUnicode(str) would still work. V8 automatically returns an Undefined for out-of-range args[] dereference.

Wasn't aware of BooleanValue(). Will use that instead.

src/node_i18n.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ditto)

src/node_i18n.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error part of any non-experimental API? Could we change it to Cannot encode name to ASCII as Punycode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes for toASCII

> url.parse(`http://${'é'.repeat(230)}.com/`)
Error: Cannot convert name to ASCII

src/node_i18n.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error part of any non-experimental API? Could we change it to Cannot decode name as Punycode? (basically the same question I also posted below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No; in fact the toUnicode JS function isn't used in the code base at all. Maybe we should just remove this method?

/cc @jasnell

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not used, it can be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove which function specifically? The `i18n::ToUnicode' function is definitely used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasnell, the exposed process.binding('icu').toUnicode() JS function.

src/node_i18n.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this compile? Seems like the env->context() argument is missing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addaleax, you are right. Forgot to push fde77b3

src/node_i18n.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not used, it can be removed.

Copy link
Member

@joyeecheung joyeecheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also fix the missing errors when parsing percent-encoded disallowed characters in hosts(https://github.com/nodejs/node/blob/master/test/fixtures/url-tests.js#L4499) since we are no longer ignoring UIDNA_ERROR_DISALLOWED, you can turn them on in this PR if you like.

@jasnell
Copy link
Member

jasnell commented Feb 27, 2017

@TimothyGu
Copy link
Member Author

@jasnell, did you see #11549 (comment)?

Old behavior can be restored using a special `lenient` mode.
- Split the tests out to a separate file
- Add invalid cases
- Add tests for url.domainTo*()
- Re-enable previously broken WPT URL parsing tests
@TimothyGu
Copy link
Member Author

Test re-enabled per @joyeecheung. Will land tomorrow.

CI: https://ci.nodejs.org/job/node-test-pull-request/6619/

@TimothyGu
Copy link
Member Author

TimothyGu commented Mar 1, 2017

Landed in a520508...7ceea2a. toUnicode() isn't removed for now, since it provides equivalence to the punycode module, and though unused in the code base it is well-tested.

TimothyGu added a commit that referenced this pull request Mar 1, 2017
Old behavior can be restored using a special `lenient` mode, as used in
the legacy URL parser.

PR-URL: #11549
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
- Split the tests out to a separate file
- Add invalid cases
- Add tests for url.domainTo*()
- Re-enable previously broken WPT URL parsing tests

PR-URL: #11549
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
PR-URL: #11549
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
@TimothyGu TimothyGu closed this Mar 1, 2017
@TimothyGu TimothyGu deleted the idna branch March 1, 2017 02:33
addaleax pushed a commit that referenced this pull request Mar 5, 2017
Old behavior can be restored using a special `lenient` mode, as used in
the legacy URL parser.

PR-URL: #11549
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
addaleax pushed a commit that referenced this pull request Mar 5, 2017
- Split the tests out to a separate file
- Add invalid cases
- Add tests for url.domainTo*()
- Re-enable previously broken WPT URL parsing tests

PR-URL: #11549
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
addaleax pushed a commit that referenced this pull request Mar 5, 2017
PR-URL: #11549
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
@evanlucas evanlucas mentioned this pull request Mar 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. url Issues and PRs related to the legacy built-in url module. whatwg-url Issues and PRs related to the WHATWG URL implementation.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants