Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: fix WHATWG URL host formatting error caused by port string #20493

Closed
wants to merge 4 commits into from
Closed

url: fix WHATWG URL host formatting error caused by port string #20493

wants to merge 4 commits into from

Conversation

peakji
Copy link
Contributor

@peakji peakji commented May 3, 2018

The current url.format implementation will return an invalid URL string without the host if there is a port and unicode: true.

This unexpected behavior is caused by domainToUnicode, which expects a hostname instead of a host string according to node_url.cc.

For example: url.domainToUnicode('xn--3kq375bonc.com') will be correctly converted into Chinese '搞事情.com', but url.domainToUnicode('xn--3kq375bonc.com:80') would produce an empty string.

Code to reproduce
'use strict';

const {format, URL} = require('url');

const myURL = new URL('http://examlpe.com:8080/path');

console.log(format(myURL, {unicode: false}));
// OK: 'http://examlpe.com:8080/path'

console.log(format(myURL, {unicode: true}));
// Error: 'http:///path' <-- the host is missing!
References
  1. RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
  2. RFC 3987 - Internationalized Resource Identifiers (IRIs)
  3. https://url.spec.whatwg.org/#concept-host
Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels May 3, 2018
@peakji
Copy link
Contributor Author

peakji commented May 3, 2018

This is a pretty serious bug and I'm wondering if these ambiguous names (host vs hostname) affected more stuffs in the code base?

ping @TimothyGu @jasnell @joyeecheung

@AyushG3112
Copy link
Contributor

cc @nodejs/url

@targos
Copy link
Member

targos commented May 3, 2018

Copy link
Member

@apapirovski apapirovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic needs to be fixed slightly.

@@ -400,7 +400,9 @@ Object.defineProperties(URL.prototype, {
ret += '@';
}
ret += options.unicode ?
domainToUnicode(this.host) : this.host;
domainToUnicode(this.hostname) : this.hostname;
if (ctx.port !== '')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be ctx.port !== null?

Copy link
Contributor Author

@peakji peakji May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WHATWG URL API uses '' as the empty value, not null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> new url.URL("https://nodejs.org/en/")
URL {
  href: 'https://nodejs.org/en/',
  origin: 'https://nodejs.org',
  protocol: 'https:',
  username: '',
  password: '',
  host: 'nodejs.org',
  hostname: 'nodejs.org',
  port: '',
  pathname: '/en/',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my bad! This is the internal context...

@@ -388,7 +388,7 @@ Object.defineProperties(URL.prototype, {
}, options);
const ctx = this[context];
var ret = ctx.scheme;
if (ctx.host !== null) {
if (ctx.hostname !== null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be ctx.host !== null. The hostname change only applies below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, URLContext uses null:

const { URL } = require('url');

const myURL = new URL('https://nodejs.org/en/');

const kContext = Object.getOwnPropertySymbols(myURL)[0];

console.log(myURL[kContext]);
URLContext {
  flags: 400,
  scheme: 'https:',
  username: '',
  password: '',
  host: 'nodejs.org',
  port: null,
  path: [ 'en', '' ],
  query: null,
  fragment: null }

@apapirovski apapirovski requested a review from TimothyGu May 3, 2018 11:10
@apapirovski
Copy link
Member

FWIW you might want to contribute your test case back to https://github.com/w3c/web-platform-tests/tree/master/url — clearly this is a case that's not covered by the official tests.

Copy link
Member

@apapirovski apapirovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be good to go now.

@apapirovski
Copy link
Member

@apapirovski apapirovski added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label May 3, 2018
@domenic
Copy link
Contributor

domenic commented May 3, 2018

FWIW you might want to contribute your test case back to w3c/web-platform-tests:url@master — clearly this is a case that's not covered by the official tests.

Although I agree with this general sentiment, the url.format() API is nonstandard, so it doesn't make sense to test this in the official tests.

@joyeecheung
Copy link
Member

joyeecheung commented May 3, 2018

Yeah, the unicode option of url.format is used to expose the internal "domain to Unicode" operation of WHATWG URL, in the context of that spec when the URL needs to be serialized the "domain to Unicode" operation is never run on the hostname, it's only used on the domain in URL rendering which is probably not even a thing in the context of Node.js EDIT: right it's probably a thing in say, REPLs or CLIs, or in places like electron or node-canvas.

Also as the name implies domainToUnicode runs on domains, and xn--3kq375bonc.com:80 is not a valid domain. The spec only talks about signaling validation errors and agents are encouraged to report them somewhere. In our parser we signal the error by returning an empty string.

@joyeecheung
Copy link
Member

@jasnell @nodejs/url I don't know if it makes sense to throw in the non-internal part of our JS API when the non-standard methods like domainToUnicode or domainToASCII fail? At least if we throw then this bug would be more obvious.

@jasnell
Copy link
Member

jasnell commented May 3, 2018

Throwing is fine, I think.

@trivikr
Copy link
Member

trivikr commented May 5, 2018

CI re-run for node-test-commit-arm-fanned: https://ci.nodejs.org/job/node-test-commit-arm-fanned/977/

@addaleax
Copy link
Member

addaleax commented May 6, 2018

@joyeecheung I can’t really tell whether your suggestion is something for this PR or not – should we wait with landing this or is it good to go?

@peakji
Copy link
Contributor Author

peakji commented May 7, 2018

@trivikr Could you help me to re-run node-test-commit as well?

@joyeecheung
Copy link
Member

@addaleax This PR is good to go, the error handling can be changed in another PR

@joyeecheung
Copy link
Member

CI: https://ci.nodejs.org/job/node-test-pull-request/14692/

@apapirovski
Copy link
Member

Landed in 2a96ee2

Congrats on your first commit in Node.js core and becoming a Contributor! 🎉

@apapirovski apapirovski closed this May 7, 2018
apapirovski pushed a commit that referenced this pull request May 7, 2018
The current url.format implementation will return an invalid URL string
without the host if there is a port and unicode: true.

This unexpected behavior is caused by domainToUnicode, which expects
a hostname instead of a host string according to node_url.cc.

Adds both a fix and a test for the issue.

PR-URL: #20493
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit that referenced this pull request May 8, 2018
The current url.format implementation will return an invalid URL string
without the host if there is a port and unicode: true.

This unexpected behavior is caused by domainToUnicode, which expects
a hostname instead of a host string according to node_url.cc.

Adds both a fix and a test for the issue.

PR-URL: #20493
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
@MylesBorins MylesBorins mentioned this pull request May 8, 2018
MylesBorins pushed a commit that referenced this pull request May 8, 2018
The current url.format implementation will return an invalid URL string
without the host if there is a port and unicode: true.

This unexpected behavior is caused by domainToUnicode, which expects
a hostname instead of a host string according to node_url.cc.

Adds both a fix and a test for the issue.

PR-URL: #20493
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit that referenced this pull request May 9, 2018
The current url.format implementation will return an invalid URL string
without the host if there is a port and unicode: true.

This unexpected behavior is caused by domainToUnicode, which expects
a hostname instead of a host string according to node_url.cc.

Adds both a fix and a test for the issue.

PR-URL: #20493
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.