url.parse misinterprets hostname #8520
I've confirmed this on both v0.10.32 and 0.11.15-pre. Curl will hit "evil.org", not "good.com". |
Would that simply be an invalid hostname? If so, any ideas on how that should be handled? |
I agree, the hostname is invalid. At the moment, during validation, the parsing function stops when it first encounters an invalid hostname part, adding everything else to the 'rest' of the string. This issue could be resolved by simply giving up when we encounter an invalid part and returning an empty object. This would also be consistent with the documentation, which promises only to return fields that existed in the query string. |
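To make the behaviour described above concrete, here is a minimal reproduction sketch. The `host` value matches the output reported later in this thread; the `pathname` value is an assumption inferred from the `https://good.com/+.evil.org/` interpretation mentioned in the linked commits.

```js
// Reproduction on Node v0.10.x / v0.11.x as described above.
// url.parse() stops the host at the first character its whitelist rejects
// ('+' here) and shunts the remainder of the string into the path.
var url = require('url');

var parsed = url.parse('https://good.com+.evil.org/');
console.log(parsed.host);     // 'good.com'      (reported later in this thread)
console.log(parsed.pathname); // '/+.evil.org/'  (assumed from the commit message)
```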
@trevnorris let's throw. |
I'm fine with throwing. Anyone want to open a PR for this? |
Was going to grab this (to get my feet wet), but I'm sure someone will do it much faster than I can get up and running with the tooling. Regarding the test I reference above: after looking at the parse code, it would seem that it's terminating the host based on nonHostChars, when I think it might be better served by using hostEndingChars instead. Also, throwing here seems awkward, as @cdlewis points out: it doesn't look like anywhere else in this function throws (the exception being the type check of the argument). What about simply changing the test I reference above from this...

```js
// an unexpected invalid char in the hostname.
'HtTp://x.y.cOm*a/b/c?d=e#f g<h>i' : {
  'href': 'http://x.y.com/*a/b/c?d=e#f%20g%3Ch%3Ei',
  'protocol': 'http:',
  'slashes': true,
  'host': 'x.y.com', // <---------------
  'hostname': 'x.y.com',
  'pathname': '/*a/b/c',
  'search': '?d=e',
  'query': 'd=e',
  'hash': '#f%20g%3Ch%3Ei',
  'path': '/*a/b/c?d=e'
},
```

...to this:

```js
// an unexpected invalid char in the hostname.
'HtTp://x.y.cOm*a/b/c?d=e#f g<h>i' : {
  'href': 'http://x.y.com*a/b/c?d=e#f%20g%3Ch%3Ei',
  'protocol': 'http:',
  'slashes': true,
  'host': 'x.y.com*a', // <----------------
  'hostname': 'x.y.com*a',
  'pathname': '/b/c',
  'search': '?d=e',
  'query': 'd=e',
  'hash': '#f%20g%3Ch%3Ei',
  'path': '/b/c?d=e'
},
```

It seems outside the scope here to check the validity of the host; the goal should just be to accurately extract it from what's provided. This would also solve the OP's problem, because then the host would be `good.com+.evil.org`. |
No, this is invalid and should not happen in |
A little confused by the response, maybe add a little detail? I just think throwing is a bad idea. This would make the common usage of |
Another option would be to make hostname empty (similar to how hostnames over the max length are handled). |
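For reference, here is a rough sketch of the over-max-length handling this option would mirror. The 255-character limit and the exact outputs are assumptions about the v0.10-era `lib/url.js`, not verified against a specific release.

```js
// Sketch: hostnames longer than the parser's max length (assumed to be 255
// characters) are blanked out rather than causing a throw, so the proposal
// above would make invalid-character hostnames behave the same way.
var url = require('url');

var tooLong = new Array(30).join('aaaaaaaaaa') + '.example.com'; // ~300 chars
var parsed = url.parse('http://' + tooLong + '/path');
console.log(parsed.hostname); // expected to be '' (empty) per the comments above
console.log(parsed.pathname); // presumably still '/path'
```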
I tend to agree with @jondavidjohn |
Regarding nodejs#8520 url.parse misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.
I'd be fine if we added |
I don't know why we'd expand this fix into API-changing territory. Throwing here would cause existing code to behave differently. Having a way to pre-check validity of the input is good. I don't think requiring

```js
if (url.isValid(str)) {
  var parsed = url.parse(str);
}
```

everywhere is a lot different than requiring

```js
try {
  var parsed = url.parse(str);
} catch (e) {}
```

If we decided to throw in this particular spot then, for the sake of consistency, we'd also want to throw when the hostname is too long. So, have you reviewed my pull above? Do I need to change it to the throwing approach instead? And if so, would it need to be targeted at master? |
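Note that `url.isValid` does not exist in Node's `url` module; a hypothetical userland stand-in for the pre-check style being compared above might look like this sketch.

```js
// Hypothetical helper, not a real Node API: approximates the proposed
// url.isValid() by parsing and checking that a hostname was recovered.
var url = require('url');

function isValidUrl(str) {
  var parsed = url.parse(str);
  // Treat a missing or blanked hostname as "invalid", mirroring the
  // empty-hostname behaviour discussed in this thread.
  return Boolean(parsed.protocol && parsed.hostname);
}

console.log(isValidUrl('https://good.com/')); // true
console.log(isValidUrl('not a url'));         // false (no protocol or hostname)
```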
As for changes to the current |
I feel like my questions / feedback are not being read. I understand this is a fairly insignificant issue in the scheme of things and likely has only a fraction of your attention; I'm just trying to fix the reporter's obvious issue consistently, within the constraints of the v0.10.x API expectations. Is this the wrong approach? |
-1 for throwing. That's a huge and backward-incompatible change to the calling convention. I like the idea of blanking the hostname for an invalid hostname; perhaps having an "invalid" sub-object with the bad parts assigned to it, or perhaps just dropping them. |
Following @Qard's comment and the fact that Curl hits `evil.org` |
Definitely not going to break backwards compatibility in v0.10. We can discuss what we want to do going forward, though. I'm not convinced throwing is the right approach in general, since it could possibly break a lot of users' existing code. Now, as for the results that @Qard showed: it looks like Chrome simply eats the bad URL. Not sure how I feel about that either. |
@trevnorris do you have any specific issue with the approach of simply returning an empty host/hostname field when an invalid hostname part is detected? This seems to be the safest "path of least surprise" option in my mind since this is how we currently handle hostnames that are too long. |
@trevnorris For what it's worth, Firefox works the same. The browsers are at least consistent in their wrongness. |
I've come around to the opinion that the Chrome 38/Firefox output that @Qard showed would be the desired output of `url.parse`. |
@chrisdickinson That would be my first choice. I can implement it this way instead; just need to wait and see which way y'all want to fix it. |
In reading the host parsing section of the spec, I see no mention of + being an invalid symbol in the host portion of the string. https://url.spec.whatwg.org/#host-parsing Is there somewhere it is documented as being invalid? |
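The whitelist-versus-blacklist distinction referenced in the commits linked in this thread can be sketched roughly as follows; these regexes are illustrative assumptions, not the actual patterns from `lib/url.js`.

```js
// Illustrative only: contrasting the two validation styles discussed here.
// Whitelist style: a hostname label is valid only if every character is
// explicitly allowed, so '+' is rejected and parsing stops early.
var labelWhitelist = /^[a-zA-Z0-9_-]+$/;

// Blacklist style (closer in spirit to the forbidden host code points in
// https://url.spec.whatwg.org/#host-parsing): a label is valid unless it
// contains an explicitly forbidden character, so '+' passes.
var labelBlacklist = /[\x00\t\n\r #%\/:?@[\\\]]/;

console.log(labelWhitelist.test('com+')); // false -> whitelist rejects '+'
console.log(labelBlacklist.test('com+')); // false -> no forbidden char, '+' is kept
```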
@tjfontaine @indutny Thoughts on having |
@Qard ah, you're right. |
@trevnorris So you guys would like a pull to align it with the browser behavior noted above by @Qard, then? With tests passing that look like this?

```js
// an unexpected invalid char in the hostname.
'http://x.y.com+a/b/c' : {
  'href': 'http://x.y.com+a/b/c',
  'protocol': 'http:',
  'slashes': true,
  'host': 'x.y.com+a', // <---------------
  'hostname': 'x.y.com+a',
  'pathname': '/b/c',
  'search': '',
  'query': '',
  'hash': '',
  'path': '/b/c'
},
```
|
@jondavidjohn I don't care so much about matching browser behavior as about following the spec. In this specific case |
@trevnorris PR is up, let me know if it's not what you had in mind. |
Regarding nodejs#8520 This approach changes hostname validation from a whitelist approach to a blacklist approach as described in https://url.spec.whatwg.org/#host-parsing. url.parse misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.
Regarding nodejs/node-v0.x-archive#8520 This changes hostname validation from a whitelist regex approach to a blacklist regex approach as described in https://url.spec.whatwg.org/#host-parsing. url.parse misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.
'+' is now considered a valid host character
This seems fixed in Node.js v0.12.

```
$ nvm use 0.10
Now using node v0.10.38
$ node -e "var parsed = require('url').parse('https://good.com+.evil.com/'); console.log(parsed.host);"
good.com
$ nvm use 0.12
Now using node v0.12.0
$ node -e "var parsed = require('url').parse('https://good.com+.evil.com/'); console.log(parsed.host);"
good.com+.evil.com
```
|
@joyent/node-coreteam ... this is fixed in v0.12, but not in v0.10. Do we want to backport? |
See: nodejs/node#49 |
Node v0.10.32: `url.parse` misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use `url.parse` to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website. Other characters than `+` might do the trick too.