url.parse misinterprets hostname #8520
I've confirmed this on both v0.10.32 and 0.11.15-pre. Curl will hit "evil.org", not "good.com". |
Would that simply be an invalid hostname? If so, any ideas on how that should be handled? |
I agree, the hostname is invalid. At the moment, during validation, the parsing function stops when it first encounters an invalid hostname part, adding everything else to the 'rest' of the string. This issue could be resolved by simply giving up when we encounter an invalid part and returning an empty object. This would also be consistent with the documentation, which promises only to return fields that existed in the query string. |
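To make the behaviour described above concrete, here is a minimal reproduction sketch. The `host` value matches the output reported later in this thread; the `pathname` value is an assumption inferred from the `https://good.com/+.evil.org/` interpretation mentioned in the linked commits.

```js
// Reproduction on Node v0.10.x / v0.11.x as described above.
// url.parse() stops the host at the first character its whitelist rejects
// ('+' here) and shunts the remainder of the string into the path.
var url = require('url');

var parsed = url.parse('https://good.com+.evil.org/');
console.log(parsed.host);     // 'good.com'      (reported later in this thread)
console.log(parsed.pathname); // '/+.evil.org/'  (assumed from the commit message)
```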
@trevnorris let's throw. |
I'm fine with throwing. Anyone want to open a PR for this? |
Was going to grab this (to get my feet wet), but I'm sure someone will do it much faster than I can get up and running with the tooling. Regarding the test I reference above: after looking at the parse code, it would seem that it's terminating the host based on nonHostChars, when I think it might be better served by using hostEndingChars instead. Also, throwing here seems awkward, as @cdlewis points out: it doesn't look like anywhere else in this function throws (the exception being the type check of the argument). What about simply changing the test I reference above from this...

```js
// an unexpected invalid char in the hostname.
'HtTp://x.y.cOm*a/b/c?d=e#f g<h>i' : {
  'href': 'http://x.y.com/*a/b/c?d=e#f%20g%3Ch%3Ei',
  'protocol': 'http:',
  'slashes': true,
  'host': 'x.y.com', // <---------------
  'hostname': 'x.y.com',
  'pathname': '/*a/b/c',
  'search': '?d=e',
  'query': 'd=e',
  'hash': '#f%20g%3Ch%3Ei',
  'path': '/*a/b/c?d=e'
},
```

...to this:

```js
// an unexpected invalid char in the hostname.
'HtTp://x.y.cOm*a/b/c?d=e#f g<h>i' : {
  'href': 'http://x.y.com*a/b/c?d=e#f%20g%3Ch%3Ei',
  'protocol': 'http:',
  'slashes': true,
  'host': 'x.y.com*a', // <----------------
  'hostname': 'x.y.com*a',
  'pathname': '/b/c',
  'search': '?d=e',
  'query': 'd=e',
  'hash': '#f%20g%3Ch%3Ei',
  'path': '/b/c?d=e'
},
```

It seems outside the scope here to check the validity of the host; the goal should just be to accurately extract it from what's provided. This would also solve the OP's problem, because then the host would be `good.com+.evil.org`. |
No, this is invalid and should not happen in |
A little confused by the response, maybe add a little detail? I just think throwing is a bad idea. This would make the common usage of |
Another option would be to make hostname empty (similar to how hostnames over the max length are handled). |
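For reference, here is a rough sketch of the over-max-length handling this option would mirror. The 255-character limit and the exact outputs are assumptions about the v0.10-era `lib/url.js`, not verified against a specific release.

```js
// Sketch: hostnames longer than the parser's max length (assumed to be 255
// characters) are blanked out rather than causing a throw, so the proposal
// above would make invalid-character hostnames behave the same way.
var url = require('url');

var tooLong = new Array(30).join('aaaaaaaaaa') + '.example.com'; // ~300 chars
var parsed = url.parse('http://' + tooLong + '/path');
console.log(parsed.hostname); // expected to be '' (empty) per the comments above
console.log(parsed.pathname); // presumably still '/path'
```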
I tend to agree with @jondavidjohn |
Regarding nodejs#8520 url.parse misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.
I'd be fine if we added |
I don't know why we'd expand this fix into API-changing territory. Throwing here would cause existing code to behave differently. Having a way to pre-check validity of the input is good. I don't think requiring

```js
if (url.isValid(str)) {
  var parsed = url.parse(str);
}
```

everywhere is a lot different than requiring

```js
try {
  var parsed = url.parse(str);
} catch (e) {}
```

If we decided to throw in this particular spot then, for the sake of consistency, we'd also want to throw when the hostname is too long. So, have you reviewed my pull above? Do I need to change it to the throwing approach instead? And if so, would it need to be targeted at master? |
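Note that `url.isValid` does not exist in Node's `url` module; a hypothetical userland stand-in for the pre-check style being compared above might look like this sketch.

```js
// Hypothetical helper, not a real Node API: approximates the proposed
// url.isValid() by parsing and checking that a hostname was recovered.
var url = require('url');

function isValidUrl(str) {
  var parsed = url.parse(str);
  // Treat a missing or blanked hostname as "invalid", mirroring the
  // empty-hostname behaviour discussed in this thread.
  return Boolean(parsed.protocol && parsed.hostname);
}

console.log(isValidUrl('https://good.com/')); // true
console.log(isValidUrl('not a url'));         // false (no protocol or hostname)
```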
As for changes to the current |
I feel like my questions / feedback are not being read. I understand this is a fairly insignificant issue in the scheme of things and likely has only a fraction of your attention; I'm just trying to fix the reporter's obvious issue consistently, within the constraints of the v0.10.x API expectations. Is this the wrong approach? |
-1 for throwing. That's a huge and backward-incompatible change to the calling convention. I like the idea of blanking the hostname for an invalid hostname; perhaps having an "invalid" sub-object with the bad parts assigned to it, or perhaps just dropping them. |
Following @Qard's comment and the fact that Curl hits `evil.org` |
Definitely not going to break backwards compatibility in v0.10. We can discuss what we want to do going forward, though. I'm not convinced throwing is the right approach in general, since it could possibly break a lot of users' existing code. Now, as for the results that @Qard showed: it looks like Chrome simply eats the bad URL. Not sure how I feel about that either. |
@trevnorris do you have any specific issue with the approach of simply returning an empty host/hostname field when an invalid hostname part is detected? This seems to be the safest "path of least surprise" option in my mind since this is how we currently handle hostnames that are too long. |
@trevnorris For what it's worth, Firefox works the same. The browsers are at least consistent in their wrongness. |
I've come around to the opinion that the Chrome 38/Firefox output that @Qard showed would be the desired output of `url.parse`. |
@chrisdickinson That would be my first choice. I can implement it this way instead; just need to wait and see which way y'all want to fix it. |
In reading the host parsing section of the spec, I see no mention of + being an invalid symbol in the host portion of the string. https://url.spec.whatwg.org/#host-parsing Is there somewhere it is documented as being invalid? |
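The whitelist-versus-blacklist distinction referenced in the commits linked in this thread can be sketched roughly as follows; these regexes are illustrative assumptions, not the actual patterns from `lib/url.js`.

```js
// Illustrative only: contrasting the two validation styles discussed here.
// Whitelist style: a hostname label is valid only if every character is
// explicitly allowed, so '+' is rejected and parsing stops early.
var labelWhitelist = /^[a-zA-Z0-9_-]+$/;

// Blacklist style (closer in spirit to the forbidden host code points in
// https://url.spec.whatwg.org/#host-parsing): a label is valid unless it
// contains an explicitly forbidden character, so '+' passes.
var labelBlacklist = /[\x00\t\n\r #%\/:?@[\\\]]/;

console.log(labelWhitelist.test('com+')); // false -> whitelist rejects '+'
console.log(labelBlacklist.test('com+')); // false -> no forbidden char, '+' is kept
```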
@tjfontaine @indutny Thoughts on having |
@Qard ah, you're right. |
@trevnorris So you guys would like a pull to align it with the browser behavior noted above by @Qard, then? With tests passing that look like this?

```js
// an unexpected invalid char in the hostname.
'http://x.y.com+a/b/c' : {
  'href': 'http://x.y.com+a/b/c',
  'protocol': 'http:',
  'slashes': true,
  'host': 'x.y.com+a', // <---------------
  'hostname': 'x.y.com+a',
  'pathname': '/b/c',
  'search': '',
  'query': '',
  'hash': '',
  'path': '/b/c'
},
```
|
@jondavidjohn I don't care so much about matching browser behavior as about following the spec. In this specific case |
@trevnorris PR is up, let me know if it's not what you had in mind. |
Regarding nodejs#8520 This approach changes hostname validation from a whitelist approach to a blacklist approach as described in https://url.spec.whatwg.org/#host-parsing. url.parse misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.
Regarding nodejs/node-v0.x-archive#8520 This changes hostname validation from a whitelist regex approach to a blacklist regex approach as described in https://url.spec.whatwg.org/#host-parsing. url.parse misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.
'+' is now considered a valid host character
This seems fixed in Node.js v0.12.

```
$ nvm use 0.10
Now using node v0.10.38
$ node -e "var parsed = require('url').parse('https://good.com+.evil.com/'); console.log(parsed.host);"
good.com
$ nvm use 0.12
Now using node v0.12.0
$ node -e "var parsed = require('url').parse('https://good.com+.evil.com/'); console.log(parsed.host);"
good.com+.evil.com
```
|
@joyent/node-coreteam ... this is fixed in v0.12, but not in v0.10. Do we want to backport? |
See: nodejs/node#49 |
Node v0.10.32: `url.parse` misinterpreted `https://good.com+.evil.org/` as `https://good.com/+.evil.org/`. If we use `url.parse` to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website. Other characters than `+` might do the trick too.