Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undocumented punycode validation exception #14994

Closed
nponeccop opened this issue Aug 23, 2017 · 20 comments
Closed

Undocumented punycode validation exception #14994

nponeccop opened this issue Aug 23, 2017 · 20 comments
Assignees
Labels
feature request Issues that request new features to be added to Node.js. http Issues or PRs related to the http subsystem. url Issues and PRs related to the legacy built-in url module.

Comments

@nponeccop
Copy link

nponeccop commented Aug 23, 2017

  • Version: v8.4.0
  • Platform: Linux 4.12.8 i686
  • Subsystem: http
const http = require('http')

const badUrl = 'http://xn--a--a'
const goodUrl = 'http://a--a'

const c = http.get(goodUrl, () =>
    console.log('ok')
)
.on('error', () => console.log('error'))

The code above prints error for goodUrl and throws a exception for badUrl.

However, this throwing is not explained in the documentation of http.get() and http.request() methods. Nor in Url constructor etc.

The problem seems to stem from the fact that url.parse('http://xn--a--a').hostname.codePointAt(8) is 65533, i.e. url.parse tries to decode punycode which doesn't play well with Host header validation.

_http_outgoing.js:492
    throw new TypeError('The header content contains invalid characters');
    ^

TypeError: The header content contains invalid characters
    at validateHeader (_http_outgoing.js:492:11)
    at ClientRequest.setHeader (_http_outgoing.js:496:3)
    at new ClientRequest (_http_client.js:190:12)
    at request (http.js:39:10)
    at Object.get (http.js:43:13)
@jasnell
Copy link
Member

jasnell commented Aug 23, 2017

ah yes... what you would need to do in this case is convert the URL to ascii first and then pass it in to http.get(). I'll take a look to see if there's a fix we can get into core tho.

@jasnell jasnell added http Issues or PRs related to the http subsystem. url Issues and PRs related to the legacy built-in url module. feature request Issues that request new features to be added to Node.js. labels Aug 23, 2017
@jasnell
Copy link
Member

jasnell commented Aug 23, 2017

Marking this as a feature request... specifically, URLs passed in to http.get should not be converted to punycode in the host header.

@jasnell jasnell self-assigned this Aug 23, 2017
@nponeccop
Copy link
Author

It's other way around - I pass an ascii url and it gets converted from punycode.

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

node/lib/url.js

Lines 333 to 340 in d14c132

// IDNA Support: Returns a punycoded representation of "domain".
// It only converts parts of the domain name that
// have non-ASCII characters, i.e. it doesn't matter if
// you call it with a domain that already is ASCII-only.
// Use lenient mode (`true`) to try to support even non-compliant
// URLs.
this.hostname = toASCII(this.hostname, true);

I did a little research and turns out this line of code does the opposite. toASCII comes from icu. And

process.binding('icu').toASCII('xn--a--a', true)

adds that nasty unicode character.

toASCII is supposed to be a noop (identity) for DNS labels that only contain ASCII. So we need to go deeper into icu.

@TimothyGu
Copy link
Member

toASCII is supposed to be a noop (identity) for DNS labels that only contain ASCII.

We ruled against that in the URL Standard after whatwg/url#309 (comment) and whatwg/url#309 (comment).

I'll take a look.

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

To add some details, xn--a--a looks like Punycode, but it's invalid. Should toASCII() inject non-ascii chars into the output in the lenient mode? It's what it currently does.

Also, how should http.get() behave when it encounters malformed punycode? Currently it throws an exception which is technically acceptable if documented, but practically makes error handling cumbersome as we already have error events, close events and whatnot. IMHO it should treat it the same way it treats CONNECTIONREFUSED - by firing error event.

@bnoordhuis
Copy link
Member

Currently it throws an exception which is technically acceptable if documented, but practically makes error handling cumbersome as we already have error events, close events and whatnot.

It's consistent with how other malformed URLs are handled (e.g. http.get('file:///boom') throws an exception) and consistent with Node.js core's general philosophy vis-a-vis error handling: throw on malformed input, emit error events for run-time errors.

It sounds like this is a documentation issue.

@annevk
Copy link

annevk commented Aug 24, 2017

OP, I take it you can and do access http://xn--a--a in browsers? If so, the compatibility problem here is again browsers not universally applying ToASCII yet. Bugs have been filed to make them do so, as treating ASCII different from other character sets is unfair (and not validating xn-- as Punycode also makes little sense). However, if this kind of problem is widespread enough or browsers are not willing to change we should reconsider special casing ASCII-only input and endorse the rather silly behavior browsers have.

The frustrating thing here is that implementers from the various browser teams have been rather unresponsive to these concerns.

If this turns out to be a blocker please raise an issue over at https://github.com/whatwg/url/issues/new and I'll have another go at poking everyone.

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

It sounds like this is a documentation issue.

Yes, but only if the internal behaviour is correct. It's ok to throw an exception if documented, but now it throws because toASCII injects unicode into Host: field and header validator then throws. Isn't it more logical either to throw in the Punycode validator or not return unicode from the function named toASCII()?

OP, I take it you can and do access http://xn--a--a in browsers?

I don't care about what browsers do. In fact I don't have any target I want Node to be consistent with. And moreover you can choose whatever reaction on this URL you want as long as you find it logical and consistent. I'm not a specialist in Punycode/IDNA. So it's up to you as long as you provide enough documentation and explain rationale behind you decisions in this issue thread.

If you are interested in my opinion, I find it strange that URL normalization is done inside the HTTP client. I'd like it to be done outside to separate the concerns, so that HTTP client would only accept fully escaped URLs e.g. valid punycode in domain names and percent-style urlencoding in the URI part. But I think it's too late to change the design so you must followe the design decisions taken earlier and which I'm not aware of.

I think the logical problem is that now (after punycode and with human-readable non-ascii URI part in browsers' address bars) there are really 2 url formats: human-readable and machine-level, and url module tries to transparently recognize and normalize both.

But as I said above, I'll accept anything as a solution as long as you say it's consistent. I see 3 inconsistencies now:

  • The docs doesn't say that it throws
  • toASCII returns unicode. My understanding is that lenient mode should only return valid domain names so it should always be valid punicode in the output even if input contains errors.
  • The exception is thrown in a wrong place and by chance: the mistakenly injected unicode later hits Host: header validator and non-ascii is disallowed there

@annevk
Copy link

annevk commented Aug 24, 2017

Your understanding about ToASCII only applies to ToUnicode. ToASCII can and will fail at times. (I agree that in theory it would be nice if the HTTP code took a parsed URL and not a string.)

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

Oh sorry. I meant "toASCII returns punicode".

Since you don't understand me let's try step by step. Open node and enter

process.binding('icu').toASCII('xn--a--a', true)

You see that nasty char in the end. If you examine it closer with process.binding('icu').toASCII('xn--a--a', true).codePointAt(8) it prints 65533.

Is is correct behaviour of toASCII()?

@annevk
Copy link

annevk commented Aug 24, 2017

Not as far as I know.

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

Great! So it's the first part of the bug.

The second is that https://nodejs.org/dist/latest-v8.x/docs/api/http.html doesn't document that ClientRequest constructor throws. It does throw in many cases, for example new http.ClientRequest('file:///') throws 'Unable to determine the domain name'. So:

  • Throwing behaviour of http.ClientRequest is not documented
  • Neither it is for http.get and http.request and other places that construct ClientRequest internally
  • The examples neither contain full error handling nor say that more handling is required in case of untrusted or otherwise potentially invalid URLs.

Do you agree?

@TimothyGu
Copy link
Member

@nponeccop Yes I would say the replacement appended to the end of the domain name is a bug. However, I'm just not sure if we can fix it, given that we delegate IDNA handling to ICU. The most we can do is to error out on it, which may break existing use cases and certain one of the tests in our test suite (which was why I added the lenient mode in the first place).

@annevk The correct behavior is to error out on invalid domain names. We don't do that currently for compatibility.

@nponeccop
Copy link
Author

given that we delegate IDNA handling to ICU

Should I fill the bug to ICU then? Where is the place in the sources where you call ICU? The bug only happens in lenient mode. Does lenient mode implementation come from ICU too?

@bnoordhuis
Copy link
Member

but now it throws because toASCII injects unicode into Host: field and header validator then throws

That's a good point and I take back what I said; I noticed that the ICU and non-ICU builds behave differently here as the latter uses punycode.toASCII():

> punycode.toASCII('xn--a--a')  // no exception
'xn--a--a'

> process.binding('icu').toASCII('xn--a--a')
Error: Cannot convert name to ASCII
    at repl:1:24

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

Your example also lacks lenient mode which is used in http.get/url.parse (but your point is still valid)

@jasnell
Copy link
Member

jasnell commented Aug 24, 2017

I think it's also worth noting that this is a fundamental difference between url.parse() and new URL() (the former being the legacy non-standard way of handling URLs in Node.js and the latter being the WHATWG standard.

http.get() still uses the legacy url.parse() which suffers from this problem, while new URL() properly indicates that this is an invalid URL.

For instance:

> new url.URL('http://xn--a--a/testing')
TypeError: Invalid URL: http://xn--a--a/testing
    at Object.onParseError (internal/url.js:90:17)
    at parse (internal/url.js:99:11)
    at new URL (internal/url.js:193:5)
    at repl:1:1
    at ContextifyScript.Script.runInThisContext (vm.js:23:33)
    at REPLServer.defaultEval (repl.js:339:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.onLine (repl.js:536:10)
    at emitOne (events.js:101:20)
> url.parse('http://xn--a--a/test')
Url {
  protocol: 'http:',
  slashes: true,
  auth: null,
  host: 'xn--a--a�',
  port: null,
  hostname: 'xn--a--a�',
  hash: null,
  search: null,
  query: null,
  pathname: '/test',
  path: '/test',
  href: 'http://xn--a--a�/test' }
>

The code in http.get() does not attempt to do any actual validation of the data after url.parse() completes ... it simply takes the hostname and tries to set it as the value of the Host header and things blow up from there.

From my perspective this really isn't an issue with ICU or IDNA, it's an issue with the fact that the URL is not valid, the legacy url.parse() is not appropriately flagging it as such, and the http.get() code is not properly looking out for it.

In the long run, I would personally prefer http.get() to use new URL() rather than url.parse() and to be more strict with invalid URLs rather than less strict. I think the WHATWG URL wg made the right decision on how to handle these cases and that that decision should not be revisited.

But that's just my opinion.

@nponeccop
Copy link
Author

nponeccop commented Aug 24, 2017

Yes, this is a complex issue with many intertwining parts. For example, url.parse() is not merely misses the error. It also parses it incorrectly.

@jasnell
Copy link
Member

jasnell commented Aug 12, 2018

Recommend closing this due to inactivity for a year. The WHATWG URL parser should work as an alternative.

@jasnell jasnell closed this as completed Aug 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js. http Issues or PRs related to the http subsystem. url Issues and PRs related to the legacy built-in url module.
Projects
None yet
Development

No branches or pull requests

5 participants