Undocumented punycode validation exception #14994

nponeccop · 2017-08-23T21:21:27Z

Version: v8.4.0
Platform: Linux 4.12.8 i686
Subsystem: http

const http = require('http')

const badUrl = 'http://xn--a--a'
const goodUrl = 'http://a--a'

const c = http.get(goodUrl, () =>
    console.log('ok')
)
.on('error', () => console.log('error'))

The code above prints error for goodUrl and throws a exception for badUrl.

However, this throwing is not explained in the documentation of http.get() and http.request() methods. Nor in Url constructor etc.

The problem seems to stem from the fact that url.parse('http://xn--a--a').hostname.codePointAt(8) is 65533, i.e. url.parse tries to decode punycode which doesn't play well with Host header validation.

_http_outgoing.js:492
    throw new TypeError('The header content contains invalid characters');
    ^

TypeError: The header content contains invalid characters
    at validateHeader (_http_outgoing.js:492:11)
    at ClientRequest.setHeader (_http_outgoing.js:496:3)
    at new ClientRequest (_http_client.js:190:12)
    at request (http.js:39:10)
    at Object.get (http.js:43:13)

The text was updated successfully, but these errors were encountered:

jasnell · 2017-08-23T21:43:29Z

ah yes... what you would need to do in this case is convert the URL to ascii first and then pass it in to http.get(). I'll take a look to see if there's a fix we can get into core tho.

jasnell · 2017-08-23T21:44:47Z

Marking this as a feature request... specifically, URLs passed in to http.get should not be converted to punycode in the host header.

nponeccop · 2017-08-23T23:10:21Z

It's other way around - I pass an ascii url and it gets converted from punycode.

nponeccop · 2017-08-24T03:12:53Z

node/lib/url.js

Lines 333 to 340 in d14c132

    
           // IDNA Support: Returns a punycoded representation of "domain". 
        
           // It only converts parts of the domain name that 
        
           // have non-ASCII characters, i.e. it doesn't matter if 
        
           // you call it with a domain that already is ASCII-only. 
        
           // Use lenient mode (`true`) to try to support even non-compliant 
        
           // URLs. 
        
           this.hostname = toASCII(this.hostname, true);

I did a little research and turns out this line of code does the opposite. toASCII comes from icu. And

process.binding('icu').toASCII('xn--a--a', true)

adds that nasty unicode character.

toASCII is supposed to be a noop (identity) for DNS labels that only contain ASCII. So we need to go deeper into icu.

TimothyGu · 2017-08-24T03:19:56Z

toASCII is supposed to be a noop (identity) for DNS labels that only contain ASCII.

We ruled against that in the URL Standard after whatwg/url#309 (comment) and whatwg/url#309 (comment).

I'll take a look.

nponeccop · 2017-08-24T03:26:07Z

To add some details, xn--a--a looks like Punycode, but it's invalid. Should toASCII() inject non-ascii chars into the output in the lenient mode? It's what it currently does.

Also, how should http.get() behave when it encounters malformed punycode? Currently it throws an exception which is technically acceptable if documented, but practically makes error handling cumbersome as we already have error events, close events and whatnot. IMHO it should treat it the same way it treats CONNECTIONREFUSED - by firing error event.

bnoordhuis · 2017-08-24T05:40:41Z

Currently it throws an exception which is technically acceptable if documented, but practically makes error handling cumbersome as we already have error events, close events and whatnot.

It's consistent with how other malformed URLs are handled (e.g. http.get('file:///boom') throws an exception) and consistent with Node.js core's general philosophy vis-a-vis error handling: throw on malformed input, emit error events for run-time errors.

It sounds like this is a documentation issue.

annevk · 2017-08-24T07:11:03Z

OP, I take it you can and do access http://xn--a--a in browsers? If so, the compatibility problem here is again browsers not universally applying ToASCII yet. Bugs have been filed to make them do so, as treating ASCII different from other character sets is unfair (and not validating xn-- as Punycode also makes little sense). However, if this kind of problem is widespread enough or browsers are not willing to change we should reconsider special casing ASCII-only input and endorse the rather silly behavior browsers have.

The frustrating thing here is that implementers from the various browser teams have been rather unresponsive to these concerns.

If this turns out to be a blocker please raise an issue over at https://github.com/whatwg/url/issues/new and I'll have another go at poking everyone.

nponeccop · 2017-08-24T12:29:09Z

It sounds like this is a documentation issue.

Yes, but only if the internal behaviour is correct. It's ok to throw an exception if documented, but now it throws because toASCII injects unicode into Host: field and header validator then throws. Isn't it more logical either to throw in the Punycode validator or not return unicode from the function named toASCII()?

OP, I take it you can and do access http://xn--a--a in browsers?

I don't care about what browsers do. In fact I don't have any target I want Node to be consistent with. And moreover you can choose whatever reaction on this URL you want as long as you find it logical and consistent. I'm not a specialist in Punycode/IDNA. So it's up to you as long as you provide enough documentation and explain rationale behind you decisions in this issue thread.

If you are interested in my opinion, I find it strange that URL normalization is done inside the HTTP client. I'd like it to be done outside to separate the concerns, so that HTTP client would only accept fully escaped URLs e.g. valid punycode in domain names and percent-style urlencoding in the URI part. But I think it's too late to change the design so you must followe the design decisions taken earlier and which I'm not aware of.

I think the logical problem is that now (after punycode and with human-readable non-ascii URI part in browsers' address bars) there are really 2 url formats: human-readable and machine-level, and url module tries to transparently recognize and normalize both.

But as I said above, I'll accept anything as a solution as long as you say it's consistent. I see 3 inconsistencies now:

The docs doesn't say that it throws
toASCII returns unicode. My understanding is that lenient mode should only return valid domain names so it should always be valid punicode in the output even if input contains errors.
The exception is thrown in a wrong place and by chance: the mistakenly injected unicode later hits Host: header validator and non-ascii is disallowed there

annevk · 2017-08-24T12:37:13Z

Your understanding about ToASCII only applies to ToUnicode. ToASCII can and will fail at times. (I agree that in theory it would be nice if the HTTP code took a parsed URL and not a string.)

nponeccop · 2017-08-24T12:44:28Z

Oh sorry. I meant "toASCII returns punicode".

Since you don't understand me let's try step by step. Open node and enter

process.binding('icu').toASCII('xn--a--a', true)

You see that nasty char in the end. If you examine it closer with process.binding('icu').toASCII('xn--a--a', true).codePointAt(8) it prints 65533.

Is is correct behaviour of toASCII()?

annevk · 2017-08-24T12:50:34Z

Not as far as I know.

nponeccop · 2017-08-24T13:41:32Z

Great! So it's the first part of the bug.

The second is that https://nodejs.org/dist/latest-v8.x/docs/api/http.html doesn't document that ClientRequest constructor throws. It does throw in many cases, for example new http.ClientRequest('file:///') throws 'Unable to determine the domain name'. So:

Throwing behaviour of http.ClientRequest is not documented
Neither it is for http.get and http.request and other places that construct ClientRequest internally
The examples neither contain full error handling nor say that more handling is required in case of untrusted or otherwise potentially invalid URLs.

Do you agree?

TimothyGu · 2017-08-24T13:54:39Z

@nponeccop Yes I would say the replacement appended to the end of the domain name is a bug. However, I'm just not sure if we can fix it, given that we delegate IDNA handling to ICU. The most we can do is to error out on it, which may break existing use cases and certain one of the tests in our test suite (which was why I added the lenient mode in the first place).

@annevk The correct behavior is to error out on invalid domain names. We don't do that currently for compatibility.

nponeccop · 2017-08-24T14:11:22Z

given that we delegate IDNA handling to ICU

Should I fill the bug to ICU then? Where is the place in the sources where you call ICU? The bug only happens in lenient mode. Does lenient mode implementation come from ICU too?

bnoordhuis · 2017-08-24T14:19:49Z

but now it throws because toASCII injects unicode into Host: field and header validator then throws

That's a good point and I take back what I said; I noticed that the ICU and non-ICU builds behave differently here as the latter uses punycode.toASCII():

> punycode.toASCII('xn--a--a')  // no exception
'xn--a--a'

> process.binding('icu').toASCII('xn--a--a')
Error: Cannot convert name to ASCII
    at repl:1:24

nponeccop · 2017-08-24T14:27:53Z

Your example also lacks lenient mode which is used in http.get/url.parse (but your point is still valid)

jasnell · 2017-08-24T14:40:50Z

I think it's also worth noting that this is a fundamental difference between url.parse() and new URL() (the former being the legacy non-standard way of handling URLs in Node.js and the latter being the WHATWG standard.

http.get() still uses the legacy url.parse() which suffers from this problem, while new URL() properly indicates that this is an invalid URL.

For instance:

> new url.URL('http://xn--a--a/testing')
TypeError: Invalid URL: http://xn--a--a/testing
    at Object.onParseError (internal/url.js:90:17)
    at parse (internal/url.js:99:11)
    at new URL (internal/url.js:193:5)
    at repl:1:1
    at ContextifyScript.Script.runInThisContext (vm.js:23:33)
    at REPLServer.defaultEval (repl.js:339:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.onLine (repl.js:536:10)
    at emitOne (events.js:101:20)

> url.parse('http://xn--a--a/test')
Url {
  protocol: 'http:',
  slashes: true,
  auth: null,
  host: 'xn--a--a�',
  port: null,
  hostname: 'xn--a--a�',
  hash: null,
  search: null,
  query: null,
  pathname: '/test',
  path: '/test',
  href: 'http://xn--a--a�/test' }
>

The code in http.get() does not attempt to do any actual validation of the data after url.parse() completes ... it simply takes the hostname and tries to set it as the value of the Host header and things blow up from there.

From my perspective this really isn't an issue with ICU or IDNA, it's an issue with the fact that the URL is not valid, the legacy url.parse() is not appropriately flagging it as such, and the http.get() code is not properly looking out for it.

In the long run, I would personally prefer http.get() to use new URL() rather than url.parse() and to be more strict with invalid URLs rather than less strict. I think the WHATWG URL wg made the right decision on how to handle these cases and that that decision should not be revisited.

But that's just my opinion.

nponeccop · 2017-08-24T14:54:53Z

Yes, this is a complex issue with many intertwining parts. For example, url.parse() is not merely misses the error. It also parses it incorrectly.

jasnell · 2018-08-12T06:28:28Z

Recommend closing this due to inactivity for a year. The WHATWG URL parser should work as an alternative.

jasnell added http Issues or PRs related to the http subsystem. url Issues and PRs related to the legacy built-in url module. feature request Issues that request new features to be added to Node.js. labels Aug 23, 2017

jasnell self-assigned this Aug 23, 2017

jasnell closed this as completed Aug 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Undocumented punycode validation exception #14994

Undocumented punycode validation exception #14994

nponeccop commented Aug 23, 2017 •

edited

Loading

jasnell commented Aug 23, 2017

jasnell commented Aug 23, 2017

nponeccop commented Aug 23, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

TimothyGu commented Aug 24, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

bnoordhuis commented Aug 24, 2017

annevk commented Aug 24, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

annevk commented Aug 24, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

annevk commented Aug 24, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

TimothyGu commented Aug 24, 2017

nponeccop commented Aug 24, 2017

bnoordhuis commented Aug 24, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

jasnell commented Aug 24, 2017

nponeccop commented Aug 24, 2017 •

edited

Loading

jasnell commented Aug 12, 2018

Undocumented punycode validation exception #14994

Undocumented punycode validation exception #14994

Comments

nponeccop commented Aug 23, 2017 • edited Loading

jasnell commented Aug 23, 2017

jasnell commented Aug 23, 2017

nponeccop commented Aug 23, 2017

nponeccop commented Aug 24, 2017 • edited Loading

TimothyGu commented Aug 24, 2017

nponeccop commented Aug 24, 2017 • edited Loading

bnoordhuis commented Aug 24, 2017

annevk commented Aug 24, 2017

nponeccop commented Aug 24, 2017 • edited Loading

annevk commented Aug 24, 2017

nponeccop commented Aug 24, 2017 • edited Loading

annevk commented Aug 24, 2017

nponeccop commented Aug 24, 2017 • edited Loading

TimothyGu commented Aug 24, 2017

nponeccop commented Aug 24, 2017

bnoordhuis commented Aug 24, 2017

nponeccop commented Aug 24, 2017 • edited Loading

jasnell commented Aug 24, 2017

nponeccop commented Aug 24, 2017 • edited Loading

jasnell commented Aug 12, 2018

nponeccop commented Aug 23, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading

nponeccop commented Aug 24, 2017 •

edited

Loading