Long story short
aiohttp's client handles IDNA hostnames in a way that seems inconsistent: the Host header always contains the decoded UTF-8 value, which seems problematic.
For instance:
session.get("http://éé.com/") makes a request with Host: éé.com
session.get("http://xn--9caa.com/") also makes a request with Host: éé.com
While it's unclear to me whether a Unicode hostname should always be IDNA encoded (see below), it should at least not be decoded when explicitly encoded by the caller.
IDNA or not?
The newest HTTP/1 RFCs don't specify the encoding of the headers, but recommend handling them as US-ASCII characters only, for security reasons (see: https://tools.ietf.org/html/rfc7230#section-3, especially the last paragraph of 3.2.4).
Most of the resources I read from the W3C or the IETF (normative or not) say that the hostname should always be encoded. For instance, https://www.w3.org/International/articles/idn-and-iri/#resolvedomain says:
"Finally the user agent sends the request for the page. Since punycode contains no characters outside those normally allowed for protocols such as HTTP, there is no issue with the transmission of the address. This should simply match against a registered domain name."
Browsers I tested (Firefox, Chromium) always encode the hostname in IDNA.
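To make the two spellings concrete, here is a minimal sketch using only Python's built-in idna codec. It is an illustration rather than what aiohttp or the browsers do internally: the codec implements the older IDNA 2003 rules, and the variable names are mine.

```python
# Both spellings from this report refer to the same domain name.
unicode_host = "éé.com"

# Unicode -> ASCII-compatible encoding (the "xn--" punycode form).
ace_host = unicode_host.encode("idna").decode("ascii")
print(ace_host)

# ASCII-compatible encoding -> Unicode, i.e. the decoding that currently
# ends up in the Host header.
round_trip = ace_host.encode("ascii").decode("idna")
print(round_trip)
```

The encoded form contains only ASCII characters, which is why it can be sent in a header without any question about header encoding.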
I ran some tests on a random hostname with Unicode characters served by nginx. Nginx doesn't care about the encoding and applies the virtual host rules by matching the exact string, i.e. with xn--9caa.com I see the right website, while éé.com returns a 404, probably because only the IDNA-encoded version is specified in the configuration.
Expected behaviour
session.get("http://xn--9caa.com/") must make a request with Host: xn--9caa.com (encoded host).
session.get("http://éé.com/") should make a request with Host: xn--9caa.com (encoded host).
Actual behaviour
session.get("http://xn--9caa.com/") makes a request with a decoded host: Host: éé.com (UTF-8 encoded host).
session.get("http://éé.com/") makes a request with Host: éé.com too.
Suggested fix
It seems that self.url.raw_host should be used rather than self.url.host in ClientRequest: https://github.com/KeepSafe/aiohttp/blob/master/aiohttp/client_reqrep.py#L168
(According to my quick test, yarl.URL.raw_host always returns the IDNA-encoded version, regardless of the encoding of the input URL.)
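For reference, the behaviour I would expect can be sketched with the standard library alone. This is not aiohttp's or yarl's code: ascii_host_header is a hypothetical helper written for this report, it relies on Python's IDNA 2003 codec, and it does not handle IPv6 literals.

```python
from urllib.parse import urlsplit


def ascii_host_header(url: str) -> str:
    """Hypothetical helper: return an ASCII-only Host header value for url."""
    parts = urlsplit(url)
    host = parts.hostname  # lowercased, without port or userinfo
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    # Labels that are already ASCII (including xn-- labels) pass through
    # unchanged; non-ASCII labels are converted to their xn-- form.
    encoded = host.encode("idna").decode("ascii")
    # Keep an explicit port, as the Host header would.
    return encoded if parts.port is None else f"{encoded}:{parts.port}"


for url in ("http://éé.com/", "http://xn--9caa.com/"):
    print(url, "->", "Host:", ascii_host_header(url))
```

With this approach both URLs give the same encoded Host value, and a hostname the caller already encoded is never decoded.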