
Make sure content is checked when setting encoding #1589

Closed
jdennis wants to merge 1 commit

Conversation

@jdennis commented Sep 10, 2013

HTML pages which declared their charset encoding only in their content
had the wrong encoding applied because the content was never checked.

Currently only the request headers are checked for the charset
encoding; if absent, the apparent_encoding heuristic is applied. But
the W3.org doc says one should check the header first for a charset
declaration, then, if that's absent, check the meta tags in the content
for a charset encoding declaration. It also says that if no charset
encoding declaration is found one should assume UTF-8, not ISO-8859-1
(a bad recommendation from the early days of the web).

This patch does the following:

  • Removes the default ISO-8859-1 from get_encoding_from_headers();
    it's wrong for two reasons: 1) get_encoding_from_headers() should
    return None if no encoding was found in the header, otherwise how
    can you tell it was absent from the headers? 2) it's the wrong default
    for the contemporary web. Also, because get_encoding_from_headers()
    always returned an encoding, any subsequent logic failed to execute.

  • The Response text property now does a 4-stage check for encoding in
    this priority order (a sketch of this lookup follows the list):

    1. encoding declared in the headers
    2. encoding declared in the content (selects the highest-priority
       encoding if more than one encoding is declared in the content)
    3. apply the apparent_encoding heuristic (includes BOM detection)
    4. if none of the above, default to UTF-8
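
A minimal sketch of this four-stage lookup, assuming get_encoding_from_headers() is changed to return None when the header carries no charset (this patch's behaviour) and using the real requests.utils helpers:

from requests.utils import get_encoding_from_headers, get_encodings_from_content

def resolve_encoding(response):
    # 1. charset declared in the Content-Type header
    encoding = get_encoding_from_headers(response.headers)
    if encoding:
        return encoding
    # 2. charset declared in the content; the helper returns a list with
    #    the highest-priority declaration first, and it may be empty
    declared = get_encodings_from_content(
        response.content.decode('ascii', 'replace'))
    if declared:
        return declared[0]
    # 3. the apparent_encoding heuristic (charade/chardet, catches a BOM)
    if response.apparent_encoding:
        return response.apparent_encoding
    # 4. last resort: UTF-8, not ISO-8859-1
    return 'utf-8'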

@jdennis (Author) commented Sep 10, 2013

Somehow I didn't get the issue and pull request linked together; the issue is #1588.

@Lukasa (Member) commented Sep 10, 2013

Thanks for this @jdennis! It's great work.

Unfortunately, this pull request won't be accepted in its current form. This is for a few reasons:

  1. Requests used to check the content header of HTML pages for charsets in older versions. We explicitly removed this functionality because Requests is a HTTP library, not a HTML library. We should not assume that returned data will be HTML (often it won't be). This fact means that the notion of checking the content will not be accepted into Requests in any form, sadly.
  2. The ISO-8859-1 default applies not only to HTML but any form of text encoding delivered by HTTP where no other encoding is specified. This behaviour is given in RFC 2616 (aka the HTTP/1.1 spec), and we describe our compliance with it here.
    We implemented this because it was causing genuine confusion with older websites and web servers. Choosing UTF-8 as a default is just as arbitrary a choice as ISO-8859-1, with the added disadvantage of being directly against the relevant specification. Until RFC 2616 is superseded (it will be soon), changing this behaviour would be unwise.
    (It's worth noting that get_encoding_from_headers() does not always return an encoding. It will always return an encoding with correctly Content-Typed HTML, but there are plenty of other things accessed via HTTP.)

With that said, this is an excellent Pull Request and thank you so much for opening it. =) Unfortunately, it's just not in line with how we want the library to function. I'm sorry we can't accept it, and please do keep contributing! 🍰

@jdennis (Author) commented Sep 10, 2013

Fair enough. How about: if the content-type header is text/html, then examine the content?

The problem I'm trying to solve is that when I'm using requests and use the text attribute I'm getting garbage. Here's the scenario:

Header has content-type=text/html, but there is no charset specifier. In the content is this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Note the utf-8 specifier.

But get_encoding_from_headers(), which is invoked in the text property, decides the encoding is ISO-8859-1 even though there is nothing to indicate this is the encoding. It then passes this bogus encoding to unicode and returns that as the text, which produces corrupted text.
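
A minimal reproduction of that header-only lookup (the header value mirrors the scenario above; get_encoding_from_headers() is the real requests.utils helper):

from requests.utils import get_encoding_from_headers

headers = {'content-type': 'text/html'}    # no charset parameter
print(get_encoding_from_headers(headers))  # 'ISO-8859-1', the RFC 2616 default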

If requests is an HTTP-only library, then why is there a text property that operates on the raw content? That's inconsistent with that goal.

It seems to me one of a few things should be going on here. If a user wants raw HTTP they should be using the content property, not the text property. If one uses the text property then the content should be checked to see if it's HTML (actually it should check for other text types as well). What it should NOT do is apply the wrong encoding. Make sense?

@jdennis (Author) commented Sep 10, 2013

Oops ... I omitted the meta tag which is in the content, sorry. It is ...

@jdennis (Author) commented Sep 10, 2013

Silly me, I guess this text box doesn't escape HTML. Maybe this will show up:

meta http-equiv="Content-Type" content="text/html; charset=utf-8"

@Lukasa (Member) commented Sep 11, 2013

@jdennis I edited your first comment to make the meta tag appear. =)

So I'm totally sympathetic to your position here. I'll respond to a few of your points:

  1. If the Content-Type header is text/html, I think we'd be significantly more justified in searching for the meta tag. If you open a PR that does that it has a much better chance of being accepted! I'd love to see it. =)
  2. It's important to be clear on the role of the response.text property. response.text is not limited to working only with HTML. Instead, its purpose is for anything that has a character encoding. This can include HTML, but needn't: for instance, it applies to JSON and XML as well. For that reason, response.text has no right to interrogate the body. All it does is assume the encoding that the HTTP headers state, unless they do not state one.

It's worth noting that if you are worried about having problems with HTML, and you know that's what you're fetching, you can use this flow:

import requests

r = requests.get('https://lukasa.co.uk/')
if r.encoding == 'ISO-8859-1':
    r.encoding = requests.utils.get_encodings_from_content(r.content)

At that point, r.text will work correctly.

With that said, I'm prepared to believe that we can make some useful extensions to the encodings flow. For instance, JSON should always be UTF-8, so we could special-case this logic to enforce that. Similarly, for specific MIME types (I'm thinking text/html, application/xhtml+xml, application/xml and application/json at the moment) we could provide some special case logic. In the HTML, XHTML and XML cases, that could use get_encodings_from_content. We will still not set UTF-8 as the fallback default, as that continues to be unwise. =)

Does this sound like an acceptable compromise?

@jdennis (Author) commented Sep 11, 2013

You make good points. Here are my thoughts.

  1. I agree the text property should be aware of the various content-types which have text content. As you point out, different content-types have different rules for determining encoding, XML being a good example. I think the text property should make an attempt to interpret encoding specifiers based on the header content-type. FWIW I do recall looking at how encoding was handled with JSON and my recollection is there was something funky about it, but I forget the details.

  2. I continue to believe get_encoding_from_headers() should return None if no charset is specified. It's not correct to return information that's not present. The caller of get_encoding_from_headers() can make its own determination as to what to do if the header did not supply an encoding. JSON is a great example of this issue: with JSON, if there is no encoding specified you're supposed to assume UTF-8 (you argue for a different default for HTML, ISO-8859-1), so how can one function know the default to apply in a given context? (Unless you pass a default parameter, but it's still better to know whether the value was specified or not, instead of getting back a value and not knowing if it was specified or came from a default.)

  3. Your example of a workflow with the existing code is not correct. You have to know whether the header declared an encoding and whether the content declared an encoding; if they both declared one, the header value takes precedence. Also, get_encodings_from_content() returns a list which may be empty. (A corrected sketch follows below.)
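
A corrected version of that workflow, sketched against the requests 1.x/2.x utils API; it inspects the raw header because r.encoding is pre-filled with the ISO-8859-1 default for text/* responses:

import requests
from requests.utils import get_encodings_from_content

r = requests.get('https://lukasa.co.uk/')
# The header charset, when present, takes precedence over the content.
if 'charset' not in r.headers.get('content-type', '').lower():
    declared = get_encodings_from_content(r.content.decode('ascii', 'replace'))
    if declared:          # the helper returns a list, which may be empty
        r.encoding = declared[0]
text = r.text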

A lot of these issues are covered in these two documents.

http://www.w3.org/International/questions/qa-html-encoding-declarations/
http://www.w3.org/International/tutorials/tutorial-char-enc/

I'm not sure that get_encodings_from_content is correct anyway. It's not taking into account the content-type, and the idea of N possible encodings for the entire content (trying to iterate over them until one succeeds) is dubious.

If you would like, I'll code up a possible solution that addresses the concerns above, but I don't want to invest the time unless you would be serious about incorporating it. Fair enough? Comments?

@Lukasa (Member) commented Sep 11, 2013

I actually need to backtrack on the JSON encoding: we have logic for it in Response.json(). On balance, I think that belongs there unless we add more specific behaviour to Response.text, in which case it should be moved. The relevant funky details are that you can use any of the UTF- encodings, I think. Don't hold me to that though.

  1. You and I disagree on the definition of 'information that is present'. =) I find RFC 2616's statement to be clear and unambiguous: if you serve, for example, text/plain with no charset specified, the headers actually do say to use ISO-8859-1. This is because RFC 2616 defines what they say, and that definition clearly states that text/X means ISO-8859-1 unless explicitly overridden in the Content-Type header.

    We should note here that I'm not arguing that HTML has a default of ISO-8859-1, I'm saying that any MIME type beginning text/ should have that default. HTML is a notable example but it's far from the only one.

    As for what the default should be in a given context, that context is defined by the Content-Type header, which is what we're examining anyway. No big deal. =)

  2. Re: My sample workflow above. I believe it to be correct(-ish) within the parameters of the problem. The only time that it isn't correct is if the headers actually do specify ISO-8859-1. I didn't handle that case because I wrote the code in 10 seconds. =)

My proposal is to add the following logic (in Python, but not directly related to any part of the Requests code):

encoding = encoding_from_headers()
if encoding:
    return encoding

if ('text/html' in content_type) or ('application/xhtml+xml' in content_type) or ('application/xml' in content_type):
    encoding = encoding_from_content()
elif ('application/json' in content_type):
    encoding = 'utf-8'

if encoding:
    return encoding

return encoding_from_charade()
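
A runnable rendering of the same sketch; the helper names above are hypothetical, mapped here onto the real requests.utils functions, and it assumes get_encoding_from_headers() is changed to return None when no charset is present:

from requests.utils import get_encoding_from_headers, get_encodings_from_content

def choose_encoding(response):
    content_type = response.headers.get('content-type', '')

    # A charset declared in the headers wins outright.
    encoding = get_encoding_from_headers(response.headers)
    if encoding:
        return encoding

    # Markup types may declare their encoding in the body.
    if any(t in content_type for t in
           ('text/html', 'application/xhtml+xml', 'application/xml')):
        declared = get_encodings_from_content(
            response.content.decode('ascii', 'replace'))
        if declared:
            encoding = declared[0]
    elif 'application/json' in content_type:
        encoding = 'utf-8'    # JSON is always one of the UTF encodings

    if encoding:
        return encoding

    # Fall back to the charade/chardet heuristic.
    return response.apparent_encoding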

Does this seem like a sensible set of logic to you?

Final note: I can't guarantee that a pull request that I'm happy with will get incorporated. Requests is "one man one vote": Kenneth is the man, he has the vote. I'm already tempted to say that the entire discussion above is an overreach, and that Kenneth will believe that Requests simply should stop caring about Content-Type beyond whether it has a charset value (that is, removing the ISO-8859-1 default and not replacing it with anything else).

In fact, let's pose him that exact question (there's no way he has time to read the entire discussion above). I'll also get Ian's opinion:

BDFL Question:
@kennethreitz: Currently Requests will use ISO-8859-1 as the default encoding for anything with Content-Type set to text/<something> and without a charset declaration (as specified by RFC 2616). This can cause problems with HTML that uses <meta> tags to declare a non-ISO-8859-1 encoding. We can do one of three things:

  1. Be smarter. If the Content-Type is text/html or one of a similar set of families, we can search for a relevant <meta> tag.
  2. Stay the same.
  3. Remove the ISO-8859-1 default and stop pretending we know about Content-Types at all.

Preferences? @sigmavirus24, I'd like your opinion too. =)

@kennethreitz (Contributor) commented

All intentional design decisions. #2

@jdennis (Author) commented Sep 11, 2013

Re point 1 in the previous comment: you're failing to respect the W3C rules I pointed to. If the header does not specify the encoding but the content does, then one uses the content encoding; however, if the header specifies an encoding, it takes precedence over any encoding specified in the content; if neither the header nor the content specifies an encoding, then use the default. If get_encoding_from_headers() always returns a value regardless of whether one is present, then how does one implement the precedence logic required by the W3C rules? How do you know you're supposed to use the content encoding as opposed to the header encoding? Does that make sense now?

Also, with regard to your proposal that all users of the Requests library should implement their own logic to handle encoding correctly, rather than having the logic in the Requests library: that does not seem very friendly, and it will be the source of a lot of bugs. You're asking people to understand what has proven to be obscure logic that is often implemented incorrectly (in fact most programmers usually punt on handling encoding because they don't understand it). If the boilerplate code you're asking others to paste into their code has a defect or limitation, they won't benefit from any upgrade to the Requests library. It seems to me this logic really belongs somewhere in the library so that "things work as I expect without any extra effort".

@Lukasa (Member) commented Sep 11, 2013

@jdennis I am most definitely ignoring the W3C rules. =) That's because, as mentioned earlier, Requests is not a HTML library, it's a HTTP library. The W3C does not make the HTTP specs, the IETF do, and they have been very clear about what the correct behaviour is here.

The same rationale applies regarding having this logic in the core library. The idea that things should 'work as I expect' (aka the principle of least surprise) is great, but only the things the library actually claims to do should work as you expect. Given that Requests is explicitly a HTTP library, not a HTML library, you should only assume that the HTTP behaviour of the library works as you expect it to. Requests' documentation is clear on what will happen regarding encodings. From the section on response content:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property...

We don't claim to parse HTML to find the right encoding, and we don't even claim we'll get it right. We say we'll "make educated guesses based on the HTTP headers", and that's all we do. =)

Finally, we're not asking all Requests users to implement this logic. We're asking the ones who need it to implement it. By and large Requests does not help users with getting their data into a form that is useful to them. The only exception I can think of is Response.json(), and we only add that because it's almost no work for us and overwhelmingly the most common use-case for Requests. =)

@jdennis (Author) commented Sep 11, 2013

You're letting your focus on HTTP alone cloud your thinking. You can't implement any content-specific rules if the HTTP "content container" lies to you. If the HTTP container does not correctly report the container's metadata (e.g. header attributes), you can't use the HTTP container attributes to implement content-specific logic. get_encoding_from_headers() should never supply a value that was not in the header. It's the caller of this routine which needs to interpret the absence of the encoding attribute and, if it so chooses, supply a default or take some other action. In the current implementation, supplying the default is occurring in the wrong location.

Before we go any further we have to at a minimum agree on this change, otherwise everything that follows collapses.

@Lukasa (Member) commented Sep 11, 2013

Does the container lie? If no encoding is specified in Content-Type then Response.encoding is None. You can test for that condition before reading the content, and then do whatever you please.

Additionally, the headers are available, unchanged, for the user to work with as they choose. I simply don't believe Requests is impeding this behaviour at all. =)

@jdennis (Author) commented Sep 12, 2013

Maybe I have a misunderstanding of how the code works, but in adapters.py:build_response() it does this:

response.encoding = get_encoding_from_headers(response.headers)

almost immediately after creating the Response object.

Since get_encoding_from_headers() never returns None, Response.encoding will never be None.

Or do I have a misunderstanding of the logic flow?

@Lukasa (Member) commented Sep 12, 2013

No, you're right, I temporarily lost all grasp on sanity. (Though it's worth noting that get_encoding_from_headers can return None, just not in the correctly Content-Type'd HTML case).

It seems to me the correct logic for users who need to do this, while following the spirit of the W3C guidelines, is to always check the HTML body for a meta tag. If one is present, even if that disagrees with the headers, that's what I'd follow (it's more likely to be right). Only if that's not present would I concern myself with what's in the HTTP headers. (At least, that's what I'd do if I were writing a HTML library.)
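
A sketch of that "meta tag first" flow, written against the requests 1.x/2.x utils API (hedged; it is what a HTML-aware caller would do, not what Requests itself does):

import requests
from requests.utils import get_encodings_from_content

r = requests.get('https://lukasa.co.uk/')
declared = get_encodings_from_content(r.content.decode('ascii', 'replace'))
if declared:
    r.encoding = declared[0]   # trust the body's declaration, even over the headers
# otherwise keep whatever the headers (or the ISO-8859-1 default) supplied
text = r.text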

Locked as resolved and limited conversation to collaborators on Sep 8, 2021.