Make sure content is checked when setting encoding #1589
Conversation
HTML pages which declared their charset encoding only in their content had the wrong encoding applied because the content was never checked. Currently only the response headers are checked for the charset encoding; if it is absent, the apparent_encoding heuristic is applied. But the W3.org doc says one should check the header first for a charset declaration, and if that's absent, check the meta tags in the content for a charset encoding declaration. It also says that if no charset encoding declaration is found, one should assume UTF-8, not ISO-8859-1 (a bad recommendation from the early days of the web).

This patch does the following:

* Removes the default ISO-8859-1 from get_encoding_from_headers(). It's wrong for two reasons: 1) get_encoding_from_headers() should return None if no encoding was found in the header, otherwise how can you tell it was absent from the headers? 2) It's the wrong default for the contemporary web. Also, because get_encoding_from_headers() always returned an encoding, any subsequent logic failed to execute.
* The Response text property now does a four-stage check for encoding, in this priority order: 1) encoding declared in the headers; 2) encoding declared in the content (selecting the highest-priority encoding if more than one encoding is declared in the content); 3) the apparent_encoding heuristic (includes BOM detection); 4) if none of the above, default to UTF-8.
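As a sketch of what that four-stage check could look like (a hypothetical helper, not the actual patch; `get_encoding_from_headers` and `get_encodings_from_content` are real `requests.utils` functions, here assumed to return `None`/an empty list when nothing is declared):

```python
from requests.utils import get_encoding_from_headers, get_encodings_from_content

def resolve_text_encoding(response):
    # 1) charset declared in the HTTP headers (assumes the patched
    #    get_encoding_from_headers(), which returns None when absent)
    encoding = get_encoding_from_headers(response.headers)
    if encoding:
        return encoding
    # 2) charset declared in the content itself (meta tags / XML declaration);
    #    decode permissively just to scan for declarations
    declared = get_encodings_from_content(response.content.decode('ascii', 'ignore'))
    if declared:
        return declared[0]
    # 3) the apparent_encoding heuristic (charade/chardet, BOM-aware)
    if response.apparent_encoding:
        return response.apparent_encoding
    # 4) the modern default
    return 'utf-8'
```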
Somehow I didn't get the issue and pull request linked together; the issue is #1588.
Thanks for this @jdennis! It's great work. Unfortunately, this pull request won't be accepted in its current form. This is for a few reasons:

* Requests is an HTTP library, not an HTML library, so it deliberately does not inspect response bodies for encoding declarations.
* The ISO-8859-1 default in get_encoding_from_headers() is intentional: it is the default the HTTP specification (RFC 2616) assigns to text/* media types when no charset is declared.

With that said, this is an excellent Pull Request and thank you so much for opening it. =) Unfortunately, it's just not in line with how we want the library to function. I'm sorry we can't accept it, and please do keep contributing! 🍰
Fair enough. How about if the content-type header is text/html then examine the content? The problem I'm trying to solve is that when I use requests and use the text attribute I'm getting garbage. Here's the scenario: the header has content-type=text/html, but there is no charset specifier. In the content is this:

```html
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```
Note the utf-8 specifier. But get_encoding_from_headers(), which is invoked in the text property, decides the encoding is ISO-8859-1 even though there is nothing to indicate that is the encoding. It then passes this bogus encoding to unicode and returns that as the text, which produces corrupted text. If requests is an HTTP-only library, then why is there a text property that operates on the raw content? That's inconsistent with that goal. It seems to me one of a few things should be going on here. If a user wants raw HTTP they should be using the content property, not the text property. If one uses the text property then the content should be checked to see if it's HTML (actually it should check for other text types as well). What it should NOT do is apply the wrong encoding. Make sense?
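As a concrete illustration of the corruption being described (not from the original thread): decoding UTF-8 bytes with ISO-8859-1 silently produces mojibake rather than an error.

```python
# a body whose meta tag declares utf-8, served without a charset in the headers
body = '<p>café</p>'.encode('utf-8')

print(body.decode('iso-8859-1'))  # '<p>cafÃ©</p>' -- the garbage described above
print(body.decode('utf-8'))       # '<p>café</p>'  -- what the server intended
```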
Oops ... I omitted the meta tag which is in the content, sorry. It is ...
Silly me, I guess this text box doesn't escape HTML; maybe this will show up: `meta http-equiv="Content-Type" content="text/html; charset=utf-8"`
@jdennis I edited your first comment to make the meta tag appear. =) So I'm totally sympathetic to your position here. I'll respond to a few of your points.
It's worth noting that if you are worried about having problems with HTML, and you know that's what you're fetching, you can use this flow:

```python
import requests

r = requests.get('https://lukasa.co.uk/')
if r.encoding == 'ISO-8859-1':
    # get_encodings_from_content() returns a list of declared encodings
    encodings = requests.utils.get_encodings_from_content(r.content)
    if encodings:
        r.encoding = encodings[0]
```

At that point, r.text will use the encoding declared in the content. With that said, I'm prepared to believe that we can make some useful extensions to the encodings flow. For instance, JSON should always be UTF-8, so we could special-case this logic to enforce that. Similarly, for specific MIME types (I'm thinking text/html and application/xml) we could check the content for a declared encoding. Does this sound like an acceptable compromise?
You make good points. Here are my thoughts.
A lot of these issues are covered in these two documents: http://www.w3.org/International/questions/qa-html-encoding-declarations/

I'm not sure that get_encodings_from_content is correct anyway. It's not taking into account the content-type, and the idea of N possible encodings for the entire content (and trying to iterate over them until one succeeds) is dubious. If you would like, I'll code up a possible solution that addresses the concerns above, but I don't want to invest the time unless you would be serious about incorporating it. Fair enough? Comments?
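For reference, `requests.utils.get_encodings_from_content` (as of requests 1.x/2.x) just runs a few regexes over the text and returns every match as a list, which is what the "N possible encodings" concern above refers to; it never consults the Content-Type header:

```python
from requests.utils import get_encodings_from_content

html = ('<meta charset="utf-8">'
        '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">')

# returns all declared candidates, e.g. ['utf-8', 'iso-8859-1'];
# the order reflects the regexes used, not any declaration priority
print(get_encodings_from_content(html))
```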
I actually need to backtrack on the JSON encoding: we have logic for it in Response.json() already.
My proposal is to add the following logic (in Python, but not directly related to any part of the Requests code):

```python
encoding = encoding_from_headers()
if encoding:
    return encoding

if ('text/html' in content_type) or ('application/xhtml+xml' in content_type) or ('application/xml' in content_type):
    encoding = encoding_from_content()
elif 'application/json' in content_type:
    encoding = 'utf-8'

if encoding:
    return encoding

return encoding_from_charade()
```

Does this seem like a sensible set of logic to you?

Final note: I can't guarantee that a pull request that I'm happy with will get incorporated. Requests is "one man one vote": Kenneth is the man, he has the vote. I'm already tempted to say that the entire discussion above is an overreach, and that Kenneth will believe that Requests simply should stop caring about encodings beyond what the headers provide. In fact, let's pose him that exact question (there's no way he has time to read the entire discussion above). I'll also get Ian's opinion.

BDFL Question:
Should Requests examine response content (e.g. HTML meta tags) for charset declarations when the headers don't declare one, or should it stay strictly header-based? Preferences? @sigmavirus24, I'd like your opinion too. =)
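Rendered against the real Response API, the proposal above might look like this (a sketch; `decide_encoding` is a hypothetical helper, and it assumes a header check that returns `None` when no charset is declared):

```python
from requests.utils import get_encoding_from_headers, get_encodings_from_content

MARKUP_TYPES = ('text/html', 'application/xhtml+xml', 'application/xml')

def decide_encoding(response):
    # hypothetical helper following the proposed priority order
    encoding = get_encoding_from_headers(response.headers)  # assumed None when absent
    if encoding:
        return encoding
    content_type = response.headers.get('Content-Type', '')
    if any(t in content_type for t in MARKUP_TYPES):
        # scan the body for declared encodings, markup types only
        declared = get_encodings_from_content(response.content.decode('ascii', 'ignore'))
        if declared:
            return declared[0]
    elif 'application/json' in content_type:
        return 'utf-8'
    # fall back to the charade/chardet heuristic
    return response.apparent_encoding
```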
All intentional design decisions. #2
Re point 1 in the previous comment: you're failing to respect the W3 rules I pointed to. If the header does not specify the encoding but the content does, then one uses the content encoding; however, if the header specifies an encoding, it takes precedence over any encoding specified in the content; if neither the header nor the content specifies an encoding, then use the default. If get_encoding_from_header() always returns a value regardless of whether an encoding is present, then how does one implement the precedence logic required by the W3 rules? How do you know you're supposed to use the content encoding as opposed to the header encoding? Does that make sense now?

Also, your proposal that all users of the Requests library should implement their own logic for correct encoding, as opposed to having the logic in the Requests library, does not seem very friendly and is the source of a lot of bugs. You're asking people to understand what has proven to be obscure logic that is often implemented incorrectly (in fact, most programmers usually punt on handling encoding because they don't understand it). If the boilerplate code you're asking others to paste into their code has a defect or limitation, they won't benefit from any upgrade to the Requests library. It seems to me this logic really belongs somewhere in the library so "things work as I expect without any extra effort".
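The precedence argument in miniature (an illustrative sketch, not Requests code): the W3C-style fallback chain is only expressible if the header parser reports absence as `None`.

```python
def pick_encoding(header_charset, content_charset):
    # header wins when present; a content declaration is next; UTF-8 is the last resort
    if header_charset is not None:
        return header_charset
    if content_charset is not None:
        return content_charset
    return 'utf-8'

# works only because the header parser reports absence as None
assert pick_encoding(None, 'utf-8') == 'utf-8'
assert pick_encoding('iso-8859-1', 'utf-8') == 'iso-8859-1'
assert pick_encoding(None, None) == 'utf-8'

# if the parser instead returned a hard-coded 'ISO-8859-1' default, the first
# argument would never be None and the content stage would be dead code
```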
@jdennis I am most definitely ignoring the W3C rules. =) That's because, as mentioned earlier, Requests is not a HTML library, it's a HTTP library. The W3C does not make the HTTP specs, the IETF do, and they have been very clear about what the correct behaviour is here. The same rationale applies regarding having this logic in the core library. The idea that things should 'work as I expect' (aka the principle of least surprise) is great, but only the things the library actually claims to do should work as you expect. Given that Requests is explicitly a HTTP library, not a HTML library, you should only assume that the HTTP behaviour of the library works as you expect it to. Requests' documentation is clear on what will happen regarding encodings. From the section on response content:

> When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text.
We don't claim to parse HTML to find the right encoding, and we don't even claim we'll get it right. We say we'll "make educated guesses based on the HTTP headers", and that's all we do. =) Finally, we're not asking all Requests users to implement this logic. We're asking the ones who need it to implement it. By and large, Requests does not help users with getting their data into a form that is useful to them. The only exception I can think of is Response.json().
You're letting your HTTP-only focus cloud your thinking. You can't implement any content-specific rules if the HTTP "content container" lies to you. If the HTTP container does not correctly report the container's metadata (e.g. header attributes), you can't use the HTTP container attributes to implement content-specific logic. get_encoding_from_header() should never supply a value that was not in the header. It's the caller of this routine which needs to interpret the absence of the encoding attribute and, if it so chooses, supply a default or take some other action. In the current implementation, supplying the default occurs in the wrong location. Before we go any further we have to at a minimum agree on this change, otherwise everything that follows collapses.
Does the container lie? If no encoding is specified in the Content-Type header, HTTP (RFC 2616) defines ISO-8859-1 as the default charset for text/* media types, so reporting it is not lying about the metadata. Additionally, the headers are available, unchanged, for the user to work with as they choose. I simply don't believe Requests is impeding this behaviour at all. =)
Maybe I have a misunderstanding of how the code works, but in adapters.py:build_response() it does this:

```python
response.encoding = get_encoding_from_headers(response.headers)
```

almost immediately after creating the Response object. Since get_encoding_from_headers() never returns None, Response.encoding will never be None. Or do I have a misunderstanding of the logic flow?
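For context, this is roughly what `Response.text` did at the time (paraphrased and heavily simplified from the requests source, not a verbatim copy): the chardet/charade fallback is only reachable when `self.encoding` is `None`, which is exactly why an always-present default matters.

```python
class Response:
    # paraphrased and heavily simplified from requests.models.Response
    def __init__(self, content, encoding=None, apparent_encoding='utf-8'):
        self.content = content                        # raw bytes from the wire
        self.encoding = encoding                      # set by build_response() from the headers
        self.apparent_encoding = apparent_encoding    # chardet/charade guess

    @property
    def text(self):
        encoding = self.encoding
        if encoding is None:
            # the heuristic fallback -- unreachable if build_response() always
            # sets an encoding because get_encoding_from_headers() never returns None
            encoding = self.apparent_encoding
        return str(self.content, encoding, errors='replace')
```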
No, you're right, I temporarily lost all grasp on sanity. (Though it's worth noting that ...) It seems to me the correct logic for users who need to do this, while following the spirit of the W3C guidelines, is to always check the HTML body for a meta charset declaration when the headers don't supply one.
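Putting that recommendation into user code might look like the following (a sketch under the stated assumptions; the helper is from `requests.utils`, the flow itself is illustrative):

```python
import requests
from requests.utils import get_encodings_from_content

r = requests.get('http://example.com/')

# only trust the header-derived encoding if the headers actually declared a
# charset; otherwise look for a declaration in the body, per the W3C guidance
if 'charset' not in r.headers.get('Content-Type', '').lower():
    declared = get_encodings_from_content(r.content.decode('ascii', 'ignore'))
    r.encoding = declared[0] if declared else 'utf-8'

print(r.text[:200])
```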