Encodings from content #1087

Closed
akavlie opened this issue Jan 5, 2013 · 6 comments

Comments

@akavlie

akavlie commented Jan 5, 2013

Requests has a get_encodings_from_content() function, but it doesn't seem to be used anywhere -- only get_encoding_from_headers() is used.

Any reason why? I'd think that on most pages, trying the meta tag encoding declaration first would produce better results. See The Verge, for one example (the meta tag declares the encoding as utf-8, but Requests detects it as ISO-8859-1).

@sigmavirus24
Contributor

At first I thought this was an issue with charade, but it seems you're right that this is taking place in requests proper. From what I can tell, The Verge might not be setting the encoding in a way requests expects to see it. Could you post the headers from your response for better debugging? (Alternatively, post the URL you used.) To satisfy my curiosity, could you also check apparent_encoding and see what it returns? (It is an attribute, not a method.)

@piotr-dobrogost

Related #156, especially this comment.

@akavlie
Author

akavlie commented Jan 8, 2013

@sigmavirus24 I noticed the issue with the title of this article:

http://www.theverge.com/2013/1/4/3836944/robot-band-compressorhead-plays-motorhead-ace-of-spades

A few days ago, it was reporting an encoding of iso-8859-1. I don't see the issue now, however. It would be a little weird if they just happened to fix it since I last looked... but I don't have a better explanation. Content-Type is now:

Content-Type: text/html; charset=utf-8
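
For reference, a minimal sketch of how a charset can be read out of a Content-Type value like the one above, using only the standard library. The helper name `charset_from_header` is hypothetical and is not part of Requests. The ISO-8859-1 fallback for `text/*` types with no charset parameter is an assumption about how a header-only lookup can end up reporting that encoding; it is not confirmed as the cause of what was seen here.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset named in a Content-Type value, or a fallback.

    Hypothetical helper for illustration; not Requests' own code.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset

    # Assumption: text/* with no declared charset is treated as
    # ISO-8859-1, the historical HTTP default for text types.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    return None


print(charset_from_header("text/html; charset=utf-8"))
print(charset_from_header("text/html"))
```

Under that assumption, a server that briefly omitted the charset parameter would be enough to change the reported encoding, without any change to the page's meta tag.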

@sigmavirus24
Contributor

Hm. I'll try to reproduce it later. The server could just have been
misbehaving, and they identified and fixed it. shrug

Thanks for following up though, it's very helpful

@kennethreitz
Contributor

We do not parse HTML. We did in the past. That's why the function exists, for those who feel like they need it.
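
For anyone who does want the encoding declared in the HTML, here is a rough sketch of doing that step by hand. It uses only the standard library, and `sniff_meta_charset` and `decode_body` are hypothetical helpers, not Requests functions. The regular expression is a deliberate simplification and will miss unusual markup.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(body, limit=4096):
    """Return the charset declared in a meta tag near the top of body, if any."""
    match = _META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii") if match else None


def decode_body(body, header_charset=None):
    """Decode body, preferring the meta tag over the header charset."""
    for candidate in (sniff_meta_charset(body), header_charset):
        if not candidate:
            continue
        try:
            codecs.lookup(candidate)  # skip names Python does not know
        except LookupError:
            continue
        return body.decode(candidate, errors="replace")
    # Nothing usable was declared: fall back to UTF-8 (an arbitrary choice).
    return body.decode("utf-8", errors="replace")


page = '<html><head><meta charset="utf-8"><title>Motörhead</title>'.encode("utf-8")
print(sniff_meta_charset(page))
print(decode_body(page, header_charset="ISO-8859-1"))
```

With Requests itself, the analogous move is to assign the sniffed value to `response.encoding` before reading `response.text`.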

@akavlie
Author

akavlie commented Jan 10, 2013

@kennethreitz understood after reviewing your comments in #156. A mention of the function in the docs would be useful.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021