bytecode string returned when page has charset=UTF-8 #147

j0hnsmith · 2011-08-31T16:14:02Z

I had a situation (with both requests and urllib2) where a page that had <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> was being returned as a bytestring (<type 'str'>) even though it contained non-ASCII characters (due to a server misconfiguration, I assume). So when I tried to use it I got the classic UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 419: ordinal not in range(128)

Is this something that requests could fix? Is this something that requests would want to fix?
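For readers who want to reproduce the symptom, here is a minimal sketch in Python 3. The original report is from Python 2, where the ASCII decode happened implicitly when a str was mixed with unicode; the explicit decode below is an assumption made so the failure is visible on a current interpreter, and the sample text is invented for illustration.

```python
# Illustrative bytes only: UTF-8 encoded text, as a server would send it.
raw = "caf\u00e9".encode("utf-8")  # the e-acute becomes the bytes 0xc3 0xa9

try:
    # Python 2 did this implicitly when a byte str met a unicode object.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the charset the page declares works as expected.
text = raw.decode("utf-8")
```

The 0xc3 byte named in the error is the lead byte of a two-byte UTF-8 sequence, which suggests the body really was UTF-8 and simply had not been decoded yet.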


kennethreitz · 2011-09-14T12:47:51Z

As of the upcoming release, Requests only attempts to decode content using the charset specified in the HTTP headers.

However, there is a utility function that will attempt to decode based on the HTML tags. If the content isn't actually in the specified encoding, though, there's nothing that can be done (aside from ignoring the invalid characters).
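The order described above (header charset first, then the charset in the HTML tags, then giving up on the invalid bytes) can be sketched with the standard library alone. This is not the Requests utility function, which the thread does not name; `sniff_meta_charset` and `decode_body` are hypothetical helpers, and the regular expression is a rough assumption about how the meta tag is written.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta> tag.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(body):
    """Return the charset named in a <meta> tag, or None if there is none."""
    match = META_CHARSET_RE.search(body)
    return match.group(1).decode("ascii") if match else None


def decode_body(body, header_charset=None, fallback="utf-8"):
    """Decode bytes using the header charset, then the meta tag, then a fallback."""
    encoding = header_charset or sniff_meta_charset(body) or fallback
    try:
        return body.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The content is not in the declared encoding (or the name is unknown):
        # substitute U+FFFD for the invalid bytes instead of raising.
        return body.decode(fallback, errors="replace")


page = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />caf\xc3\xa9'
decoded = decode_body(page)

# A page that declares UTF-8 but contains a stray Latin-1 byte.
broken = b'<meta charset="utf-8">caf\xe9'
patched = decode_body(broken)
```

Passing `errors="ignore"` instead of `errors="replace"` would drop the invalid bytes silently, which matches the "ignoring the invalid characters" option mentioned above but hides the fact that data was lost.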

kennethreitz closed this as completed Sep 14, 2011

itsadok mentioned this issue Nov 14, 2013

[Suggestion] Simplify charset handling #1737

Open

github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021
