Chardet sometimes fails and forces the wrong encoding #765
Comments
@josemariaruiz, do you have sample strings to reproduce the bug? |
I wanted to cancel a comment and closed the bug by mistake! Sorry. The string contains my personal data... can I send it to someone by email? I don't want to see it published everywhere :-/ |
@josemariaruiz, reduce it to the characters that are creating the problem and post the representation of the object here ( |
In the meantime, if you want a work-around, you can use In short, instead of:
r = requests.get('http://example.com/')
print r.text
use:
r = requests.get('http://example.com/')
print r.content.decode('utf-8') |
Lukasa, this is exactly what I'm doing now: use r.content and json.loads(). The string is my own name: "José María Ruiz" (this is like in http://xkcd.com/327/)
>>> repr(response.content)
'...."name":["Jos\\xc3\\xa9 Mar\\xc3\\xada Ruiz"]....' |
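A minimal sketch of the work-around discussed above, written for Python 3 and using only the standard library. The payload is a hypothetical reduced version of the response shown in the thread, not the real data. It decodes the raw bytes explicitly as UTF-8 before parsing, and contrasts that with what happens if the bytes are decoded as ISO-8859-2 instead.

```python
import json

# Hypothetical reduced payload, modelled on the repr() shown in the thread.
raw = b'{"name":["Jos\xc3\xa9 Mar\xc3\xada Ruiz"]}'

# Decode the bytes explicitly instead of relying on a guessed encoding.
good = json.loads(raw.decode("utf-8"))

# The same bytes also decode without error as ISO-8859-2, because every
# byte value is defined in that codec. The result is mojibake, not an exception.
bad = json.loads(raw.decode("iso-8859-2"))

print(good["name"][0])
print(bad["name"][0] == good["name"][0])
```

Because the wrong codec fails silently, the explicit decode is the safer choice whenever the encoding is known in advance.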
After installing the latest chardet version with pip, I ran this test (the webservice is firewalled, sorry):
$ curl "WEBSERVICE URL" | chardet
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 967 100 967 0 0 1968 0 --:--:-- --:--:-- --:--:-- 2571
<stdin>: ISO-8859-2 (confidence: 0.82)
If I dump the data to a file and use file:
$ file dump.txt
dump.txt: UTF-8 Unicode text, with very long lines, with no line terminators
The problem is related to the file and not to my name. The response that triggers the problem contains my name once; when the name appears twice, the detected encoding is utf-8. |
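One way to sidestep a statistical guess like the one above is to try a strict UTF-8 decode first and fall back only if it fails. The sketch below is an illustration of that idea, not what chardet or Requests does; the function name `decode_preferring_utf8` and the default fallback codec are assumptions made for the example.

```python
def decode_preferring_utf8(data, fallback="iso-8859-1"):
    """Decode bytes as UTF-8 when they are valid UTF-8, else use a fallback.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Only reached when the data cannot be UTF-8.
        return data.decode(fallback, errors="replace")


utf8_bytes = "José María Ruiz".encode("utf-8")
latin1_bytes = "José".encode("iso-8859-1")

print(decode_preferring_utf8(utf8_bytes))
print(decode_preferring_utf8(latin1_bytes))
```

Non-ASCII text in a single-byte encoding rarely happens to be valid UTF-8, so a successful strict decode is strong evidence, though not proof, that the data is UTF-8.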
Requests uses |
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
|
Hi Kenneth, this is a dump of the headers from Amazon WS CloudSearch (the service I'm requesting):
{'connection': 'keep-alive',
'content-length': '5344',
'content-type': 'application/json',
'date': 'Wed, 08 Aug 2012 08:39:36 GMT',
'server': 'Server'}
Does content-type begin with "text"? And the problem is not ISO-8859-1, but that the code then uses Chardet and assigns ISO-8859-2 as the encoding! |
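The header rule being discussed (an explicit charset wins; otherwise RFC 2616 gives text/* types a default of ISO-8859-1; anything else is undetermined) can be sketched with the standard library. This is an illustration under those assumptions, not the Requests implementation; `charset_from_headers` is a hypothetical name.

```python
from email.message import Message


def charset_from_headers(headers):
    """Return the charset implied by a Content-Type header, or None.

    Hypothetical helper: an explicit charset parameter wins, text/* falls
    back to the RFC 2616 default, and everything else is left undetermined.
    """
    content_type = headers.get("content-type")
    if content_type is None:
        return None
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return None


# Headers from the thread: no charset, and the type is not text/*.
print(charset_from_headers({"content-type": "application/json"}))
print(charset_from_headers({"content-type": "text/html"}))
print(charset_from_headers({"content-type": "application/json; charset=UTF-8"}))
```

For the headers in this report the function returns None, which is the situation where a client has to guess the encoding from the body.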
Ah, sorry I misread. Unfortunately, this detection cannot be improved at this time. Amazon should really be providing their charset in the headers. Luckily, you can actually set the value of encoding yourself to suit your needs. |
Note that JSON is always supposed to be encoded in one of the UTF encodings; UTF-8 is the default, but -16 and -32 are allowed too. The improvement thus is to not use I'd say: when no encoding has been set, use (Correction: json.loads only handles UTF-8 or Unicode objects). |
Useful test case; it does not specify an encoding, the contained JSON is correctly encoded as UTF-8, but chardet pegs it as Work-around: set |
0.14.2 was just released, which includes my proposed JSON UTF handling code from pull request #909. That release will now automatically detect what UTF encoding was used for a JSON response without an encoding set. |
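For readers curious how a UTF encoding can be detected for JSON without a charset, here is a sketch of the null-byte pattern check described in RFC 4627, section 3. It is not the code from the pull request mentioned above; the function name `guess_json_encoding` is an assumption, and the sketch relies on a JSON text starting with two ASCII characters.

```python
import codecs
import json


def guess_json_encoding(data):
    """Guess the UTF flavour of a JSON byte string (hypothetical helper)."""
    # Byte order marks first. UTF-32 must be checked before UTF-16 because
    # the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    head = data[:4]
    if len(head) < 4:
        return "utf-8"
    # The position of null bytes in the first four bytes reveals the encoding.
    nulls = [byte == 0 for byte in head]
    if nulls == [True, True, True, False]:
        return "utf-32-be"
    if nulls == [False, True, True, True]:
        return "utf-32-le"
    if nulls == [True, False, True, False]:
        return "utf-16-be"
    if nulls == [False, True, False, True]:
        return "utf-16-le"
    return "utf-8"


payload = {"name": ["José María Ruiz"]}
for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    raw = json.dumps(payload, ensure_ascii=False).encode(enc)
    guessed = guess_json_encoding(raw)
    print(enc, guessed, json.loads(raw.decode(guessed)) == payload)
```

The check needs no statistics at all, which is why it cannot mistake UTF-8 JSON for a single-byte encoding such as ISO-8859-2.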
✨ 🍰 ✨ |
I have a problem that I've traced back to the method/property "text" of Request. Chardet is returning the wrong encoding "ISO-8859-2" when the string is UTF-8! The result, as you can imagine, is nasty strings everywhere you have something that is not ASCII. This is happening in production :(
What can I do?
You are shipping chardet with requests :(