utf-8 page wrongly detected as ISO-8859-1 #5445

Closed
klartext opened this issue May 3, 2020 · 2 comments

Comments

@klartext

klartext commented May 3, 2020

A webpage has been detected as being ISO-8859-1 encoded, even though it is encoded in utf-8.

Expected Result

Correct classification as utf-8.

Actual Result

utf-8 page detected as ISO-8859-1.

Reproduction Steps

#!/usr/bin/python

import requests

# example url
url = "https://digitalezivilgesellschaft.org/"

# get the page and print the supposed encoding
response = requests.get(url)
print(response.encoding)

Compare that with

rm -f index.html; wget -nv https://digitalezivilgesellschaft.org/ 2>/dev/null && file index.html | grep index | tail -1

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.9"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.8.2"
  },
  "platform": {
    "release": "5.6.8-arch1-1",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.23.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.9"
  },
  "using_pyopenssl": false
}

This concrete problem seems to be related to the more general issue #2086.

@GoddessLuBoYan

You can get it like this:

response.content.decode("utf8")

not

response.text
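
To illustrate why decoding the bytes yourself helps, here is a minimal offline sketch. The `raw` variable is only a stand-in for `response.content`, and the sample sentence is made up for the example; it is not taken from the page in question.

```python
# Stand-in for response.content: the raw bytes of a UTF-8 encoded page.
# (Sample text is an assumption for illustration only.)
raw = "Zivilgesellschaft für alle".encode("utf-8")

# Roughly what reading response.text amounts to when the encoding
# has been assumed to be ISO-8859-1: every byte becomes one character.
as_latin1 = raw.decode("iso-8859-1")

# The workaround: decode the raw bytes explicitly as UTF-8.
as_utf8 = raw.decode("utf-8")

print(as_latin1)  # Zivilgesellschaft fÃ¼r alle
print(as_utf8)    # Zivilgesellschaft für alle
```

The two bytes that encode "ü" in UTF-8 are read as two separate ISO-8859-1 characters, which is the typical symptom of this misdetection.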

@nateprewitt
Member

Hi @klartext, if a webpage fails to specify its encoding, we'll default to ISO-8859-1, which has been the spec since RFC 2616. This is starting to change, but we're in an inconsistent state for now. We've kept the functionality for backwards compatibility and it still seems to be the correct case more often.
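
The fallback described above can be approximated with a small standard-library sketch. This is not requests' actual code; the function name `guess_encoding_from_content_type` is hypothetical, and the rule it implements (use the declared charset, otherwise ISO-8859-1 for text types) is a simplified reading of the behaviour described in this comment.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Approximate a header-based encoding guess (hypothetical helper).

    Returns the declared charset if there is one, "ISO-8859-1" for a
    text/* type without a charset, and None otherwise.
    """
    if not content_type:
        return None

    # Reuse the stdlib header parser instead of splitting on ";" by hand.
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if charset:
        return charset

    # No charset declared: fall back to ISO-8859-1 for text types only.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=UTF-8"))
```

Under this rule, a page served as `text/html` with no charset parameter is treated as ISO-8859-1 regardless of what its bytes actually contain.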

To resolve this issue you can either do as @GoddessLuBoYan suggests above, or set the encoding attribute on your Response object before calling text. This will do the decoding for you.
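
The second option might look like the sketch below. The helper name `text_with_encoding` is hypothetical; it only relies on the `encoding` and `text` attributes of a `requests.Response`, and the choice of "utf-8" as the default is an assumption that fits the page from this report.

```python
def text_with_encoding(response, encoding="utf-8"):
    """Return response.text decoded with an explicitly chosen encoding.

    `response` is expected to be a requests.Response (hypothetical helper;
    only the `encoding` and `text` attributes are used).
    """
    # Override the guessed encoding, then let .text do the decoding.
    response.encoding = encoding
    return response.text


# Usage (needs network access and the requests package):
#   import requests
#   response = requests.get("https://digitalezivilgesellschaft.org/")
#   html = text_with_encoding(response)
```

Compared with calling `response.content.decode(...)` directly, this keeps the rest of the code using `response.text` as usual.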

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2021