utf-8 page wrongly detected as ISO-8859-1 #5445

klartext · 2020-05-03T12:21:41Z

A webpage has been detected as being ISO-8859-1 encoded, even though it is encoded in utf-8.

Expected Result

Correct classification as utf-8.

Actual Result

utf-8 page detected as ISO-8859-1.

Reproduction Steps

#!/usr/bin/python

import requests

# example url
url = "https://digitalezivilgesellschaft.org/"

# get the page and print the supposed encoding
response = requests.get(url)
print(response.encoding)

Compare that with

rm -f index.html; wget -nv https://digitalezivilgesellschaft.org/ 2>/dev/null && file index.html | grep index | tail -1

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.9"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.8.2"
  },
  "platform": {
    "release": "5.6.8-arch1-1",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.23.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.9"
  },
  "using_pyopenssl": false
}

This concrete problem seems to be related to the more general issue #2086.

GoddessLuBoYan · 2020-06-29T23:06:54Z

You can get it like this:

response.content.decode("utf8")

not

response.text
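To illustrate why the two calls differ, here is a minimal sketch that uses only the standard library and no network access. The sample string is made up for illustration; it stands in for the bytes a server would send for a UTF-8 page.

```python
# Sample text (an assumption, not taken from the page in question),
# encoded the way a UTF-8 page would be sent over the wire.
raw = "für".encode("utf-8")

# Decoding with the ISO-8859-1 default, which is what response.text
# ends up doing when no charset is declared: each UTF-8 byte becomes
# its own character, producing mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding explicitly as UTF-8, like response.content.decode("utf8").
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))

# Already-garbled text can be repaired by reversing the wrong step.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
```

Every byte value is valid in ISO-8859-1, so the wrong decode never raises an error; it silently returns garbled text, which is why the problem is easy to miss.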

nateprewitt · 2020-07-21T06:06:03Z

Hi @klartext, if a response has a text/* Content-Type but doesn't specify a charset, we'll default to ISO-8859-1, which was the default in the HTTP/1.1 spec (RFC 2616). RFC 7231 dropped that default, so this is starting to change, but we're in an inconsistent state for now. We've kept the behavior for backwards compatibility, and it still seems to be the correct choice more often.

To resolve this issue you can either do as @GoddessLuBoYan suggests above, or set the encoding attribute on your Response object before calling text. This will do the decoding for you.
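A minimal sketch of the second option follows. The helper name `text_with_fallback` and its `fallback` parameter are hypothetical and not part of requests. It only relies on the `headers`, `encoding`, and `text` attributes of a Response object, and it assumes that UTF-8 is the right guess whenever the server declares no charset.

```python
def text_with_fallback(response, fallback="utf-8"):
    """Return response.text, forcing `fallback` if no charset was declared.

    `response` is expected to behave like a requests Response object.
    This helper is a hypothetical illustration, not a requests API.
    """
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower():
        # The server declared no charset, so override the ISO-8859-1
        # default before .text performs the decoding.
        response.encoding = fallback
    return response.text


# Usage (performs a network request, so it is left commented out):
#   import requests
#   response = requests.get("https://digitalezivilgesellschaft.org/")
#   print(text_with_fallback(response))
```

Checking the header first keeps an explicitly declared charset intact, so pages that really are ISO-8859-1 and say so are still decoded correctly.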

nateprewitt closed this as completed Jul 21, 2020

github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2021
