[DETECTION] Finnish in UTF-8 detected as Latin-1 when mistaken html meta element present #537

jkseppan · 2024-10-02T05:45:55Z

Notice
I hereby announce that my raw input is not:

Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.

https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html

(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)

Verbose output

2024-10-02 08:40:59,849 | Level 5 | Detected declarative mark in sequence. Priority +1 given for latin_1.
2024-10-02 08:40:59,852 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.533000 %
2024-10-02 08:40:59,852 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-10-02 08:40:59,857 | Level 5 | We detected language [('English', 0.656), ('Hungarian', 0.5849), ('French', 0.578), ('Spanish', 0.5486), ('Norwegian', 0.5294), ('Dutch', 0.5243), ('Finnish', 0.5221), ('Indonesian', 0.5191), ('Italian', 0.5174), ('Estonian', 0.5152), ('Danish', 0.5047), ('Swedish', 0.4706), ('Slovene', 0.4669), ('Croatian', 0.4662), ('Portuguese', 0.4648), ('Czech', 0.4546), ('Romanian', 0.4492), ('German', 0.4409), ('Slovak', 0.4296), ('Turkish', 0.4224), ('Polish', 0.3995), ('Lithuanian', 0.3933), ('Vietnamese', 0.3714)] using latin_1
2024-10-02 08:40:59,857 | DEBUG | Encoding detection: latin_1 is most likely the one.
{
    "path": "/tmp/finnish-utf-8-latin-1-confusion.html",
    "encoding": "latin_1",
    "encoding_aliases": [
        "8859",
        "cp819",
        "csisolatin1",
        "ibm819",
        "iso8859",
        "iso8859_1",
        "iso_8859_1",
        "iso_8859_1_1987",
        "iso_ir_100",
        "l1",
        "latin",
        "latin1"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.533,
    "coherence": 65.6,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

This should be UTF-8. One clue is that the output includes the word PÃ¤Ã¤tÃ¶sehdotus which is a mangled version of Päätösehdotus.

Most nontrivial Finnish text will include several instances of the character ä and possibly ö. Upper-case versions Ä and Ö are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, these become

ä → \xc3\xa4 → Ã¤
ö → \xc3\xb6 → Ã¶
Ä → \xc3\x84 → Ã and a control character (Latin-1), or Ã„ (Windows-1252)
Ö → \xc3\x96 → Ã and a control character (Latin-1), or Ã– (Windows-1252)

The characters Ã¤¶„ do not appear in normal Finnish text. Ã could possibly appear in foreign names, but would even then seem to be very unlikely in the middle of a word. ¤ is an obscure "currency sign" character, whose codepoint Latin-9 aka ISO-8859-15 reassigned to the euro sign, which does occur in Finnish text but would still be very unlikely in the combination Ã€. (The pilcrow might appear in some typography text and the lowered quote might appear in old-fashioned literature. The en dash is normal.)
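The mapping above can be reproduced with nothing but the standard library. The following is a minimal sketch; the helper names `mojibake` and `repair` are made up for illustration and are not part of charset_normalizer.

```python
def mojibake(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str) -> str:
    """Undo mojibake() by reversing the two steps."""
    return garbled.encode(wrong_codec).decode("utf-8")


# Lower-case letters give the same result under Latin-1 and Windows-1252.
assert mojibake("ä", "latin-1") == mojibake("ä", "cp1252") == "Ã¤"
assert mojibake("ö", "latin-1") == mojibake("ö", "cp1252") == "Ã¶"

# Upper-case letters differ: the second byte (0x84, 0x96) is a C1 control
# character in Latin-1 but a printable character in Windows-1252.
assert mojibake("Ä", "latin-1") == "Ã\x84"
assert mojibake("Ä", "cp1252") == "Ã„"
assert mojibake("Ö", "latin-1") == "Ã\x96"
assert mojibake("Ö", "cp1252") == "Ã–"

# The word from the detector output round-trips back to the expected text.
assert mojibake("Päätösehdotus", "latin-1") == "PÃ¤Ã¤tÃ¶sehdotus"
assert repair("PÃ¤Ã¤tÃ¶sehdotus", "latin-1") == "Päätösehdotus"
```

Because the round trip is lossless for Latin-1, seeing Ã followed by ¤ or ¶ inside a word is a strong hint that the bytes were UTF-8 all along.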

Desktop (please complete the following information):

OS: MacOS 14.7
Python version 3.12.6
Package version 3.3.2

Additional context

My guess is that this kind of thing happens when someone set up a CMS in the 1990s when Finnish text was commonly encoded in Latin-1 or Windows-1252, and later the data store was changed to use UTF-8 but the meta tags were neglected.
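For pages like this, one possible caller-side workaround is to trust a meta declaration only when the bytes are not valid non-ASCII UTF-8. The sketch below is an assumption about how one might do that with the standard library; it is not how charset_normalizer works internally, and the function names, the regular expression, and the sample page are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical, deliberately loose pattern for a charset in a <meta> element.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE
)


def declared_charset(payload: bytes) -> Optional[str]:
    """Return the charset named in a <meta> element near the top, if any."""
    match = META_CHARSET.search(payload[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def looks_like_utf8(payload: bytes) -> bool:
    """True if payload has non-ASCII bytes and decodes strictly as UTF-8."""
    if payload.isascii():
        return False  # pure ASCII is no evidence either way
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pick_encoding(payload: bytes) -> Optional[str]:
    """Prefer UTF-8 when the bytes prove it; otherwise use the declaration."""
    if looks_like_utf8(payload):
        return "utf-8"
    return declared_charset(payload)


# Hypothetical sample: a stale Latin-1 declaration around UTF-8 content.
head = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
stale_page = head + "Päätösehdotus".encode("utf-8")
real_latin1_page = head + "Päätösehdotus".encode("latin-1")

assert pick_encoding(stale_page) == "utf-8"
assert pick_encoding(real_latin1_page) == "iso-8859-1"
```

The design choice here is that text genuinely encoded in Latin-1 with characters such as ä almost never forms valid UTF-8 byte sequences, so a successful strict UTF-8 decode of non-ASCII content is stronger evidence than a declaration that may have been left behind by a migration.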


Ousret · 2024-10-02T07:30:20Z

This case has been fixed in #538
Will be available in the next release.

##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added
- Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
- Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed
- Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
- Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
- Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))


jkseppan added the labels "detection" (Related to the charset detection mechanism, chaos/mess/coherence) and "help wanted" (Extra attention is needed) Oct 2, 2024

Ousret mentioned this issue Oct 2, 2024

🔧 improve detector based on case 537 #538

Merged

Ousret closed this as completed Oct 2, 2024

Ousret mentioned this issue Oct 8, 2024

🔖 Release 3.4.0 #545

Merged
