Zimit2: another encoding problem with solidarité-numérique #221

benoit74 · 2024-03-26T14:21:41Z

@Jaifroid reported a potential encoding issue in the solidarité-numérique test ZIM: #218 (comment) (although a new issue would have been way better ^^)

ZIM is at https://tmp.kiwix.org/ci/test-warc/solidarite-numerique_2024-03-18.zim

I don't know where the "Questions / réponses" is, so I used another example at https://www.solidarite-numerique.fr/thematiques/demarches-administratives/ and I've focused on the é character below.

I've downloaded the HTML file both from the ZIM and from the live site, and I've found the same content:

- The HTML headers say the file is encoded in UTF-8.
- The é character is encoded as C3A9 (see below), which is the correct value for UTF-8 (see https://www.fileformat.info/info/unicode/char/00e9/index.htm).
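
A quick way to double-check that byte value, as a minimal sketch using only the Python standard library (the variable names are illustrative and not taken from the scraper):

```python
# U+00E9 (é) should encode to the two bytes C3 A9 in UTF-8.
e_acute = "\u00e9"
utf8_bytes = e_acute.encode("utf-8")
utf8_hex = utf8_bytes.hex().upper()

# For comparison, the same character is the single byte E9 in Latin-1.
latin1_hex = e_acute.encode("latin-1").hex().upper()

print(utf8_hex, latin1_hex)
```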

I've tested the ZIM file in kiwix-serve and I did not face any encoding problem.

I don't get what's going on, I will continue testing on my machine with Kiwix JS and Kiwix PWA.


benoit74 · 2024-03-26T14:25:01Z

I don't face the issue in Kiwix JS:

Nor in Kiwix PWA (3.0.1)

Jaifroid · 2024-03-26T14:38:26Z

OK, let me check if it's just some pages affected, and I'll try to reproduce.

Jaifroid · 2024-03-26T14:46:34Z

Open the ZIM, and on the landing page, click "Lire la suite" under the Comprendre les cookies article. This is what I see in Kiwix Desktop:

And same in PWA (v3.1.5) - NB, you will need to let it update to be able to click on Lire la suite:

Also same in Kiwix JS browser extension, except I haven't fixed that yet so it's not possible to open the page by clicking on "lire la suite", instead you have to search for the page "Comprendre les cookies".

Jaifroid · 2024-03-26T14:53:44Z

I also see it in the article under "Qu'est-ce que le PIX...", but NOT in the article "Mettre à jour Windows 10..." (I get to these from "Lire la suite", as the tiles can't yet be clicked on due to #209).

It's possible that this issue only affects older articles, made before the site introduced UTF-8 as standard? Quite a few of the other articles I have tested are fine.

The affected articles are displayed correctly on the original Web site, e.g. https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/?thematique=internet . So it does appear to be a scraping issue (if you can reproduce it).

Jaifroid · 2024-03-26T15:21:41Z

I now think it's more likely to be a problem with the readers "assuming" (correctly, according to OpenZIM spec) that all content in the ZIM is UTF-8 encoded. In fact, readers should recognize content encoded with another code page, and decode accordingly in the case of Zimit archives. But that's a bit of a challenge to implement in readers for what is "probably" an edge case? I don't know if, say, Chinese sites might regularly use different code pages. I know they often use UTF-3-byte, which has been an issue elsewhere (kiwix/kiwix-android#3587).

benoit74 · 2024-03-26T16:58:53Z

Thank you, I wasn't lucky enough to click on the right link then!

Content inside the ZIM is broken, so not at all a reader problem (and all readers are impacted obviously):

Definitely a problem to solve.

benoit74 · 2024-05-14T11:47:33Z

I focused on https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ page

Problem is that content is declared as UTF-8 in content-type header.

chardet also detects it as most probably utf-8 encoded (with a 99% confidence).

And it is indeed mostly UTF-8 encoded (or at least it contains a lot of UTF-8 chars).

Unfortunately, there is a byte in the document that is not valid UTF-8:

0xC3 at position 43050 (0xA82A)

chardet then suggests it might be MacRoman encoded (with a 71% confidence). And this decoding works ok, so we use this decoding for the whole file. And this breaks everything since the real encoding is UTF-8, only this character was badly encoded.
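
To illustrate both observations, here is a minimal sketch using only the standard library. The sample bytes are made up (they are not taken from the actual page) and `first_invalid_utf8_offset` is a hypothetical helper name. It locates a stray byte the way a strict UTF-8 decode reports it, then shows that a MacRoman decode of the same bytes raises no error but mangles the accented characters.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks strict UTF-8 decoding."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Made-up sample: valid UTF-8 text with one stray 0xC3 lead byte in the middle.
sample = "paramétrer ".encode("utf-8") + b"\xc3" + " les cookies".encode("utf-8")

offset = first_invalid_utf8_offset(sample)

# mac_roman is a single-byte codec, so it decodes these bytes without error,
# but each two-byte UTF-8 sequence becomes two unrelated characters.
as_mac_roman = sample.decode("mac_roman")

print(offset, repr(as_mac_roman))
```

Because the single-byte decode always "works", a strategy that keeps the first encoding that does not raise will accept it for the whole file.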

As a refresher, the current logic is: we first try to decode the content with the encoding declared in the HTTP content-type header (here text/html; charset=UTF-8); if that fails, we feed the first 1024 bytes to chardet. chardet returns all potential encodings, and we try them one by one (most probable first) until one succeeds. The first encoding that succeeds is used.

This approach is probably too naive / doomed to fail as soon as there is a bad character.
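
The logic described above can be sketched roughly as follows. This is not the actual scraper code: `decode_first_success` is a hypothetical name, and the `guesses` argument stands in for the ranked list of encodings that chardet would return for the first 1024 bytes, so that the sketch runs with the standard library alone.

```python
from typing import Iterable, Optional, Tuple


def decode_first_success(
    data: bytes, declared: Optional[str], guesses: Iterable[str]
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes strictly."""
    candidates = ([declared] if declared else []) + list(guesses)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Strict decoding failed (or the codec is unknown): try the next one.
            continue
    raise ValueError("no candidate encoding could decode the content")


# Made-up document: UTF-8 text containing one stray 0xC3 byte.
page = "Comprendre les cookies : paramétrer ".encode("utf-8") + b"\xc3" + b" fin"

# Assumed ranking, mirroring the guesses reported in this comment.
text, used = decode_first_success(page, "utf-8", ["utf-8", "mac_roman"])
print(used, repr(text))
```

With a single bad byte, both UTF-8 attempts fail and the less probable single-byte encoding is the one that gets used for the entire document.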

I propose the following alternative:

- continue to first try with the encoding provided in content-type header
- if it fails, try the most probable encoding proposed by chardet (and only the most probable encoding, not all)
- if it fails, split the content line by line (using the \n value of the most probable encoding detected by chardet) and decode line-by-line:
  - try again the most probable encoding proposed by chardet
  - if it fails, ask chardet to guess again an encoding for the given line and try to decode with it
  - if it fails, ask Python to be permissive (i.e. replace unknown characters in the last encoding tried)

This is meant to:

- avoid trying less probable encodings, since they will fail to convert some characters of the document anyway, or at least create a lot of garbage as happened here (I assume we prefer one line of an HTML document with a few badly decoded characters over a whole document of incorrectly decoded characters)
- still do our best to decode as much of the document as possible correctly

Obviously there might still be some edge cases where splitting line-by-line will not work (e.g. minified JS which is a one-liner usually) and we might still have unknown characters.
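
The proposal above could look roughly like this. It is only a sketch under assumptions: the function names are hypothetical, `best_guess` must be a valid codec name without a BOM, and `guess_line` is a caller-supplied callback standing in for a per-line chardet call.

```python
from typing import Callable, Optional


def _decode_line(
    line: bytes, best_guess: str, guess_line: Callable[[bytes], Optional[str]]
) -> str:
    last_tried = best_guess
    try:
        return line.decode(best_guess)
    except UnicodeDecodeError:
        pass
    line_guess = guess_line(line)  # second opinion for this line only
    if line_guess:
        try:
            return line.decode(line_guess)
        except UnicodeDecodeError:
            last_tried = line_guess
        except LookupError:
            pass  # unknown codec name: fall back to best_guess
    # Permissive fallback: undecodable bytes become U+FFFD.
    return line.decode(last_tried, errors="replace")


def decode_line_by_line(
    data: bytes,
    declared: Optional[str],
    best_guess: str,
    guess_line: Callable[[bytes], Optional[str]],
) -> str:
    # Steps 1 and 2: strict decode of the whole document.
    for encoding in (declared, best_guess):
        if encoding:
            try:
                return data.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass
    # Step 3: split on the newline bytes of the most probable encoding.
    newline = "\n".encode(best_guess)
    return "\n".join(
        _decode_line(line, best_guess, guess_line) for line in data.split(newline)
    )


# Made-up input: only the second line contains a stray byte.
doc = b"caf\xc3\xa9\nbad \xc3 byte\nfin"
result = decode_line_by_line(doc, "utf-8", "utf-8", lambda line: None)
print(repr(result))
```

In this sketch the damage stays confined to the line holding the bad byte; a one-line document such as minified JS would gain nothing from the split.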

benoit74 · 2024-05-14T12:04:33Z

In fact, this is mostly what we've agreed we need to do in #185

benoit74 · 2024-05-14T13:22:54Z

After some tests, the alternative I proposed seems overly complex, and I could not construct an artificial test case where it provides any value.

Most probably this is because when only a few characters are improperly encoded, it is hard for chardet to guess their encoding anyway, and they are often surrounded by perfectly encoded characters.

I wonder if we should instead simplify the logic: do not try the multiple encodings guessed by chardet one after another, but only try the most probable encoding detected by chardet, and simply ignore (not replace) all bad characters that fail to decode.

This is clearly a tradeoff, but it seems more likely to produce a readable document in most situations.
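
A minimal sketch of that simplified idea, assuming the declared encoding is still tried strictly first. `decode_lenient` is a hypothetical name, `best_guess` stands in for chardet's most probable encoding, and this is not claimed to be what the eventual fix implements.

```python
from typing import Optional


def decode_lenient(data: bytes, declared: Optional[str], best_guess: str) -> str:
    """Strict decode with the declared encoding, else lenient decode with one guess."""
    if declared:
        try:
            return data.decode(declared)
        except (UnicodeDecodeError, LookupError):
            pass
    # errors="ignore" silently drops bytes that cannot be decoded.
    return data.decode(best_guess, errors="ignore")


# Made-up input: valid UTF-8 with one stray 0xC3 byte between two spaces.
broken = b"caf\xc3\xa9 \xc3 ok"
cleaned = decode_lenient(broken, "utf-8", "utf-8")
print(repr(cleaned))
```

The stray byte disappears and the rest of the text, including the accented characters, is preserved, at the cost of silently losing whatever that byte was meant to be.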

benoit74 · 2024-05-21T09:41:02Z

Fixed by #260

benoit74 added the question Further information is requested label Mar 26, 2024

benoit74 added bug Something isn't working and removed question Further information is requested labels Mar 26, 2024

benoit74 changed the title ~~Zimit2: is there another encoding problem with solidarité-numérique~~ Zimit2: another encoding problem with solidarité-numérique Mar 26, 2024

benoit74 mentioned this issue Apr 18, 2024

Problems with Zimit ZIM file format and URLs #86

Closed

benoit74 modified the milestone: 2.0.0 May 2, 2024

This was referenced May 17, 2024

Revisit decoding of documents from binary to string #260

Merged

Zimit2: do not crash when a rewriten resource (HTML/CSS/JS) has multiple encoding inside #185

Closed

benoit74 closed this as completed May 21, 2024

benoit74 self-assigned this May 21, 2024

benoit74 mentioned this issue Jun 14, 2024

Automated encoding detection is still not working properly #312

Closed
