Zimit2: another encoding problem with solidarité-numérique #221
Comments
OK, let me check if it's just some pages affected, and I'll try to reproduce.
Open the ZIM, and on the landing page, click "Lire la suite" under the "Comprendre les cookies" article. I see the problem in Kiwix Desktop, and the same in the PWA (v3.1.5). NB: you will need to let it update to be able to click on "Lire la suite". It is also the same in the Kiwix JS browser extension, except that I haven't fixed that yet, so it's not possible to open the page by clicking on "Lire la suite"; instead you have to search for the page "Comprendre les cookies".
I also see it in the article under "Qu'est-ce que le PIX...", but NOT in the article "Mettre à jour Windows 10..." (I get to these from "Lire la suite", as the tiles can't yet be clicked on due to #209). It's possible that this issue only affects older articles, made before the site introduced UTF-8 as standard? Quite a few of the other articles I have tested are fine. The affected articles are displayed correctly on the original Web site, e.g. https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/?thematique=internet . So it does appear to be a scraping issue (if you can reproduce it).
I now think it's more likely to be a problem with the readers "assuming" (correctly, according to OpenZIM spec) that all content in the ZIM is UTF-8 encoded. In fact, readers should recognize content encoded with another code page, and decode accordingly in the case of Zimit archives. But that's a bit of a challenge to implement in readers for what is "probably" an edge case? I don't know if, say, Chinese sites might regularly use different code pages. I know they often use UTF-3-byte, which has been an issue elsewhere (kiwix/kiwix-android#3587).
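To illustrate the reader-side effect, here is a minimal Python sketch. It is not taken from any Kiwix reader; the sample string and the choice of cp1252 as the "other code page" are assumptions made purely for illustration. It shows what a strict UTF-8 assumption does to content stored in a single-byte code page.

```python
# Hypothetical sample: text as a site might have stored it in a legacy code page.
text = "Comprendre les cookies : sécurité"
legacy_bytes = text.encode("cp1252")  # each 'é' becomes the single byte 0xE9

# A reader that assumes UTF-8 cannot decode a lone 0xE9, so each such byte
# is turned into U+FFFD (the replacement character).
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Decoding with the code page that was actually used restores the text.
as_cp1252 = legacy_bytes.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

A reader would need some out-of-band hint (or detection) to pick the second path, which is the implementation challenge mentioned above.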
I focused on the https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ page. The problem is that the content is declared as UTF-8 in the Content-Type header, and it is indeed mostly UTF-8 encoded (or at least it contains a lot of UTF-8 characters).
Unfortunately, there is an invalid UTF-8 sequence in the document: byte 0xC3 at position 43050 (0xA82A). chardet then suggests it might be …
As a refresher, the current logic is that we first try to decode content with the encoding declared in the HTTP Content-Type header (here UTF-8) …
This approach is probably too naive and doomed to fail as soon as there is a bad character. I propose the following alternative:
This is meant to:
Obviously there might still be some edge cases where splitting line by line will not work (e.g. minified JS, which is usually a one-liner), and we might still have unknown characters.
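The failure mode described in this comment (one stray byte in otherwise valid UTF-8 making a strict decode fail) can be reproduced with the standard library alone. This is only a sketch: the helper name `find_first_invalid_utf8` and the sample bytes are hypothetical and are not part of the scraper's code.

```python
def find_first_invalid_utf8(content: bytes):
    """Return (offset, byte value, reason) for the first undecodable byte, or None."""
    try:
        content.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset at which strict decoding gave up.
        return exc.start, content[exc.start], exc.reason
    return None


# Hypothetical document: valid UTF-8 with one stray lead byte (0xC3)
# that is not followed by a continuation byte.
prefix = "Les cookies sont stockés ".encode("utf-8")
sample = prefix + b"\xc3 sur votre appareil"

result = find_first_invalid_utf8(sample)
if result is not None:
    offset, value, reason = result
    print(f"invalid byte 0x{value:02X} at offset {offset} (0x{offset:X}): {reason}")
```

Because the strict decode raises on the first bad byte, the whole document is rejected even though every other byte is valid UTF-8.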
In fact, this is mostly what we've agreed we need to do in #185
After some tests, the alternative I proposed seems overly complex, and I have not managed to create a synthetic test case where it provides any value. This is most probably because, when only a few characters are improperly encoded, it is hard for chardet to guess their encoding anyway, and they are often surrounded by perfectly encoded characters. I wonder if we should instead simplify the logic: rather than trying the multiple encodings guessed by chardet one after another, only try the most probable encoding detected by chardet, and simply ignore (not replace) all bad characters that fail to decode. This is clearly a tradeoff, but it seems more likely to produce a readable document in most situations.
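A minimal sketch of that simplified idea, assuming a single candidate encoding and Python's built-in `errors="ignore"` handler. The function name `decode_lenient` and the sample bytes are hypothetical; this is not the code that was eventually merged.

```python
def decode_lenient(content: bytes, encoding: str) -> str:
    """Decode with one encoding only, silently dropping undecodable bytes."""
    return content.decode(encoding, errors="ignore")


# In practice `encoding` would be the most probable guess from a detector,
# e.g. chardet.detect(content)["encoding"]. It is hard-coded here (an
# assumption) so that the sketch runs without third-party packages.
sample = "stockés ".encode("utf-8") + b"\xc3" + b"sur votre appareil"

decoded = decode_lenient(sample, "utf-8")
print(decoded)
```

The stray byte disappears from the output instead of becoming a replacement character, which is the "ignore, not replace" tradeoff described above: the text stays readable, at the cost of silently losing whatever the bad byte was meant to be.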
Fixed by #260
@Jaifroid reported a potential encoding issue in solidarité-numérique test ZIM: #218 (comment) (while a new issue would have been way better ^^)
ZIM is at https://tmp.kiwix.org/ci/test-warc/solidarite-numerique_2024-03-18.zim
I don't know where the "Questions / réponses" is, so I used another example at https://www.solidarite-numerique.fr/thematiques/demarches-administratives/ and focused on the é character.
I downloaded the HTML file both from the ZIM and from the live site, and found the same content in both: C3A9, which is the correct value for é in UTF-8 (see https://www.fileformat.info/info/unicode/char/00e9/index.htm).
I tested the ZIM file in kiwix-serve and did not face any encoding problem.
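The byte check above can be reproduced in a few lines of Python. This is only an illustration; the cp1252 decode is an assumed example of what a reader would display if it misinterpreted those bytes.

```python
e_acute = "\u00e9"  # é, U+00E9

# UTF-8 encodes é as the two bytes 0xC3 0xA9.
utf8_hex = e_acute.encode("utf-8").hex().upper()

# If those two bytes are wrongly read as cp1252, each byte becomes its own
# character, which is the classic mojibake for é.
mojibake = e_acute.encode("utf-8").decode("cp1252")

print(utf8_hex)
print(mojibake)
```

Finding C3A9 in both files therefore indicates that this particular page is stored as valid UTF-8 in the ZIM.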
I don't get what's going on, I will continue testing on my machine with Kiwix JS and Kiwix PWA.