
Zimit2: another encoding problem with solidarité-numérique #221

Closed
benoit74 opened this issue Mar 26, 2024 · 10 comments
Labels: bug (Something isn't working)
Milestone: 2.0.0

@benoit74
Collaborator

@Jaifroid reported a potential encoding issue in the solidarité-numérique test ZIM: #218 (comment) (though a new issue would have been way better ^^)

ZIM is at https://tmp.kiwix.org/ci/test-warc/solidarite-numerique_2024-03-18.zim

I don't know where the "Questions / réponses" page is, so I used another example at https://www.solidarite-numerique.fr/thematiques/demarches-administratives/ and focused on the é character below.

[screenshot]

I've downloaded the HTML file both from the ZIM and from the online site, and found the same content:

[screenshot]

I've tested the ZIM file in kiwix-serve and did not face any encoding problem.

I don't get what's going on; I will continue testing on my machine with Kiwix JS and Kiwix PWA.

@benoit74 benoit74 added the question Further information is requested label Mar 26, 2024
@benoit74
Collaborator Author

I don't face the issue in Kiwix JS:

[screenshot]

Nor in Kiwix PWA (3.0.1).

@Jaifroid

OK, let me check if it's just some pages affected, and I'll try to reproduce.

@Jaifroid

Open the ZIM, and on the landing page click "Lire la suite" under the "Comprendre les cookies" article. This is what I see in Kiwix Desktop:

[screenshot]

And the same in the PWA (v3.1.5). NB: you will need to let it update to be able to click on "Lire la suite":

[screenshot]

It's the same in the Kiwix JS browser extension, except that I haven't fixed that yet, so it's not possible to open the page by clicking on "Lire la suite"; instead you have to search for the page "Comprendre les cookies".

@Jaifroid

Jaifroid commented Mar 26, 2024

I also see it in the article under "Qu'est-ce que le PIX...", but NOT in the article "Mettre à jour Windows 10..." (I get to these from "Lire la suite", as the tiles can't yet be clicked on due to #209).

Could it be that this issue only affects older articles, created before the site adopted UTF-8 as its standard encoding? Quite a few of the other articles I have tested are fine.

The affected articles are displayed correctly on the original Web site, e.g. https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/?thematique=internet . So it does appear to be a scraping issue (if you can reproduce it).

@Jaifroid

I now think it's more likely to be a problem with the readers "assuming" (correctly, according to the OpenZIM spec) that all content in the ZIM is UTF-8 encoded. Ideally, readers would recognize content encoded with another code page in Zimit archives and decode it accordingly. But that's a bit of a challenge to implement in readers for what is "probably" an edge case. I don't know if, say, Chinese sites might regularly use different code pages; I know they often use three-byte UTF-8 characters, which has been an issue elsewhere (kiwix/kiwix-android#3587).

@benoit74
Collaborator Author

Thank you, I wasn't lucky enough to click on the right link then!

The content inside the ZIM is broken, so it is not a reader problem at all (and obviously all readers are impacted):

[screenshot]

Definitely a problem to solve.

@benoit74 benoit74 added bug Something isn't working and removed question Further information is requested labels Mar 26, 2024
@benoit74 benoit74 changed the title from "Zimit2: is there another encoding problem with solidarité-numérique" to "Zimit2: another encoding problem with solidarité-numérique" Mar 26, 2024
@benoit74 benoit74 modified the milestone: 2.0.0 May 2, 2024
@benoit74
Collaborator Author

benoit74 commented May 14, 2024

I focused on the https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ page.

The problem is that the content is declared as UTF-8 in the Content-Type header.

chardet also detects it as most probably UTF-8 encoded (with 99% confidence).

And it is indeed mostly UTF-8 encoded (or at least it contains a lot of UTF-8 characters).

Unfortunately, the document contains a byte sequence that is invalid UTF-8:

[screenshot]

0xC3 at position 43050 (0xA82A)

[screenshot]

chardet then suggests the content might be MacRoman encoded (with 71% confidence). That decoding succeeds, so we use it for the whole file, and this breaks everything, since the real encoding is UTF-8 and only this one character was badly encoded.
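
To make the failure mode concrete, here is a minimal sketch reproducing the diagnosis with chardet (the file name is made up; in the scraper the bytes come from the WARC record):

```python
import chardet

# Hypothetical local copy of the page; in the scraper the bytes come
# from the WARC record.
with open("comprendre-les-cookies.html", "rb") as fh:
    content = fh.read()

# chardet is ~99% confident the document is UTF-8...
print(chardet.detect(content))

# ...but strict UTF-8 decoding fails on the stray 0xC3 byte.
try:
    content.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # "invalid continuation byte" around position 43050

# detect_all() lists fallback candidates; MacRoman appears at ~71%.
# MacRoman assigns a character to every single byte value, so decoding
# the whole file with it always "succeeds", silently garbling every
# correctly encoded multi-byte UTF-8 character.
print(chardet.detect_all(content))
print(content.decode("mac_roman")[:80])
```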

As a refresher, the current logic is: we first try to decode the content with the encoding declared in the HTTP Content-Type header (here text/html; charset=UTF-8); if that fails, we feed the first 1024 bytes to chardet, which returns all candidate encodings, and we try them one by one (ordered from most to least probable) until one succeeds. The first encoding that succeeds is used.
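
In code, this current logic looks roughly like the following sketch (a simplification for illustration, not the actual scraper code; the function name is made up):

```python
import chardet

def decode_current(content: bytes, declared_encoding: str | None) -> str:
    # Try the charset declared in the Content-Type header first, then
    # every chardet candidate from most to least probable, and keep
    # the first one that decodes without error.
    candidates = [declared_encoding] if declared_encoding else []
    candidates += [
        guess["encoding"]
        for guess in chardet.detect_all(content[:1024])  # first 1024 bytes only
        if guess["encoding"]
    ]
    for encoding in candidates:
        try:
            return content.decode(encoding)  # strict: fails on any bad byte
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("no suitable encoding found")
```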

This approach is probably too naive: it is doomed to fail as soon as there is a single bad character.

I propose the following alternative (see the code sketch after the rationale below):

  • continue to first try the encoding provided in the Content-Type header
  • if that fails, try the most probable encoding proposed by chardet (and only the most probable one, not all of them)
  • if that fails, split the content line by line (using the \n byte sequence of the most probable encoding detected by chardet) and decode line by line:
    • try again with the most probable encoding proposed by chardet
    • if that fails, ask chardet to guess an encoding for the given line and try to decode with that
    • if that fails, ask Python to be permissive (i.e. replace characters that cannot be decoded with the last encoding tried)

This is meant to:

  • not try the less probable encodings, since they would fail on some characters of the whole document anyway, or at least create a lot of garbage as happened here (I assume we prefer one line of an HTML document with a few badly formatted characters over a whole document of incorrectly formatted characters)
  • still try our best to decode as much of the document as possible correctly

Obviously there might still be some edge cases where splitting line by line will not work (e.g. minified JS, which is usually a one-liner), and we might still encounter unknown characters.
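
For illustration, a rough sketch of this three-step fallback (function name and details are made up; it assumes chardet's single best guess is used for the line splitting):

```python
import chardet

def decode_proposed(content: bytes, declared_encoding: str | None) -> str:
    # 1. first try the encoding declared in the Content-Type header
    if declared_encoding:
        try:
            return content.decode(declared_encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    # 2. then try only chardet's most probable guess
    best = chardet.detect(content[:1024])["encoding"] or "utf-8"
    try:
        return content.decode(best)
    except UnicodeDecodeError:
        pass
    # 3. split on the newline bytes of the best guess, decode per line
    decoded_lines = []
    for raw_line in content.split("\n".encode(best)):
        try:
            decoded_lines.append(raw_line.decode(best))
            continue
        except UnicodeDecodeError:
            pass
        line_guess = chardet.detect(raw_line)["encoding"]
        try:
            if line_guess:
                decoded_lines.append(raw_line.decode(line_guess))
                continue
        except UnicodeDecodeError:
            pass
        # last resort: be permissive and replace undecodable bytes
        decoded_lines.append(raw_line.decode(line_guess or best, errors="replace"))
    return "\n".join(decoded_lines)
```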

@benoit74
Collaborator Author

In fact, this is mostly what we agreed we need to do in #185.

@benoit74
Collaborator Author

After some tests, the alternative I proposed seems overly complex, and I have not managed to create a synthetic test case where it provides any value.

Most probably this is because when only a few characters are improperly encoded, it is hard for chardet to guess their encoding anyway, and they are often surrounded by perfectly encoded characters.

I wonder whether we should instead just simplify the logic: do not try the multiple encodings guessed by chardet in a row, try only the most probable one, and simply ignore (not replace) any characters that fail to decode.

This is clearly a tradeoff, but it seems much more likely to produce a readable document in most situations.
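
That simplified logic would boil down to something like this sketch (again illustrative, not the final implementation):

```python
import chardet

def decode_simplified(content: bytes, declared_encoding: str | None) -> str:
    # Try the declared charset first, then only chardet's most
    # probable guess, dropping (not replacing) any bytes that fail
    # to decode so the rest of the document stays intact.
    if declared_encoding:
        try:
            return content.decode(declared_encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    best = chardet.detect(content[:1024])["encoding"] or "utf-8"
    return content.decode(best, errors="ignore")
```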

@benoit74
Collaborator Author

Fixed by #260
