Revisit handling of special characters in ZIM / HTML URLs #218

benoit74 · 2024-03-18T10:02:12Z

Rationale

Following the experiments in openzim/libzim#865 (comment), it now seems clear that:

ZIM entry URL/Path must be fully decoded to be reachable from the reader
HTML URL must be fully encoded to be reachable from the reader (e.g. querystring not dropped by webserver before being passed to libzim)

Following webrecorder/browsertrix-crawler#492, it is also now clear that URLs found in WARC records (WARC-Target-URI) are always URL-encoded.

This PR implements required changes to match these two "new" understandings.

Changes

Main changes are done with commit 202d6c9:

completely revisit the logic to compute ZIM path and to rewrite URLs found in documents:
- ZIM path is always fully decoded
  - hostname is puny-decoded
  - path and querystring are url decoded
- Document URLs are fully url-encoded, except the fragment (which stays client-side anyway)
  - there is no more querystring in the document URL, so nothing can be dropped by the kiwix-serve webserver or other clients
  - the ZIM entry is directly and properly addressed under all conditions
- see explanations in src/warc2zim/url_rewriting.py (document beginning + normalize function)
known_urls argument / attribute is renamed to ~~existing_zim_paths~~ expected_zim_items (and same name is used in the whole codebase for clarity)
indexed_urls attribute is renamed to added_zim_items
rename reduce method to apply_fuzzy_rules (conveys far more meaning / less confusion)
rename from_normalized method to get_document_uri
Add HttpUrl and ZimPath classes so that it is now way clearer when we are dealing with a URL and when we are dealing with a Path
- many renaming + code adaptations linked to these two new classes
- use these classes mainly in apply_fuzzy_rules, get_document_uri and normalize
Many tests have been rewritten:
- some were assuming invalid values
- some were doing too much logic to generate test cases, while an exhaustive list is far easier to understand and ensures we are testing what we intend to test (some tests were using almost exactly the same logic to generate the test cases as the code under test, so assertions were of course always matching)
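
The two directions described in the list above (decoded ZIM path, encoded document URL) can be sketched as follows. This is only an illustration under stated assumptions, not the code in src/warc2zim/url_rewriting.py: the function names `zim_path_from_url` and `document_href` are hypothetical, the port is ignored, the query string is decoded with plain `unquote`, and the computation of a path relative to the current document is left out.

```python
from urllib.parse import quote, unquote, urlsplit


def zim_path_from_url(url: str) -> str:
    """Hypothetical helper: URL-encoded http(s) URL -> fully decoded ZIM path."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    try:
        # Puny-decode the hostname ("xn--..." labels become Unicode).
        hostname = hostname.encode("ascii").decode("idna")
    except UnicodeError:
        pass  # keep the hostname unchanged if it is not valid IDNA
    path = hostname + unquote(parts.path)
    if parts.query:
        # The querystring becomes part of the ZIM path, decoded as well.
        path += "?" + unquote(parts.query)
    return path  # scheme and fragment are not part of the ZIM path


def document_href(zim_path: str, fragment: str = "") -> str:
    """Hypothetical helper: ZIM path -> fully URL-encoded link for a document."""
    # "?" and "=" are encoded too, so a webserver sees no querystring to drop.
    encoded = quote(zim_path, safe="/")
    # The fragment stays client-side and is appended without re-encoding.
    return f"{encoded}#{fragment}" if fragment else encoded


# Usage: build a punycode hostname, then go from WARC URL to ZIM path to link.
host = "bücher.example".encode("idna").decode("ascii")
zim_path = zim_path_from_url(f"https://{host}/a%20b.html?q=%C3%A9")
print(zim_path)
print(document_href(zim_path, "top"))
```

The point of the sketch is the asymmetry: the ZIM path is stored in its most readable, decoded form, while every link written into a document encodes that path in full so that it survives the trip through a webserver.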

It also includes smaller changes / fixes:

Skipped items (duplicates) are logged only once per scraper run instead of every time they are encountered
All URLs whose scheme is not empty or http(s) are not rewritten at all (data, blob, tel, mailto, ftp, ...)
A small fix to escape the '&' character in a URL in the test website: fcaf1ad
- The problem is tracked in "Zimit2: Is there an HTML parser issue with some special characters? #219"; it still has to be investigated but is not a dependency to merge this PR
A change in the fuzzy rule which removes "digits-only" query parameter: 60d174b
- The trailing ? does not carry any meaning, so I suggest removing it as well
- Browsers do not send a trailing ? when present in a URL, so WARC records won't be present with trailing ? as target URI
- Another approach would be to support empty querystring, but this would cause problems (we wouldn't be able to easily find the corresponding WARC record / ZIM item due to previous point about browser behavior)
A final (I hope) fix to properly ignore resource WARC records: 4068d85
- Work was supposed to already be done in Ignore resource WARC records for now #198 but work on this PR made me realize we have some code which was still executed, and even a test WARC archive
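
Two of the smaller changes above (leaving non-http(s) schemes untouched, and ignoring a trailing `?`) can be illustrated with a short sketch. The helper names `should_rewrite` and `drop_trailing_question_mark` are hypothetical and are not taken from the warc2zim codebase; the second helper deliberately ignores URLs that carry a fragment after the `?`.

```python
from urllib.parse import urlsplit


def should_rewrite(url: str) -> bool:
    """Hypothetical check: only scheme-less and http(s) URLs are rewritten."""
    scheme = urlsplit(url).scheme.lower()
    return scheme in ("", "http", "https")


def drop_trailing_question_mark(url: str) -> str:
    """Hypothetical helper: treat an empty querystring as no querystring."""
    return url[:-1] if url.endswith("?") else url


# Usage: data, blob, tel, mailto, ftp, ... URLs are left exactly as found.
for candidate in ("../img/logo.png", "https://example.com/", "mailto:someone@example.com"):
    print(candidate, should_rewrite(candidate))
print(drop_trailing_question_mark("https://example.com/page?"))
```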

Test ZIMs

test website:
- ZIM is at https://tmp.kiwix.org/ci/test-warc/test_website_2024-03-18.zim
solidarité numérique:
- ZIM is at https://tmp.kiwix.org/ci/test-warc/solidarite-numerique_2024-03-18.zim
- still impacted by Zimit2: HTTP return codes are not handled properly #220 and onxxx link not rewritten #209
thales:
- ZIM is at https://tmp.kiwix.org/ci/test-warc/thalesdoc_en_all_2024-03-18.zim
- PDFs with space in name are now opening properly

codecov · 2024-03-18T12:38:45Z

Codecov Report

Attention: Patch coverage is 86.80556% with 19 lines in your changes missing coverage. Please review.

Project coverage is 85.98%. Comparing base (f20d331) to head (83438d6).
Report is 1 commits behind head on warc2zim2.

| Files | Patch % | Lines |
|---|---|---|
| src/warc2zim/url_rewriting.py | 83.83% | 9 Missing and 7 partials ⚠️ |
| src/warc2zim/converter.py | 91.66% | 1 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@              Coverage Diff              @@
##           warc2zim2     #218      +/-   ##
=============================================
- Coverage      87.55%   85.98%   -1.57%     
=============================================
  Files             13       13              
  Lines            980     1049      +69     
  Branches         179      195      +16     
=============================================
+ Hits             858      902      +44     
- Misses           102      116      +14     
- Partials          20       31      +11


benoit74 · 2024-03-18T14:05:25Z

@Jaifroid FYI if you are interested in testing very early stage new zimit2 ZIMs

Jaifroid · 2024-03-18T16:49:49Z

Just want to say... wow! That's a lot of work! 🎉

Jaifroid · 2024-03-18T17:36:46Z

@benoit74 With a small adjustment PR (kiwix/kiwix-js-pwa#577), the Thales ZIM is now working well in the PWA. PDFs open, and offline video is working fine (this was a fear of mine, so congrats)! To test video on other readers just search for YouTube in the search bar. Will fix KJS with same small adjustment in due course. Hope to test the other ZIM soon. Many thanks!

benoit74 · 2024-03-18T19:43:29Z

@Jaifroid Thank you for the test and confirmation, and glad you found the Youtube video! And of course, glad that it is easy to support these in KJS and PWA readers! Unfortunately we still have other significant issues to fix before zimit2 reaches an acceptable level, so you will get even more ZIMs to test in due time.

rgaudin

👏

Can you clarify role of self.indexed_urls now that we have self.existing_zim_paths?

Jaifroid · 2024-03-25T10:15:16Z

I've now tested the solidarité numérique ZIM mentioned in first post of this PR, and it is definitely an improvement, though I confirm the two bugs remaining that were mentioned there.

I don't think it's related, but I noticed that this ZIM, at least in Kiwix JS, the PWA and Kiwix Desktop on Windows, seems to have a character encoding issue (see screenshot, and look at any character that should have an accent). I don't know if this happened with the zimit1 version (I can't find a zimit1 version of this ZIM either in the zimit directory or in the development library). If it is an error with zimit2 due to the assumption that all OpenZIM archives are UTF-8 encoded, then we may need a new issue to handle different character sets in zimit2?

mgautierfr

The change in the expected value for relative paths disturbs me.
Now the relative paths have an extra ../ and changing that cannot be a small side effect.
I have the feeling that either the previous version or the new version can work, but not both. And the previous version was working pretty well (with relative paths at least)...

benoit74 · 2024-03-26T14:37:37Z

> Can you clarify role of self.indexed_urls now that we have self.existing_zim_paths?

Very good question!

indexed_urls is the list of entries really added to the ZIM: items are added to this list at the same moment we add an item to the ZIM. existing_zim_paths is the list of expected ZIM paths, based on a first exploration of the WARC content.

I think they both serve a different purpose, e.g. indexed_urls is used to create alias from WARC redirect entries if we have not already added an item inside the ZIM.

I however find both very confusing and badly named (even the new existing_zim_paths). Comparing the two sets would even have allowed us to anticipate issues like #220 where many items of existing_zim_paths never actually made it to the ZIM / indexed_urls.

I propose to postpone this topic for a next PR where I will rename these two lists, merge them with some additional status info and use them for detecting scraper issues.

benoit74 · 2024-03-27T09:38:06Z

Edit: I finally renamed indexed_urls to added_zim_items and existing_zim_paths (was known_urls) to expected_zim_items. It is indeed way clearer and probably sufficient for now (the only impact is memory consumption, but we can live with it for a few weeks)

First comment updated.

mgautierfr · 2024-03-27T10:59:01Z

> I think they both serve a different purpose, e.g. indexed_urls is used to create alias from WARC redirect entries if we have not already added an item inside the ZIM.

This is also used to skip potential duplicated entries in the WARC.

The existing_zim_paths/known_urls/expected_zim_items set is all the URLs/paths we know about. All entry paths in the ZIM and all relative links (more exactly, their absolute path once the relative link is resolved) must be IN this set. Once we have generated it (in gather_informations), it is constant.

indexed_urls/added_zim_items is what has actually been added to the ZIM file. It is mutable and, by definition, it is always a subset of the previous set.
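
The relationship between the two sets can be sketched in a few lines. This is only an illustration of the invariant described in this comment, not the converter's code: the `add_item` helper and the sample paths are hypothetical, and the final difference shows the kind of check proposed earlier in this thread for spotting items that never made it into the ZIM.

```python
# Built once from a first pass over the WARC, then never modified.
expected_zim_items = frozenset(
    {"example.com/", "example.com/a.html", "example.com/img.png"}
)
# Grows as items are really written to the ZIM.
added_zim_items: set[str] = set()


def add_item(path: str) -> bool:
    """Hypothetical helper: record an item, skipping duplicates."""
    if path not in expected_zim_items:
        raise ValueError(f"unexpected ZIM path: {path}")
    if path in added_zim_items:
        return False  # duplicate WARC record, skipped
    added_zim_items.add(path)
    return True


add_item("example.com/")
add_item("example.com/a.html")
add_item("example.com/a.html")  # second occurrence is ignored

# The added set is always a subset of the expected set.
assert added_zim_items <= expected_zim_items
# Anything left over was expected but never added: a hint of a scraper issue.
missing = expected_zim_items - added_zim_items
print(sorted(missing))
```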

mgautierfr

LGTM
I'm less sure about the last commit, but we can indeed re-add it later if needed. Time to move on with this PR.

benoit74 · 2024-04-04T16:04:39Z

@rgaudin may I merge or do you still need to review this?

@mgautierfr I agree about uncertainties regarding last commit, but my tests are successful and I hate code which "might be useful but we are not sure anymore"

Jaifroid · 2024-04-08T09:43:01Z

@benoit74 Many thanks for finalizing this! Do we perhaps need a new test ZIM (or ZIMs) based on the merged code? I need to ensure my reader-side code is doing the correct number of decode steps before extracting articles from the ZIM when handling links clicked by the user. Or else tell me if it is safe to test against the ZIMs in the first post of this PR.

benoit74 · 2024-04-08T11:12:52Z

I do not think we are yet at a stage where it is worth testing new ZIMs; we have many issues to address first (including the wombat.js configuration, which needs to be adapted as well).

Rest assured I will create some in due time; we have all readers to test anyway, not only yours.

I prefer to save my time and focus on fixing everything that needs to be fixed, rather than creating new ZIMs and getting a lot of feedback to which I would probably too often respond "yes, I know, this is issue xxx".

Jaifroid · 2024-04-08T11:27:50Z

> I prefer to save my time and focus on fixing everything that needs to be fixed, rather than creating new ZIMs and getting a lot of feedback to which I would probably too often respond "yes, I know, this is issue xxx".

OK, thanks, I understand! Note that I try to test issues I've found on Kiwix Serve and Kiwix Desktop, to corroborate, not just on the readers I'm responsible for. I'll await further advice.

benoit74 self-assigned this Mar 18, 2024

benoit74 changed the base branch from main to warc2zim2 March 18, 2024 10:02

benoit74 force-pushed the url_handling branch 4 times, most recently from 4216d66 to 17eabca Compare March 18, 2024 12:37

benoit74 added 4 commits March 18, 2024 14:03

Fix improperly escaped character in test-website

44a54d4

Rewrite fuzzy rule to not contain the trailing ? when querystring is empty

def7bc1

Really do not consider 'resource' WARC record for all operations

40edf3c

Rework transformation of WARC record url to ZIM path and URL normalization to support encoded URL and query strings

63bf7a9

benoit74 force-pushed the url_handling branch from 576c5a7 to 63bf7a9 Compare March 18, 2024 14:03

benoit74 marked this pull request as ready for review March 18, 2024 14:03

benoit74 requested review from rgaudin and mgautierfr March 18, 2024 14:04

Jaifroid mentioned this pull request Mar 18, 2024

Workaround for HTML URLs with percent-encoded querystring separators in zimit2 kiwix/kiwix-js-pwa#577

Merged

rgaudin requested changes Mar 19, 2024

mgautierfr requested changes Mar 25, 2024

Add development directories to .gitignore

95492e5

benoit74 mentioned this pull request Mar 26, 2024

Zimit2: another encoding problem with solidarité-numérique #221

Closed

benoit74 mentioned this pull request Mar 26, 2024

Zimit2: add more resiliency / automatic detection of missing ZIM entries #222

Open

Jaifroid mentioned this pull request Mar 27, 2024

Zimit2: Fix URL encoding of ZIM items #206

Closed

benoit74 added 2 commits March 27, 2024 09:22

apply_fuzzy_rules always returns a str indeed

c6d5937

Fix typo

eeafb62

benoit74 added 2 commits March 27, 2024 09:46

Use PurePosixPath now that it can walk_up in Python 3.12

0b64bd4

Rename variables for clarity

18cf34f

benoit74 requested review from rgaudin and mgautierfr March 27, 2024 09:47

benoit74 mentioned this pull request Mar 27, 2024

Zimit2: Youtube videos are not working everywhere openzim/zimit#291

Closed

Remove extra Youtube rule which is not needed anymore

83438d6

This rule was needed most probably only because of a trailing ? in some URLs

mgautierfr approved these changes Mar 29, 2024

rgaudin approved these changes Apr 4, 2024

benoit74 merged commit 7809a6d into warc2zim2 Apr 5, 2024
4 of 6 checks passed

benoit74 deleted the url_handling branch April 5, 2024 07:00

benoit74 mentioned this pull request Apr 5, 2024

Links to server root are not working #210

Closed

benoit74 mentioned this pull request Apr 18, 2024

Problems with Zimit ZIM file format and URLs #86

Closed
