Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit handling of special characters in ZIM / HTML URLs #218

Merged
merged 10 commits into from
Apr 5, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Mar 18, 2024

Fix #206
Fix #210

Rationale

Following openzim/libzim#865 (comment) experiments, it seems now clear that:

  • ZIM entry URL/Path must be fully decoded to be reachable from the reader
  • HTML URL must be fully encoded to be reachable from the reader (e.g. querystring not dropped by webserver before being passed to libzim)

Following webrecorder/browsertrix-crawler#492, it is also now clear that URLs found in WARC record (WARC-Target-URI) is always URL-encoded.

This PR implements required changes to match these two "new" understandings.

Changes

Main changes are done with commit 202d6c9:

  • completely revisit the logic to compute ZIM path and to rewrite URLs found in documents:
    • ZIM path is always fully decoded
      • hostname is puny-decoded
      • path and querystring are url decoded
    • Document URLs are fully url-encoded, except the fragment (which stays client-side anyway)
      • there is no more querystring, so it is dropped by kiwix-serve webserver or other clients
      • the ZIM entry is directly and properly addressed under all conditions
    • see explanations in src/warc2zim/url_rewriting.py (document beginning + normalize function)
  • known_urls argument / attribute is renamed to existing_zim_paths expected_zim_items (and same name is used in the whole codebase for clarity)
  • indexed_urls attribute is renamed to added_zim_items
  • rename reduce method to apply_fuzzy_rules (convey way more meaning / less confusion)
  • rename from_normalized method to get_document_uri
  • Add HttpUrl and ZimPath classes so that it is now way clearer when we are dealing with a URL and when we are dealing with a Path
    • many renaming + code adaptations linked to these two new classes
    • use these classes mainly in apply_fuzzy_rules, get_document_uri and normalize
  • Many tests have been rewritten:
    • some where assuming invalid values
    • some where doing to much logic to generate test cases while an exhaustive list is way easier to understand and ensures we are testing what we intend to test (some tests where using almost exactly the same logic to generate the test than the code under test, so assertions were of course always matching)

It also includes smaller changes / fixes:

  • Skipped items (duplicate) are logged only once per scrapper run instead of everytime they are encountered
  • All URLs whose scheme is not empty or http(s) are not rewritten at all (data, blob, tel, mailto, ftp, ...)
  • A small fix to escape '&' character in URL in test website]: fcaf1ad:
  • A change in the fuzzy rule which removes "digits-only" query parameter: 60d174b
    • The trailing ? does not provide any meaning, I suggest to remove it as well
    • Browsers do not send a trailing ? when present in a URL, so WARC records won't be present with trailing ? as target URI
    • Another approach would be to support empty querystring, but this would cause problems (we wouldn't be able to easily find the corresponding WARC record / ZIM item due to previous point about browser behavior)
  • A final (I hope) fix to properly ignore resource WARC records: 4068d85

Test ZIMs

@benoit74 benoit74 self-assigned this Mar 18, 2024
@benoit74 benoit74 changed the base branch from main to warc2zim2 March 18, 2024 10:02
@benoit74 benoit74 force-pushed the url_handling branch 4 times, most recently from 4216d66 to 17eabca Compare March 18, 2024 12:37
Copy link

codecov bot commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 86.80556% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 85.98%. Comparing base (f20d331) to head (83438d6).
Report is 1 commits behind head on warc2zim2.

Files Patch % Lines
src/warc2zim/url_rewriting.py 83.83% 9 Missing and 7 partials ⚠️
src/warc2zim/converter.py 91.66% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #218      +/-   ##
=============================================
- Coverage      87.55%   85.98%   -1.57%     
=============================================
  Files             13       13              
  Lines            980     1049      +69     
  Branches         179      195      +16     
=============================================
+ Hits             858      902      +44     
- Misses           102      116      +14     
- Partials          20       31      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 marked this pull request as ready for review March 18, 2024 14:03
@benoit74
Copy link
Collaborator Author

@Jaifroid FYI if you are interested in testing very early stage new zimit2 ZIMs

@Jaifroid
Copy link

Just want to say... wow! That's a lot of work! 🎉

@Jaifroid
Copy link

@benoit74 With a small adjustment PR (kiwix/kiwix-js-pwa#577), the Thales ZIM is now working well in the PWA. PDFs open, and offline video is working fine (this was a fear of mine, so congrats)! To test video on other readers just search for YouTube in the search bar. Will fix KJS with same small adjustment in due course. Hope to test the other ZIM soon. Many thanks!

@benoit74
Copy link
Collaborator Author

@Jaifroid Thank you for the test and confirmation, and glad you found the Youtube video! And of course, glad that it is easy to support these in KJS and PWA readers! Unfortunately we still have significant other issues to fix until zimit2 reach an acceptable level, so you will get even more ZIMs to test in due time.

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

Can you clarify role of self.indexed_urls now that we have self.existing_zim_paths?

src/warc2zim/url_rewriting.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/converter.py Outdated Show resolved Hide resolved
@Jaifroid
Copy link

I've now tested the solidarité numérique ZIM mentioned in first post of this PR, and it is definitely an improvement, though I confirm the two bugs remaining that were mentioned there.

I don't think it's related, but I noticed this ZIM, at least in Kiwix JS, the PWA and Kiwix Desktop on Windows, seems to have a character encoding issue (see screenshot, and look at any character that should have an accent). I don't know if this happened with the zimit1 version (I can't find a zimit1 version of this ZIM either in the zimit directory nor in the development library). If it is an error with zimit2 due to the assumption that all OpenZIM archives are UTF-8 encoded, then we may need a new issue to handle different character sets in zimit2?

image

Copy link
Contributor

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in the expected value for relative path disturb me.
Now the relative paths have a extra ../ and changing that cannot be a small side effect.
I have the feeling that either the previous version or the new version can work, but not both. And the previous version was working pretty well (with relative path at least)...

src/warc2zim/url_rewriting.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/url_rewriting.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
tests/test_css_rewriting.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Show resolved Hide resolved
@benoit74
Copy link
Collaborator Author

Can you clarify role of self.indexed_urls now that we have self.existing_zim_paths?

Very good question!

indexed_urls is the list of entries really added to the ZIM, items are added in this list at the same moment we add an item to the ZIM, while existing_zim_paths is the list of expect ZIM paths, based on a first exploration of WARC content.

I think they both serve a different purpose, e.g. indexed_urls is used to create alias from WARC redirect entries if we have not already added an item inside the ZIM.

I however find both very confusing and badly named (even the new existing_zim_paths). Comparing the two sets would even have allowed us to anticipate issues like #220 where many items of existing_zim_paths never actually made it to the ZIM / indexed_urls.

I propose to postpone this topic for a next PR where I will rename these two lists, merge them with some additional status info and use them for detecting scraper issues.

@benoit74
Copy link
Collaborator Author

Edit: I finally renamed indexed_urls to added_zim_items and existing_zim_paths (was known_urls) to expected_zim_items. It is indeed way clearer and probably sufficient for now (only impact is memory consumption, but we can live with it for few weeks)

First comment updated.

@mgautierfr
Copy link
Contributor

I think they both serve a different purpose, e.g. indexed_urls is used to create alias from WARC redirect entries if we have not already added an item inside the ZIM.

This is also used to skip potential duplicated entries in the WARC.

The existing_zim_paths/known_urls/expected_zim_items is all the url/path we know about. All entries in the zim path and all relative links (more exactly, they absolute path once relative link is resolved) must be IN this set. Once we have generated it (in gather_informations), it is constant.

indexed_urls/added_zim_items is what have been actually added to the zim file. It is mutable and by definition, it is always a subset of the previous set.

This rule was needed most probably only because of a trailing ? in some
URLs
Copy link
Contributor

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I'm less sure about the last commit but we can indeed readd it later if needed. Time to move on with this PR.

@benoit74
Copy link
Collaborator Author

benoit74 commented Apr 4, 2024

@rgaudin may I merge or do you still need to review this?

@mgautierfr I agree about uncertainties regarding last commit, but my tests are successful and I hate code which "might be useful but we are not sure anymore"

@benoit74 benoit74 merged commit 7809a6d into warc2zim2 Apr 5, 2024
4 of 6 checks passed
@benoit74 benoit74 deleted the url_handling branch April 5, 2024 07:00
@Jaifroid
Copy link

Jaifroid commented Apr 8, 2024

@benoit74 Many thanks for finalizing this! Do we perhaps need a new test ZIM (or ZIMs) based on the merged code? I need to ensure my reader-side code is doing the correct number of decode steps before extracting articles from the ZIM when handling links clicked by the user. Or else tell me if it is safe to test against the ZIMs in the first post of this PR.

@benoit74
Copy link
Collaborator Author

benoit74 commented Apr 8, 2024

I do not think we are yet at a stage where it is worth to test new ZIMs, we have many issues to address first (including wombat.js configuration that needs to be adapted as well).

Be sure I will create some in due time, we have all readers to test anyways, not only yours.

I prefer to spare my time in order to focus on fixing everything that needs to be, rather than creating new ZIMs and getting many feedbacks where I would probably too often respond "yes, I know, this is issue xxx".

@Jaifroid
Copy link

Jaifroid commented Apr 8, 2024

I prefer to spare my time in order to focus on fixing everything that needs to be, rather than creating new ZIMs and getting many feedbacks where I would probably too often respond "yes, I know, this is issue xxx".

OK, thanks, I understand! Note that I try to test issues I've found on Kiwix Serve and Kiwix Desktop, to corroborate, not just on the readers I'm responsible for. I'll await further advice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Links to server root are not working Zimit2: Fix URL encoding of ZIM items
4 participants