Zimcheck internal URL checking seems to ignore URLencoding AND HTML entities #378

Closed
kelson42 opened this issue Oct 29, 2023 · 1 comment · Fixed by #383

One of the most important features of zimcheck seems to be really buggy and weak. The check of internal URLs, i.e. verifying that URLs in the HTML point to real entries in the ZIM, seems to just take the href value from the HTML and look it up, as is, in the archive.

This means that an error will be wrongly reported if:

  • The URL is percent-encoded, given that the archive paths are not
  • The URL contains legitimate HTML entities, for example the ones encoding " or '

The latter scenario is what happens with this ZIM:
wikipedia_en_canada_2023-10.zim.zip

I got the error:

$ zimcheck wikipedia_en_canada_2023-10.zim 
[INFO] Checking zim file wikipedia_en_canada_2023-10.zim
[INFO] Zimcheck version is 3.2.0
[INFO] Verifying ZIM-archive structure integrity...
[INFO] Avoiding redundant checksum test (already performed by the integrity check).
[INFO] Checking metadata...
[INFO] Searching for Favicon...
[INFO] Searching for main page...
[INFO] Verifying Articles' content...
[INFO] Searching for redundant articles...
  Verifying Similar Articles for redundancies...
[INFO] Checking for redirect loops...
[WARNING] Redundant data found:
  -/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-pt-br.vtt and -/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-pt.vtt
[ERROR] Invalid internal links found:
  The following links:
- ../-/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt
(-/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt) were not found in article A/Canada
[INFO] Overall Test Status: Fail
[INFO] Total time taken by zimcheck: <3 seconds.
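
The failure above is consistent with the raw href being looked up without any decoding. As a minimal sketch of the normalization described in this issue, and not zimcheck's actual code (zimcheck is written in C++), the Python below decodes HTML entities, then percent-encoding, then resolves the relative link against the linking article. The helper name `normalize_internal_link` is hypothetical, the sketch assumes the href carries the quotes as `&quot;` or `%22`, and it ignores query strings and fragments.

```python
import html
import posixpath
from urllib.parse import unquote


def normalize_internal_link(raw_href: str, article_path: str) -> str:
    """Turn a raw href attribute value into a candidate archive path.

    Hypothetical helper for illustration only; not part of zimcheck.
    """
    # 1. Decode HTML character references (&quot; -> ", &#39; -> ').
    decoded = html.unescape(raw_href)
    # 2. Undo percent-encoding (%22 -> "), since archive paths are
    #    stored unencoded.
    decoded = unquote(decoded)
    # 3. Resolve the relative link against the directory of the
    #    article that contains it, collapsing any ".." segments.
    base_dir = posixpath.dirname(article_path)
    return posixpath.normpath(posixpath.join(base_dir, decoded))


# Both spellings of the same link resolve to the same entry path.
entity_form = normalize_internal_link(
    '../-/File:&quot;O_Canada&quot;.vtt', 'A/Canada')
percent_form = normalize_internal_link(
    '../-/File:%22O_Canada%22.vtt', 'A/Canada')
print(entity_form == percent_form)
```

With this kind of normalization, the resolved path is what would be compared against the archive entries, rather than the literal href text.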
@kelson42 kelson42 added the bug label Oct 29, 2023
@kelson42 kelson42 added this to the 3.3.0 milestone Oct 29, 2023

kelson42 commented Oct 29, 2023

@veloman-yunkan @mgautierfr I'm very surprised to discover such a hairy bug so late. Please confirm and possibly fix (should be complicated) ASAP. This bug was actually discovered while hardening the testing around MWoffliner.

For the rest, it seems to work, and I would be glad to merge and release in 3.3.0.
