Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding of HTML entities in links #383

Merged
merged 11 commits into from
Nov 15, 2023
Merged

Decoding of HTML entities in links #383

merged 11 commits into from
Nov 15, 2023

Conversation

veloman-yunkan
Copy link
Collaborator

Fixes #378

URI-decoding of links extracted from the HTML was performed, so I only had to add handling of HTML entities. I did so only for the syntactically important characters &, <, > and ".

Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (b8a0a4c) 27.49% compared to head (a9d0e1c) 28.06%.

Files Patch % Lines
src/tools.cpp 56.25% 5 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #383      +/-   ##
==========================================
+ Coverage   27.49%   28.06%   +0.57%     
==========================================
  Files          26       26              
  Lines        2550     2576      +26     
  Branches     1356     1371      +15     
==========================================
+ Hits          701      723      +22     
- Misses       1368     1369       +1     
- Partials      481      484       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mgautierfr
Copy link
Collaborator

Code LGTM.

But I have a doubt about limiting to the four &, <, > and ".
While I agree that the chars mostly encoded to not break html, we cannot assume we will not found other char html encoded.
In the same time the list of named character is pretty long, I agree we can wait &amp; see if we need it.

@kelson42 ?

Copy link
Contributor

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan Any reason why &apos; has not been implemented like described in the ticket? The probably of something like href='http://www.kiwix.org/j&apos;aime' seems quite high to me.

@kelson42 kelson42 self-requested a review November 14, 2023 19:30
Copy link
Contributor

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan LGTM (thx for your quick fix)

@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan Any reason why &apos; has not been implemented like described in the ticket? The probably of something like href='http://www.kiwix.org/j&apos;aime' seems quite high to me.

@kelson42 My fault. Fixed now.

@kelson42 kelson42 merged commit d406de4 into main Nov 15, 2023
12 of 13 checks passed
@kelson42 kelson42 deleted the better_link_extraction branch November 15, 2023 05:25
@veloman-yunkan
Copy link
Collaborator Author

@kelson42 Before merging, it always makes sense to have fixup commits (if any) squashed into their respective commits (via a rebase).

@kelson42
Copy link
Contributor

My bad, sorry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Zimcheck internal URL checking seems to ignore URLencoding AND HTML entities
3 participants