Decoding of HTML entities in links #383

veloman-yunkan · 2023-11-14T13:35:36Z

Fixes #378

URI-decoding of links extracted from the HTML was already performed, so I only had to add handling of HTML entities. I did so only for the syntactically important characters &, <, > and ".
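
As an illustration of the entity handling described above, here is a minimal C++ sketch. It is not the decodeHtmlEntities() added by this PR: the function name `decodeBasicHtmlEntities` and the lookup-table layout are assumptions made for the example. It makes a single left-to-right pass and replaces only `&amp;`, `&lt;`, `&gt;` and `&quot;`.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sketch (not the PR's implementation): decodes only the four
// named entities listed above and copies everything else through unchanged.
std::string decodeBasicHtmlEntities(const std::string& in)
{
  struct Entity { const char* name; char value; };
  static const Entity entities[] = {
    {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'},
  };

  std::string out;
  out.reserve(in.size());
  std::size_t i = 0;
  while (i < in.size()) {
    bool matched = false;
    if (in[i] == '&') {
      for (const Entity& e : entities) {
        const std::size_t len = std::strlen(e.name);
        // compare() clips the substring at the end of the input, so a
        // truncated entity at the very end simply fails to match.
        if (in.compare(i, len, e.name) == 0) {
          out += e.value;
          i += len;
          matched = true;
          break;
        }
      }
    }
    if (!matched) {
      out += in[i++];  // ordinary character or unknown entity: copy as-is
    }
  }
  return out;
}
```

Because decoded output is never rescanned, `&amp;lt;` becomes `&lt;` rather than `<`, and unknown entities such as `&nbsp;` pass through unchanged.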

generic_getLinks() doesn't decode HTML entities. Besides, it doesn't parse HTML and therefore may extract false links.
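
To make the "false links" concern concrete, the sketch below assumes a purely pattern-based extractor. The function `naiveGetLinks` and its regular expression are hypothetical and are not taken from generic_getLinks(); they only show how matching `href="..."` without parsing the markup picks up text that is not a real link.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical pattern-based extractor, for illustration only.
// It returns the raw attribute value of every href="..." occurrence,
// regardless of where in the document the text appears.
std::vector<std::string> naiveGetLinks(const std::string& html)
{
  static const std::regex hrefRe("href\\s*=\\s*\"([^\"]*)\"");
  std::vector<std::string> links;
  for (std::sregex_iterator it(html.begin(), html.end(), hrefRe), end;
       it != end; ++it) {
    links.push_back((*it)[1].str());
  }
  return links;
}
```

Given a page that contains one real anchor, one anchor inside an HTML comment, and the literal text `href="foo"` in a paragraph, this function reports all three, and the real link still carries its undecoded `&amp;`.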

Current version of normalize_link() discards the query and/or fragment components of a URL.
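
The effect of discarding the query and fragment can be sketched as follows. The helper `stripQueryAndFragment` is a hypothetical name for this example; how normalize_link() actually does it is not shown in this conversation.

```cpp
#include <string>

// Hypothetical helper: keeps only the part of a link that precedes the
// first '?' (query) or '#' (fragment). If neither is present,
// find_first_of() returns npos and substr() returns the whole string.
std::string stripQueryAndFragment(const std::string& link)
{
  return link.substr(0, link.find_first_of("?#"));
}
```

For example, `page.html?x=1#top` and `page.html#top` both reduce to `page.html`.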

codecov · 2023-11-14T13:37:50Z

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (b8a0a4c) 27.49% compared to head (a9d0e1c) 28.06%.

Files	Patch %	Lines
src/tools.cpp	56.25%	5 Missing and 9 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #383      +/-   ##
==========================================
+ Coverage   27.49%   28.06%   +0.57%     
==========================================
  Files          26       26              
  Lines        2550     2576      +26     
  Branches     1356     1371      +15     
==========================================
+ Hits          701      723      +22     
- Misses       1368     1369       +1     
- Partials      481      484       +3


mgautierfr · 2023-11-14T16:22:03Z

Code LGTM.

But I have a doubt about limiting this to the four characters &, <, > and ".
While I agree that these characters are mostly encoded so as not to break the HTML, we cannot assume that we will not find other HTML-encoded characters.
At the same time, the list of named characters is pretty long, so I agree we can wait and see if we need it.

@kelson42 ?

kelson42

@veloman-yunkan Any reason why ' has not been implemented as described in the ticket? The probability of something like href='http://www.kiwix.org/j'aime' seems quite high to me.

kelson42

@veloman-yunkan LGTM (thx for your quick fix)

veloman-yunkan · 2023-11-14T19:32:05Z

> @veloman-yunkan Any reason why ' has not been implemented as described in the ticket? The probability of something like href='http://www.kiwix.org/j'aime' seems quite high to me.

@kelson42 My fault. Fixed now.

veloman-yunkan · 2023-11-15T07:59:34Z

@kelson42 Before merging, it always makes sense to have fixup commits (if any) squashed into their respective commits (via a rebase).

kelson42 · 2023-11-15T08:58:18Z

My bad, sorry

veloman-yunkan added 7 commits November 14, 2023 14:15

Slightly more meaningful tools.getLinks unit test

3f805c5

More readable tools.getLinks unit-test

0c0a5d1

Demonstrating shortcomings of generic_getLinks()

df1d32b

generic_getLinks() doesn't decode HTML entities. Besides, it doesn't parse HTML and therefore may extract false links.

Enter decodeHtmlEntities()

d4a0c13

generic_getLinks() decodes HTML entities

8542b7d

Testing of URI-decoding by normalize_link()

6f5bd55

Removed obsolete code from normalize_link()

56b7ce5

Current version of normalize_link() discards the query and/or fragment components of a URL.

veloman-yunkan requested review from mgautierfr and kelson42 November 14, 2023 13:35

kelson42 reviewed Nov 14, 2023

veloman-yunkan added 3 commits November 14, 2023 22:56

fixup! Enter decodeHtmlEntities()

73bd37a

fixup! Testing of URI-decoding by normalize_link()

b69ff3a

Increasing the coverage of generic_getLinks()

64fd8e6

kelson42 self-requested a review November 14, 2023 19:30

kelson42 approved these changes Nov 14, 2023

fixup! Increasing the coverage of generic_getLinks()

a9d0e1c

kelson42 merged commit d406de4 into main Nov 15, 2023
12 of 13 checks passed

kelson42 deleted the better_link_extraction branch November 15, 2023 05:25
