ContentExtractor: fix xpath query with concat/normalize #181

Kdecherf · 2018-12-02T18:16:46Z

The XPath query used to strip content from strip_id_or_class patterns may strip unrelated elements as it's not an exact match, e.g.:
<p class="p_related"> is incorrectly removed with strip_id_or_class: related

This PR fixes the query with concat and normalize-space as used broadly in site-config files.

coveralls · 2018-12-02T18:24:24Z

Coverage remained the same at 98.627% when pulling 01d6c36 on Kdecherf:fixstripid into cf56678 on j0k3r:master.

The XPath query used to strip content from strip_id_or_class patterns may strip unrelated elements as it's not an exact match, e.g.: <p class="p_related"> will be removed with strip_id_or_class: related This commit replaces this XPath query with concat and normalize-space as used broadly in site-config files. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>

j0k3r · 2018-12-04T14:15:26Z

I'm ok with that.
But I'm wondering if it's not intented to be sure to catch more stuff 🤔

Anyway, the bug also exist in @fivefilters FullTextRss https://bitbucket.org/fivefilters/full-text-rss/src/b57a2b7eedf3e507ffe30a39e122e3f0dafccb59/libraries/content-extractor/ContentExtractor.php?at=master&fileviewer=file-view-default#ContentExtractor.php-379

Kdecherf · 2018-12-04T14:20:25Z

To be honest I would expect strip_id_or_class to be a complete match and not a 'lazy' one

fivefilters · 2018-12-04T14:40:44Z

That directive actually comes from Instapaper. In the early days, Marco (who used to run it) had a database of community maintained custom extraction rules. You can still find the blog post here: http://blog.instapaper.com/post/730281947

In the post there's an example:

body_node = //div[@id = 'story']
strip_id_or_class_substring: 'related'
strip_id_or_class_substring: 'tools'

I'm pretty sure when it actually was opened up for the community, body_node became body and strip_id_or_class_substring became strip_id_or_class. Unfortunately I don't know if there's archives of the old page with more info on these directives, but my guess is it was simply a rename and the substring match was intended. Which is why we implemented it that way. So I don't consider it a bug.

Kdecherf · 2018-12-07T23:56:05Z

Thanks for your feedback @fivefilters,

While I can understand that a strip_id_or_class_substring was actually for doing a substring match, the fact that strip_id_or_class is having the same behavior could be a semantic issue in my opinion.

Kdecherf force-pushed the fixstripid branch from d5f9941 to 32d20c8 Compare December 2, 2018 18:38

Kdecherf force-pushed the fixstripid branch from 32d20c8 to 01d6c36 Compare December 2, 2018 19:35

j0k3r merged commit 3456815 into j0k3r:master Dec 4, 2018

Kdecherf deleted the fixstripid branch December 4, 2018 14:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ContentExtractor: fix xpath query with concat/normalize #181

ContentExtractor: fix xpath query with concat/normalize #181

Kdecherf commented Dec 2, 2018 •

edited

Loading

coveralls commented Dec 2, 2018 •

edited

Loading

j0k3r commented Dec 4, 2018

Kdecherf commented Dec 4, 2018

fivefilters commented Dec 4, 2018 •

edited

Loading

Kdecherf commented Dec 7, 2018

ContentExtractor: fix xpath query with concat/normalize #181

ContentExtractor: fix xpath query with concat/normalize #181

Conversation

Kdecherf commented Dec 2, 2018 • edited Loading

coveralls commented Dec 2, 2018 • edited Loading

j0k3r commented Dec 4, 2018

Kdecherf commented Dec 4, 2018

fivefilters commented Dec 4, 2018 • edited Loading

Kdecherf commented Dec 7, 2018

Kdecherf commented Dec 2, 2018 •

edited

Loading

coveralls commented Dec 2, 2018 •

edited

Loading

fivefilters commented Dec 4, 2018 •

edited

Loading