Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ContentExtractor: fix xpath query with concat/normalize #181

Merged
merged 1 commit into from
Dec 4, 2018

Conversation

Kdecherf
Copy link
Collaborator

@Kdecherf Kdecherf commented Dec 2, 2018

The XPath query used to strip content from strip_id_or_class patterns may strip unrelated elements as it's not an exact match, e.g.:
<p class="p_related"> is incorrectly removed with strip_id_or_class: related

This PR fixes the query with concat and normalize-space as used broadly in site-config files.

@coveralls
Copy link

coveralls commented Dec 2, 2018

Coverage Status

Coverage remained the same at 98.627% when pulling 01d6c36 on Kdecherf:fixstripid into cf56678 on j0k3r:master.

The XPath query used to strip content from strip_id_or_class patterns
may strip unrelated elements as it's not an exact match, e.g.:
<p class="p_related"> will be removed with strip_id_or_class: related

This commit replaces this XPath query with concat and normalize-space as
used broadly in site-config files.

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
@j0k3r
Copy link
Owner

j0k3r commented Dec 4, 2018

I'm ok with that.
But I'm wondering if it's not intented to be sure to catch more stuff 🤔

Anyway, the bug also exist in @fivefilters FullTextRss https://bitbucket.org/fivefilters/full-text-rss/src/b57a2b7eedf3e507ffe30a39e122e3f0dafccb59/libraries/content-extractor/ContentExtractor.php?at=master&fileviewer=file-view-default#ContentExtractor.php-379

@Kdecherf
Copy link
Collaborator Author

Kdecherf commented Dec 4, 2018

To be honest I would expect strip_id_or_class to be a complete match and not a 'lazy' one

@j0k3r j0k3r merged commit 3456815 into j0k3r:master Dec 4, 2018
@Kdecherf Kdecherf deleted the fixstripid branch December 4, 2018 14:35
@fivefilters
Copy link

fivefilters commented Dec 4, 2018

That directive actually comes from Instapaper. In the early days, Marco (who used to run it) had a database of community maintained custom extraction rules. You can still find the blog post here: http://blog.instapaper.com/post/730281947

In the post there's an example:

body_node = //div[@id = 'story']
strip_id_or_class_substring: 'related'
strip_id_or_class_substring: 'tools'

I'm pretty sure when it actually was opened up for the community, body_node became body and strip_id_or_class_substring became strip_id_or_class. Unfortunately I don't know if there's archives of the old page with more info on these directives, but my guess is it was simply a rename and the substring match was intended. Which is why we implemented it that way. So I don't consider it a bug.

@Kdecherf
Copy link
Collaborator Author

Kdecherf commented Dec 7, 2018

Thanks for your feedback @fivefilters,

While I can understand that a strip_id_or_class_substring was actually for doing a substring match, the fact that strip_id_or_class is having the same behavior could be a semantic issue in my opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants