Filters
Filters are applied before and after readability tries to extract the main content, and can be used to improve or correct the detection on specific sites. In general the readability extraction algorithm works fairly well, but sometimes some kind of pre- or post-processing is unavoidable to fix false positives and other problems.
Feedability supports five different types of rules per URL pattern (a regular expression matched against the article URL). Each rule is applied, during either pre- or post-processing, to specific HTML elements. The elements are specified using jQuery selectors; documentation on them is available in the jQuery API documentation or at w3schools. The different types are:
Replace rules (within a pre or post group):
Use regular expressions to replace specific content. The replace argument supports placeholders; currently the only one supported is %{URL_Base}, which is replaced with the base URL of the article.
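The page does not show the exact shape of a replace rule, so the following is only a sketch: it assumes each replace rule is a pair of a regular expression and a replacement string, nested in the same structure as the remove/exclusive example below. The domain example.org is a placeholder. Such a pre-processing rule could rewrite relative image paths to absolute ones using the %{URL_Base} placeholder:

```json
"rules": {
  "example.org": {
    "pre": {
      "replace": [
        ["src=\"/images/", "src=\"%{URL_Base}/images/"]
      ]
    }
  }
}
```

Check the project's shipped filter configuration for the authoritative rule syntax before relying on this shape.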
Remove rules (within a pre or post group):
Remove rules strip specific elements from the article HTML that are known to cause false-positive matches of the main content by readability.
Exclusive rules (within a pre or post group):
Elements selected by exclusive rules replace the body of the document, so currently it only makes sense to specify a single element (this may change in the future).
Prepend rules (must be specified outside pre/post):
HTML selected by these rules is prepended to the final extracted text. This is useful for headers or trailers that are not included by readability.
Append rules (must be specified outside pre/post):
HTML selected by these rules is appended to the final extracted text.
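This page does not name the configuration keys for the two rule types that live outside pre/post. Purely as an illustration, assuming hypothetical keys "head" and "tail" that, like the other rule types, take a list of jQuery selectors, a filter keeping an article's title heading and footer could look like this (example.org and both key names are assumptions, not confirmed by this page):

```json
"rules": {
  "example.org": {
    "head": ["h1.entry-title"],
    "tail": [".article-footer"]
  }
}
```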
If you change the filter rules, you need to remove the *.rdby caching files so the new filters are applied to already-fetched articles. Example:
"rules": {
  "sixserv.org": {
    "pre": {
      "remove": ["#sidebar", ".commentlist"],
      "exclusive": ["#content"]
    }
  }
}
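Clearing the cache after a rule change can be done with a one-liner. The cache directory path used here is an assumption; adjust it to wherever your Feedability instance stores its *.rdby files:

```shell
# Delete cached *.rdby article files so already-fetched articles are
# re-processed with the updated filter rules.
# NOTE: "cache" is an assumed directory name; check your configuration.
rm -f cache/*.rdby
```

With -f, rm exits successfully even when no *.rdby files exist, so the command is safe to run on an empty cache.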
For a short tutorial on creating remove filter rules, see Filter-Tutorial.