Filter Tutorial

4poc edited this page Feb 17, 2011 · 7 revisions

The filter strips specific elements from the article HTML that are known to make Readability mistake them for the main content (false positives).

1. Detect a False Positive

If a feedability feed shows the wrong article text, for instance a comment, part of the navigation, or some other section of the website, you have found a Readability false positive. This can happen, for example, when the article text is not the largest coherent section of the page.

An easy way to fix this is to remove the specific elements on the page that distract Readability.

2. Select Elements

Feedability uses jQuery selectors to select the elements to be removed. They are a very powerful tool: the syntax is based on CSS selectors with a few jQuery-specific extensions, and they serve a purpose similar to XPath. You can find documentation about selectors at the jQuery API Documentation or at w3schools.

To find the right elements, I use Firebug to view the structure of the page on which I encountered the false positive.

  1. First manually visit the article site and use the original Readability bookmarklet to verify the problem.
  2. Return to the original page by opening it up again.
  3. Now mark some of the text that is mistakenly detected as the main content. Open the context menu and select Inspect Element. Firebug should open up.
  4. Look around, highlight elements, and navigate the HTML tree and CSS classes. There are many ways to specify the element that is causing the problem; no sure formula exists. In this case, the most promising way to select the comment section that is distracting Readability is the CSS class named “commentlist”, which contains all article comments on the page.
  5. Specifying a CSS class as a jQuery selector is easy as pie: .commentlist (side note: you can select element IDs using #id). In fact, Firebug already shows the correct syntax.
  6. You can also right-click the element (<ol class="commentlist">) in the HTML view and select Delete Element. After that, run the Readability bookmarklet again to verify that it can now extract the main content.

3. Specify Own Filter Rules

To put it all together, we create a user_settings.json file that contains our own filter rules:

{
  "filter": {
    "jquery_filters": {
      "sixserv.org": [".commentlist"]
    }
  }
}

Now the filter is applied only to article sites that match the regular expression sixserv.org¹.
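To illustrate how such a rule set could be evaluated, here is a minimal sketch. It is not feedability's actual code: the helper name selectorsForUrl is hypothetical, and it assumes that each key under jquery_filters is compiled with new RegExp() and tested against the article URL.

```javascript
// Hypothetical helper (not part of feedability): collects the selectors of
// every rule whose key, interpreted as a regular expression, matches the URL.
function selectorsForUrl(jqueryFilters, url) {
  const selectors = [];
  for (const [pattern, rules] of Object.entries(jqueryFilters)) {
    if (new RegExp(pattern).test(url)) {
      selectors.push(...rules);
    }
  }
  return selectors;
}

// The same settings as in user_settings.json above.
const settings = JSON.parse(`{
  "filter": {
    "jquery_filters": {
      "sixserv.org": [".commentlist"]
    }
  }
}`);
const filters = settings.filter.jquery_filters;

// Matches: the URL contains "sixserv.org".
console.log(selectorsForUrl(filters, "http://sixserv.org/2011/some-article/"));

// No match: an empty list means nothing would be removed.
console.log(selectorsForUrl(filters, "http://example.com/"));

// Also matches, because an unescaped "." stands for any single character.
console.log(selectorsForUrl(filters, "http://sixservXorg.example/"));

// A stricter pattern escapes the dot. Inside a JSON file the key would be
// written as "sixserv\\.org", which is the same string as this JS literal.
const strict = { "sixserv\\.org": [".commentlist"] };
console.log(selectorsForUrl(strict, "http://sixservXorg.example/"));
```

Under these assumptions, the first and third calls yield [".commentlist"], while the second and fourth yield an empty list.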

Please note that if you change the filter rules, you need to remove the *.rdby cache files so that the new filters are applied to already fetched articles.
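Clearing the cache can be scripted. The following is only a sketch: the function name clearRdbyCache is hypothetical, and the location of the cache directory is an assumption, since this page does not say where feedability stores its *.rdby files. Pass in whatever directory your installation uses.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: deletes every regular file ending in ".rdby" that sits
// directly inside cacheDir and returns the names of the removed files.
function clearRdbyCache(cacheDir) {
  const removed = [];
  for (const entry of fs.readdirSync(cacheDir, { withFileTypes: true })) {
    if (entry.isFile() && entry.name.endsWith(".rdby")) {
      fs.unlinkSync(path.join(cacheDir, entry.name));
      removed.push(entry.name);
    }
  }
  return removed;
}

// Example call; "./cache" is a placeholder path, not a documented default:
// console.log(clearRdbyCache("./cache"));
```

Other files in the directory are left untouched, so the articles are simply fetched and filtered again on the next request.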

¹ I know I did not escape the dot, so it matches any single character here.
