Improve filtering of EU cookie notices #35

Open
4 tasks
sylvinus opened this issue Mar 13, 2016 · 1 comment

Comments

@sylvinus
Contributor

Cookie notices are more of an annoyance than regular boilerplate because they usually appear on top of the page and may pollute the snippets.

Right now we have very basic code to filter some of them, but we could use some of the lists at https://filterlists.com/ to filter more of them.

One big issue is the format of these lists though: they use CSS selectors, sometimes as complex as cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']. We don't have a CSS selector engine at the moment and it's unclear if we could add one without a massive performance hit.

We may want to start by only using definitions by IDs and classes, which should take care of most cases.
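As a rough illustration of the "IDs and classes only" idea, here is a minimal sketch that parses Adblock-style element-hiding lines (`domains##selector`) and keeps only those whose selector is a single `#id` or `.class`. The names `SimpleRule` and `parse_simple_rule` are hypothetical and not part of the existing codebase. The sketch also assumes the lists follow the common `##` syntax and conservatively skips anything else (comments, exception rules, negated domains, complex selectors).

```python
import re
from collections import namedtuple

# Hypothetical record type for this sketch: which domains the rule applies to
# (empty tuple = all domains), whether it targets an id or a class, and the name.
SimpleRule = namedtuple("SimpleRule", ["domains", "kind", "name"])

# A selector made of exactly one "#id" or ".class" token, nothing else.
_SIMPLE_SELECTOR = re.compile(r"^([#.])(-?[A-Za-z_][A-Za-z0-9_-]*)$")


def parse_simple_rule(line):
    """Return a SimpleRule for a plain id/class hiding rule, else None."""
    line = line.strip()
    # Skip blanks, "!" comments, and lines without the "##" separator
    # (this also skips "#@#" exception rules, which do not contain "##").
    if not line or line.startswith("!") or "##" not in line:
        return None

    domains_part, selector = line.split("##", 1)
    match = _SIMPLE_SELECTOR.match(selector.strip())
    if match is None:
        # Anything more complex (combinators, attribute selectors, ...) is
        # ignored, since no CSS selector engine is available.
        return None

    domains = tuple(d.strip() for d in domains_part.split(",") if d.strip())
    # Conservative assumption: negated domains ("~example.com") are not handled.
    if any(d.startswith("~") for d in domains):
        return None

    kind = "id" if match.group(1) == "#" else "class"
    return SimpleRule(domains, kind, match.group(2))
```

With this approach the complex `cofunds.co.uk` rule quoted above is simply dropped, while a line such as `example.com##.cookie-notice` is kept as a class rule scoped to `example.com`.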

Rough todo list:

  • Decide which lists to use depending on license, maintenance and coverage
  • Write a script to download, parse and store them (in a rocksdb database, like we do in the urlserver?)
  • Write a test, ideally with a dummy list like the others in the tests/testdata directory
  • Implement in cosrlib/document/html/
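
For the test and implementation items, one possible shape is a matcher that only looks at an element's `id` and `class` attributes and checks them against sets built from a dummy list. This is a sketch under assumptions: `matches_simple_rules` and its arguments are hypothetical names, not an existing cosrlib API, and the dummy values below are made up for illustration. Set lookups keep the per-element cost roughly constant, which matters given the performance concern above.

```python
def matches_simple_rules(attrs, blocked_ids, blocked_classes):
    """Return True if an element's id or any of its classes is blocked.

    attrs: dict of the element's HTML attributes (e.g. {"id": ..., "class": ...}).
    blocked_ids / blocked_classes: sets of names taken from the filter lists.
    """
    element_id = attrs.get("id")
    if element_id and element_id in blocked_ids:
        return True

    # The class attribute is a whitespace-separated list of class names.
    classes = (attrs.get("class") or "").split()
    return any(name in blocked_classes for name in classes)


# Dummy list for a test (hypothetical values, not taken from a real list).
DUMMY_BLOCKED_IDS = {"cookie-banner"}
DUMMY_BLOCKED_CLASSES = {"cookie-notice", "eu-cookie-law"}

# Elements carrying a blocked id or class would be skipped like other boilerplate.
hit = matches_simple_rules(
    {"class": "footer cookie-notice"}, DUMMY_BLOCKED_IDS, DUMMY_BLOCKED_CLASSES
)
miss = matches_simple_rules(
    {"id": "main", "class": "article"}, DUMMY_BLOCKED_IDS, DUMMY_BLOCKED_CLASSES
)
```
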
@indolering

Shouldn't we be filtering ads entirely? Ads can be (and are) abused to manipulate search ranking.

> We don't have a CSS selector engine at the moment and it's unclear if we could add one without a massive performance hit.

In terms of performance issues, well, how are you planning on handling one-page-apps and other sites that require JS?

> Decide which lists to use depending on license, maintenance and coverage

I checked all of the major lists and most of the regional ones; most of them are under a CC or similar OSS license. A handful prohibit commercial use (which is fine since we are a non-profit). Many of them don't mention licenses or usage restrictions, but our usage should fall under "fair use". I've asked the FilterList maintainer to add a license attribute to the machine-readable list so that we can keep tabs on it.

The FilterList about page mentions something about checking for updates; I've requested more information about this feature.
