Community learning discussion draft #2715

Open
wants to merge 9 commits into
base: master
from

bcyphers
Contributor

#1299

This is the first stab at adding community learning to Privacy Badger -- giving users the option to share de-identified information with EFF about the trackers they encounter. Users can now opt in to community learning. Then, whenever Privacy Badger observes a tracker for the first time, it will share the following information:

{
  page_host: full domain of the page that the user is on, 
  tracker_host: full domain of the third-party tracker, 
  tracking_action: type of action that Privacy Badger observed, one of ("cookie", "supercookie", "fingerprint").
}
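
As a concrete illustration of the report format above, here is a minimal sketch of how one report might be built and serialized. `buildReport` is a hypothetical helper, not a function from this PR, and the example hosts are simply borrowed from the sample table further down.

```javascript
// Hypothetical helper (not from the PR): builds the object that would be
// shared for one newly observed tracking action, using the three fields
// listed in the description.
function buildReport(pageHost, trackerHost, trackingAction) {
  return {
    page_host: pageHost,             // full domain of the page the user is on
    tracker_host: trackerHost,       // full domain of the third-party tracker
    tracking_action: trackingAction, // "cookie", "supercookie", or "fingerprint"
  };
}

// Serialize one report as it might appear in a request body.
const exampleBody = JSON.stringify(
  buildReport("docs.python.org", "adservice.google.com", "cookie")
);
```

Nothing else is attached to the report in this sketch: no user ID, no full URL, and no timestamp from the client.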

Other things:

  • This is strictly opt-in, and no new data will be shared until a user checks the "community learning" box on the options page.
  • Community learning is separate from local learning: both can be enabled or disabled independently.
  • Community learning will not occur in incognito windows, even if local learning is enabled in incognito windows.
  • A random sample of new tracking actions will be reported. The proportion of new trackers reported is controlled by constants.CL_PROBABILITY. Right now it's set to 1.0, meaning every action will be reported.
  • To reduce network load, tracking actions will be reported after a random delay. The length of this delay is controlled by constants.MAX_CL_WAIT_TIME, currently set to 5 minutes.
  • Only trackers which are not already blocked or cookieblocked will be reported. Furthermore, only tracker/page pairs that are not already in the snitch map will be reported. If evil.com has a snitch map entry logging it tracking on example.com in the past, but evil.com is not yet blocked, the evil.com/example.com tracking action will not be reported.
  • To prevent duplicate reports, tracking actions that do get reported will be stored in an in-memory cache, HeuristicBlocker.previouslySharedTrackers. Tracking actions that are already present in the cache will not be reported again. Obviously, when the user restarts their browser, the cache will be cleared, but the goal is just to prevent crazy amounts of dupes. The size of the cache is capped by constants.CL_CACHE_SIZE.
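
The sampling, de-duplication, and delay rules in the list above could fit together roughly as follows. This is a sketch under stated assumptions, not the code in this PR: `planReport`, `rememberShared`, and the `previouslyShared` set are hypothetical names, the cache size of 3 is made up for illustration, and evicting the oldest entry first is an assumption (the description only says the cache is capped). The snitch map and incognito checks are left out.

```javascript
// Stand-ins for the constants named in the description.
const constants = {
  CL_PROBABILITY: 1.0,             // share every new tracking action
  MAX_CL_WAIT_TIME: 5 * 60 * 1000, // five minutes, in milliseconds
  CL_CACHE_SIZE: 3,                // assumed value, tiny so eviction is easy to see
};

// In-memory record of recently shared actions. A Set iterates in insertion
// order, so its first value is always the oldest entry.
const previouslyShared = new Set();

// Add a key to the cache, evicting the oldest entry when the cap is reached
// (first-in, first-out eviction is an assumption of this sketch).
function rememberShared(key) {
  if (previouslyShared.size >= constants.CL_CACHE_SIZE) {
    const oldest = previouslyShared.values().next().value;
    previouslyShared.delete(oldest);
  }
  previouslyShared.add(key);
}

// Decide whether a tracking action should be reported.
// Returns the delay in milliseconds before sending, or null for "do not send".
// `random` is injectable so the logic can be exercised deterministically.
function planReport(pageHost, trackerHost, trackerType, random = Math.random) {
  // Random sampling: only a CL_PROBABILITY share of actions is reported.
  if (random() >= constants.CL_PROBABILITY) {
    return null;
  }
  // Skip anything already shared during this browser session.
  const key = pageHost + '+' + trackerHost + '+' + trackerType;
  if (previouslyShared.has(key)) {
    return null;
  }
  rememberShared(key);
  // Spread reports out over a random delay to reduce network load.
  return Math.floor(random() * constants.MAX_CL_WAIT_TIME);
}

// Usage idea: const delay = planReport("example.com", "evil.com", "cookie");
// if (delay !== null) { setTimeout(sendReport, delay); } // sendReport is hypothetical
```

Because `Math.random()` returns a value in [0, 1), a `CL_PROBABILITY` of 1.0 means the sampling check never rejects an action.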

This PR makes requests to "localhost:8080". If you want, you can set up a toy SQL server there to test the logging.

On the server side, tracking actions will be stored in a SQL table that looks like this:

+----+---------------------+-----------------------------+-----------------------------+--------------+
| id | time                | page_host                   | tracker_host                | tracker_type |
+----+---------------------+-----------------------------+-----------------------------+--------------+
|  1 | 2020-10-28 10:38:48 | docs.python.org             | www.google.com              | cookie       |
|  2 | 2020-10-28 10:39:12 | docs.python.org             | adservice.google.com        | cookie       |
...
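
For local testing, a stand-in for the logging server might look like the sketch below. It is not the real server: it swaps the SQL table for an in-memory array with the same columns, and `storeReport`, `startToyServer`, and `rows` are hypothetical names. It also assumes the client sends the action type under `tracking_action` (as in the report format above) and stores it in the `tracker_type` column. Only the parsed request body is touched, so no IP address or header is recorded.

```javascript
// In-memory stand-in for the SQL table; each row mirrors the table's columns.
const rows = [];

// Append one report as a row with an auto-incrementing id and a UTC timestamp
// formatted like the sample table ("YYYY-MM-DD HH:MM:SS").
function storeReport(table, report, now = new Date()) {
  const row = {
    id: table.length + 1,
    time: now.toISOString().slice(0, 19).replace("T", " "),
    page_host: String(report.page_host),
    tracker_host: String(report.tracker_host),
    tracker_type: String(report.tracking_action),
  };
  table.push(row);
  return row;
}

// Start a minimal HTTP server that accepts JSON POST bodies.
// It is not started automatically; call startToyServer() to run it.
async function startToyServer(port = 8080) {
  const http = await import("node:http");
  const server = http.createServer((req, res) => {
    if (req.method !== "POST") {
      res.writeHead(405);
      res.end();
      return;
    }
    let body = "";
    req.on("data", (chunk) => { body += chunk; });
    req.on("end", () => {
      try {
        const row = storeReport(rows, JSON.parse(body));
        res.writeHead(200, { "Content-Type": "application/json" });
        res.end(JSON.stringify(row));
      } catch (err) {
        // Malformed JSON or a non-object body: reject without storing anything.
        res.writeHead(400);
        res.end();
      }
    });
  });
  server.listen(port, "localhost");
  return server;
}
```

Calling `startToyServer()` under Node.js and then browsing with community learning enabled should fill `rows` with entries shaped like the table above.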

The server side is more up in the air, but for now, IP addresses and other identifying information will not be stored. Before this is deployed, we'll have some kind of privacy policy explaining exactly how logs etc. will be handled. We're still thinking through how best to prevent malicious reporting or other kinds of griefing.

We're thinking that the first version of community learning will be for informational purposes only -- the CL database will not automatically populate any block lists. This should reduce the incentive for bad actors to populate the database with bad data. We want to use the data to identify deficiencies in Badger Sett: What domains are we missing? On which domains do we need to crawl more widely? Are there places where trackers don't appear unless a user logs in?

Eventually, it might make sense to auto-generate a community learning list, but for now, we just want to see what kind of data people are generating.

Feedback, criticism, and concerns are welcome. What do you think?

…is enabled

- Create new SettingsMap variable, shareLearning, and default to false
- pass tab_id to _recordPrevalence so that it can determine whether learning is enabled on a tab
- update isLearningEnabled so that it returns true if either local or community learning is enabled;
  let _recordPrevalence figure out which kind of learning should be done
- Create stub for function to share data with remote server
Since _recordPrevalence now checks for local learning enabled internally,
update BadgerStorage.merge() to modify the blocklist directly.
update some comments
@bcyphers added the "enhancement" and "privacy" labels on Nov 12, 2020
@ablanathtanalba
Contributor

This is great! I really appreciate how well thought out it is from the get-go.
Here are some of my first thoughts while looking through this.

Starting at a high level, I definitely agree that this DB should, for now, be used just for observational purposes, and no block lists should be made from it... yet.

@@ -173,6 +173,14 @@
"message": "Learn in Private/Incognito windows",
"description": "Checkbox label on the general settings page"
},
"options_community_learning_setting": {
"message": "Enable community learning and share data about trackers",
Contributor

I think changing this message to "Enable community learning and share data about trackers" would answer some questions that users might have without them having to read the lengthier warning message (which is great and should still be included in addition to this).

if (Math.random() < constants.CL_PROBABILITY) {
// check if we've shared this tracker recently
// note that this check comes after checking against the snitch map
let tr_str = page_host + '+' + tracker_host + '+' + tracker_type;
Contributor

Adding some string sanitization and/or sanity checks on the input values in this method wouldn't hurt, since they're about to be sent off in a POST request.
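
One possible shape for such a check is sketched below. It is a sketch only: `isValidHost`, `isValidReport`, and the hostname pattern are assumptions of this example rather than existing Privacy Badger helpers, and the pattern is deliberately conservative (for instance, it rejects underscores).

```javascript
// Action types named in the PR description.
const ALLOWED_ACTIONS = new Set(["cookie", "supercookie", "fingerprint"]);

// Conservative hostname pattern (an assumption of this sketch):
// dot-separated labels of letters, digits, and hyphens, each label starting
// and ending with a letter or digit, at most 253 characters in total.
const HOSTNAME_RE =
  /^(?=.{1,253}$)[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?)*$/i;

// True only for strings that look like a bare hostname (no scheme, path, or port).
function isValidHost(host) {
  return typeof host === "string" && HOSTNAME_RE.test(host);
}

// True only if all three values are safe to place in a report.
function isValidReport(pageHost, trackerHost, trackerType) {
  return isValidHost(pageHost) &&
    isValidHost(trackerHost) &&
    ALLOWED_ACTIONS.has(trackerType);
}
```

A side effect of this pattern is that a valid host can never contain `+`, so the `page_host + '+' + tracker_host + '+' + tracker_type` cache key above stays unambiguous for inputs that pass the check.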

"message": "Enable community learning and share data about trackers",
"description": "Checkbox label on the general settings page"
},
"options_community_learning_warning": {
Contributor

Should the link that explains what community learning is (and the decision making that led to it) be included in this message or the one above?

@@ -40,6 +40,17 @@ var exports = {
TRACKING_THRESHOLD: 3,
MAX_COOKIE_ENTROPY: 12,

// The max amount of time (in milliseconds) that PB will wait before sharing a
// tracking action with EFF for community learning
MAX_CL_WAIT_TIME: 5 * 60 * 1000, // five minutes
Contributor

Good call on reducing network load with the reporting timeouts, but I'm curious: why 5 minutes?

Contributor Author

completely arbitrary!

Labels
enhancement, privacy (General privacy issues; stuff that isn't about Privacy Badger's heuristic)
3 participants