
What implementation change, if any, do we need to make given the deprecation of the US Privacy String? #60

Closed
SebastianZimmeck opened this issue Jul 25, 2023 · 23 comments
Assignees
Labels
core functionality New big feature exploration Explore adding a feature etc. omnibus An issue that covers multiple connected (smaller) sub-issues

Comments

@SebastianZimmeck
Member

The US Privacy String (including the USPAPI) will be deprecated on September 30, 2023. The IAB is introducing its Global Privacy Platform (GPP), which is not to be confused with Global Privacy Control (GPC); the two are unrelated.

What would we need to change for our crawler to work in the GPP environment?

@katehausladen and @OliverWang13, can you look into that?

@SebastianZimmeck SebastianZimmeck added core functionality New big feature question Further information is requested labels Jul 25, 2023
@katehausladen
Collaborator

The short answer is that we will need to get the GPP string using the ping function from the CMP API. In the latest version of the CMP API (1.1), the ping function returns this information:
Screenshot 2023-07-28 at 9 23 11 PM
I think our extension should just get the string from the return object, and we can decode it in the analysis stage.
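A minimal sketch of that approach (the `fakeWindow` mock below is illustrative, not a real CMP; in v1.1 the string is read off the `PingReturn` object passed to the `ping` callback):

```javascript
// Sketch: read the GPP string from the PingReturn object that the v1.1
// CMP API passes to the 'ping' callback.
function getGppString(win, onResult) {
  if (typeof win.__gpp !== "function") {
    onResult(null); // no CMP on this page
    return;
  }
  win.__gpp("ping", (pingReturn, success) => {
    onResult(success && pingReturn ? pingReturn.gppString ?? null : null);
  });
}

// Illustrative stand-in for a page implementing CMP API v1.1:
const fakeWindow = {
  __gpp(command, callback) {
    if (command === "ping") {
      // "HEADER~SECTION" is a placeholder, not a real encoded GPP string
      callback({ gppVersion: "1.1", gppString: "HEADER~SECTION" }, true);
    }
  },
};

getGppString(fakeWindow, (s) => console.log(s)); // logs "HEADER~SECTION"
```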

Some more info on GPP:
GPP merges privacy regulations from different regions into one format. The US Privacy String (USPS), along with other regional strings, will be replaced by the GPP string, which can contain information from multiple regions at once. The format of the GPP string is described in the diagram below, taken from a webinar about it:

Screenshot 2023-07-31 at 10 35 27 AM

These are the current sections to choose from.

Screenshot 2023-08-01 at 11 26 02 PM

Section 6 (uspv1) is the USPS we are used to, which will be deprecated. The new US sections (7 through 12) contain much more information than the USPS (11 to 16 fields instead of the USPS's 4). The new fields are listed below:

| Field name | US sections with this field |
| --- | --- |
| Version | 7, 8, 9, 10, 11, 12 |
| SharingNotice | 7, 9, 10, 11, 12 |
| SaleOptOutNotice | 7, 8, 9, 10, 11, 12 |
| SharingOptOutNotice | 7, 8 |
| TargetedAdvertisingOptOutNotice | 7, 9, 10, 11, 12 |
| SensitiveDataProcessingOptOutNotice | 7, 11 |
| SensitiveDataLimitUseNotice | 7, 8 |
| SaleOptOut | 7, 8, 9, 10, 11, 12 |
| SharingOptOut | 7, 8 |
| TargetedAdvertisingOptOut | 7, 9, 10, 11, 12 |
| SensitiveDataProcessing | 7, 8, 9, 10, 11, 12 |
| KnownChildSensitiveDataConsents | 7, 8, 9, 10, 11, 12 |
| PersonalDataConsents | 7, 8 |
| MspaCoveredTransaction | 7, 8, 9, 10, 11, 12 |
| MspaOptOutOptionMode | 7, 8, 9, 10, 11, 12 |
| MspaServiceProviderMode | 7, 8, 9, 10, 11, 12 |
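For orientation, a decoded usnatv1 (section 7) payload might surface as a flat object keyed by these field names. The shape and all values below are illustrative placeholders, not output of any real decoder:

```javascript
// Illustrative only: one plausible decoded shape for a usnatv1 (section 7)
// payload, keyed by the 16 field names in the table above. Values are
// placeholders, not decoded from a real string.
const decodedSectionExample = {
  Version: 1,
  SharingNotice: 1,
  SaleOptOutNotice: 1,
  SharingOptOutNotice: 1,
  TargetedAdvertisingOptOutNotice: 1,
  SensitiveDataProcessingOptOutNotice: 1,
  SensitiveDataLimitUseNotice: 1,
  SaleOptOut: 2,
  SharingOptOut: 2,
  TargetedAdvertisingOptOut: 2,
  SensitiveDataProcessing: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  KnownChildSensitiveDataConsents: [0, 0],
  PersonalDataConsents: 0,
  MspaCoveredTransaction: 1,
  MspaOptOutOptionMode: 1,
  MspaServiceProviderMode: 2,
};
console.log(Object.keys(decodedSectionExample).length); // 16
```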

@SebastianZimmeck
Member Author

Excellent!

@katehausladen
Collaborator

I went through the sites in the paper that were compliant with GPC. From that list, these sites implemented GPP:
cnn.com
al.com
howstuffworks.com
mlive.com
nj.com
nydailynews.com
sciencedaily.com

Unfortunately, all of these sites use version 1.0 of the CMP API. In that version, the GPP string is accessed using getGPPData rather than ping. For example, cnn.com returns this for ping:

Screenshot 2023-08-02 at 10 39 46 AM

And this for getGPPData:

Screenshot 2023-08-02 at 10 55 59 AM

This complicates things on our end a bit, as we can't get the GPP string from the return object of ping until sites adopt version 1.1 of the CMP API.

First, I’m going to focus on successfully pinging the CMP API from the extension. Using this, I’ll try to identify more sites that implemented GPP (if they exist) and check which version of the CMP API they use. If we want to look at the contents of the GPP strings for sites that use version 1.0, I'll use getGPPData for now and then change to using ping once sites start to use version 1.1. The switch between these 2 functions should just be a name change.
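In code, the difference between the two access paths might look like this (a sketch with hypothetical helper names; it assumes v1.0 commands return values synchronously while v1.1 delivers results through callbacks):

```javascript
// Hypothetical helpers for the two CMP API versions:
//   v1.0 - getGPPData returns an object containing gppString
//   v1.1 - ping's callback receives a PingReturn containing gppString
function getGppStringV10(win) {
  const data = win.__gpp("getGPPData"); // v1.0: synchronous return value
  return data ? data.gppString ?? null : null;
}

function getGppStringV11(win, onResult) {
  win.__gpp("ping", (pingReturn, success) => { // v1.1: callback only
    onResult(success && pingReturn ? pingReturn.gppString ?? null : null);
  });
}

// Illustrative v1.0 stand-in (the string value is a placeholder):
const v10Window = {
  __gpp(command) {
    if (command === "getGPPData") return { gppString: "HEADER~USPV1" };
  },
};
console.log(getGppStringV10(v10Window)); // logs "HEADER~USPV1"
```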

@SebastianZimmeck
Member Author

SebastianZimmeck commented Aug 2, 2023

First, I’m going to focus on successfully pinging the CMP API from the extension.

If we want to look at the contents of the GPP strings for sites that use version 1.0, I'll use getGPPData for now and then change to using ping once sites start to use version 1.1. The switch between these 2 functions should just be a name change.

Yes, that is good.

Could we also test every site with both getGPPData and ping, or is that computationally intensive, too confusing code, ... some other disadvantage?

@katehausladen
Collaborator

Yes, that would also work.

@SebastianZimmeck
Member Author

Yes, that would also work.

OK, let's explore that as well then. Especially, if there is a gradual change over a longer period of time it may be worth it to look for both (for a certain time, at least).

@katehausladen
Collaborator

I think I may check the version using ping and then look for the GPP string in the appropriate place based on that result (i.e., look inside PingReturn for version 1.1 and check getGPPData for version 1.0). getGPPData should not exist as a function in version 1.1.

Does that sound ok? I'll go ahead and add columns for the GPP string before and after GPC.

@SebastianZimmeck
Member Author

I think I may check the version using ping and then ...

But wouldn't ping fail, i.e., return nothing at all in case of version 1.0? Or are you saying:

  1. Try ping
  2. If any value is returned, continue v1.1 logic; if nothing is returned, do getGPPData

@katehausladen
Collaborator

No, ping is a function in both versions, but the attributes it returns are different based on the version. The version is returned by both versions of ping. We use the version to determine where to look for the gppString. Basically, version 1.1 merges version 1.0's getGPPData and ping into one ping function.

Here's a visual that might help:
Screenshot 2023-08-03 at 4 39 45 PM
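A sketch of this branching (hypothetical helper; assumes v1.0's ping returns its PingReturn synchronously while v1.1's delivers it via the callback):

```javascript
// Sketch: determine the CMP API version via ping, then read the GPP string
// from the right place (PingReturn for v1.1, getGPPData for v1.0).
function extractGppString(win, onResult) {
  let answered = false;
  const ret = win.__gpp("ping", (pingReturn) => {
    answered = true; // v1.1: ping answers through the callback
    onResult(pingReturn ? pingReturn.gppString ?? null : null);
  });
  if (!answered && ret) {
    // v1.0: ping returned a PingReturn synchronously; the string itself
    // lives behind getGPPData, which does not exist in v1.1.
    const data = win.__gpp("getGPPData");
    onResult(data ? data.gppString ?? null : null);
  }
}
```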

@SebastianZimmeck
Member Author

Ah, very enlightening, indeed. :)

Yes, the ping approach makes sense to me.

katehausladen added a commit that referenced this issue Aug 25, 2023
-updating rest api
-adding gpp columns + injection script to check for gpp
-adding a counter so that the site can't run analysis or halt analysis twice. (some sites were loading multiple times which resulted in multiple entries)
@katehausladen
Collaborator

I crawled the last 774 sites of the initial set of ~2000 to test the last commit. It took ~32 s per site. The success rate was 98% once insecure certificates and error pages were excluded; I believe the remaining failures were due to sites not loading (either initially or on the reload). 74 sites in this set had GPP implemented.

Some details on the commit:
The version differences (v1.0 vs. v1.1) also affected the injected GPP script set in contentScripts.js.
Screenshot 2023-08-24 at 11 47 50 PM
As shown in the picture, all default GPP functions (including ping and getGPPData, the ones we are using) used return values in version 1.0 but now have callback functions in version 1.1. All v1.0 sites return values as expected, but some additionally execute callback functions and some don't.
So, sites fall into 3 categories:

  • v 1.1: callback only
  • v 1.0: executes callback and returns value
  • v 1.0: returns value only

The injection script in the commit accommodates all 3 categories. For sites that have implemented v1.1, calling a default function (i.e., ping or getGPPData) without a callback falls back to v1.0 behavior and returns a value. However, I wanted v1.1 to take precedence over v1.0, so I wrote the injection script such that a callback takes precedence over a return value. Eventually, when v1.0 stops being used, that injection script can be changed to rely exclusively on callbacks.
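One way to realize that precedence rule (a sketch with hypothetical names; the actual injection script in the commit may be structured differently):

```javascript
// Sketch: call a default GPP command with a callback AND capture its return
// value; if the callback fires, its result wins (v1.1, or v1.0 CMPs that
// execute callbacks); otherwise fall back to the return value (v1.0 only).
function callGppCommand(win, command, onResult) {
  let resolved = false;
  const deliver = (value) => {
    if (resolved) return; // report at most once
    resolved = true;
    onResult(value);
  };
  const returned = win.__gpp(command, (data, success) => {
    deliver(success === false ? null : data); // callback takes precedence
  });
  if (returned !== undefined) {
    deliver(returned); // return-value-only CMPs
  }
}
```

Caveat of this sketch: if a v1.0 CMP both returns a value and invokes its callback asynchronously, the return value is reported first here; the `resolved` guard mainly prevents double reporting.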

Another thing I added was a counter for how many times analysis was started and stopped for each domain. With the counter, we can prevent analysis from starting or stopping twice (or more) in a row for a particular domain. I added this because I was getting multiple entries for some sites: the sites were somehow loading more than once (or at least the event listener that responds to "load" events was being triggered more than once), which was triggering the extension to start or stop analysis more than once for a particular domain. (The load events happen close enough together that the extension has not yet updated the variable that stores whether analysis is running.)

Out of the 774 sites I crawled, these are the sites that loaded multiple times: https://www.furniturerow.com/, https://www.ultrasurfing.com/, https://www.wm.com/, https://www.fee.org/, https://www.ammoland.com/, https://www.upmc.com/. It seems rare enough that it could just be a site issue, and with the counter, analysis runs normally.
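The counter logic might be sketched as follows (hypothetical function and variable names, not the commit's actual code):

```javascript
// Sketch: per-domain start/stop counters so a duplicate "load" event can't
// start (or halt) analysis twice in a row for the same domain.
const runCounters = {}; // domain -> { starts, stops }

function shouldStartAnalysis(domain) {
  const c = (runCounters[domain] ??= { starts: 0, stops: 0 });
  if (c.starts > c.stops) return false; // analysis already running
  c.starts += 1;
  return true;
}

function shouldStopAnalysis(domain) {
  const c = (runCounters[domain] ??= { starts: 0, stops: 0 });
  if (c.stops >= c.starts) return false; // nothing running to stop
  c.stops += 1;
  return true;
}
```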

@SebastianZimmeck
Member Author

The success rate was 98%

The injection script in the commit accommodates all 3 categories.

With the counter, we can prevent analysis from starting or stopping twice (or more) in a row for a particular domain.

Excellent!

@katehausladen
Collaborator

The IAB has a website that encodes and decodes GPP strings. The site uses this JS/TS package to do the encoding and decoding. The full package cannot run locally on a computer, as it must be able to access a window object in the browser. Since the package uses an Apache 2.0 license, I decided to start from the library, remove all code we won't need (we only need to decode), and convert it to Python so it can easily integrate with our existing Colab notebooks. The Python code is now here in drive, and there is example usage inside the Processing_Analysis_Data Colab.
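As a first orientation for what the decoder has to handle: per the GPP spec, the string is a "~"-delimited list whose first segment is a header naming the section IDs that follow. The real field-level decoding is far more involved; this only shows the outermost layer:

```javascript
// Sketch of the outermost layer of GPP string structure: "~" separates the
// header segment from the payload sections. Decoding the base64url-encoded
// bit fields inside each segment is what the ported library does.
function splitGppString(gppString) {
  const [header, ...sections] = gppString.split("~");
  return { header, sections };
}

// The segment values here are placeholders, not a real encoded string:
console.log(splitGppString("HEADER~sectionA~sectionB"));
```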

@SebastianZimmeck SebastianZimmeck added exploration Explore adding a feature etc. and removed question Further information is requested labels Sep 12, 2023
@katehausladen
Collaborator

The last test I did on GPP (of ~700 sites) found 76 sites with GPP strings. The graph below shows how frequently various combinations of sections were implemented in the GPP strings I found.
Screenshot 2023-09-14 at 2 38 23 PM
As shown, there are a lot of sites in the 'none' category. This means that either the GPP string was empty or it contained only the TCF EU/Canada sections. Most sites implement some combination of uspv1 (the old US Privacy String) and usnatv1 (the union of all the individual states' fields, which covers all state regulations in one section).

As we discussed in the meeting, the states separate different types of information disclosure (i.e., Sale, Sharing, Targeted Advertising). CA's section uses Sale and Sharing, while all other states use Sale and Targeted Advertising. The national section includes all three. The values N/A, opted out, and not opted out correspond to the value inputs as described by the IAB:
Screenshot 2023-09-14 at 3 08 43 PM

I separated the different opt outs into different graphs, shown below.
Screenshot 2023-09-14 at 2 56 16 PM
Screenshot 2023-09-14 at 2 56 27 PM
Screenshot 2023-09-14 at 2 56 39 PM

@SebastianZimmeck
Member Author

This means that either the GPP string was empty or it only had the TCF EU/canada sections.

I am wondering if we should consider the TCF EU/Canada sections as well. I could imagine some sites saying, "if we are receiving a GPC (or other privacy preference signal), let's just opt out the user everywhere."

Most sites are implementing some combination of uspv1 (the old US privacy string) and usnatv1 (the union of all the fields of individual states, hence covering all state regulations by implementing this section).

What is the usvav1 string that 2/76 sites implement?

As we discussed in the meeting, the states separate different types of information disclosure (i.e. Sale, Sharing, Targeted Advertising).

So, there is no "profiling" despite being defined in some laws?

I separated the different opt outs into different graphs, shown below.

Nice analysis!

@SebastianZimmeck SebastianZimmeck changed the title What change, if any, do we need to make given the deprecation of the US Privacy String? What implementation change, if any, do we need to make given the deprecation of the US Privacy String? Sep 19, 2023
@SebastianZimmeck SebastianZimmeck added the omnibus An issue that covers multiple connected (smaller) sub-issues label Sep 19, 2023
@katehausladen
Collaborator

katehausladen commented Sep 19, 2023

I am wondering if we should consider the TCF EU/Canada sections as well. I could imagine some sites saying, "if we are receiving a GPC (or other privacy preference signal), let's just opt out the user everywhere."

I can work on that.

What is the usvav1 string that 2/76 sites implement?

CA, CO, CT, UT, and VA each have their own section with fields specific to that state's laws; usvav1 is the section for Virginia. Most sites just chose to implement usnatv1 because it covers all state laws, as usnatv1 is the union of the fields of all the state sections.

So, there is no "profiling" despite being defined in some laws?

No, there is not any field for profiling. Here is a complete list of fields.

@katehausladen
Collaborator

katehausladen commented Sep 20, 2023

TCF EU/Canada are now included in the decoding process. The updated code is in the drive.
Here is the revised version of the pie chart. It looks like the old 'uspv1 only' category has split in half, with half of those sites implementing TCF EU and TCF Canada as well.
Screenshot 2023-09-20 at 4 00 10 PM

@SebastianZimmeck
Member Author

Thanks, @katehausladen!

And, to clarify, we have the logic for capturing the TCF string in the crawler, right?

@katehausladen
Collaborator

Currently, the crawler gets TCF strings via the GPP string (if those sections are included in the GPP string).

@SebastianZimmeck
Member Author

Ah, yes, right, that is good!

@SebastianZimmeck
Member Author

No, there is not any field for profiling.

There may be no profiling field because the opt-out right may not cover profiling, which may be defined in the laws for other reasons, or, at least, may not cover opting out via privacy preference signals. We would need to check the laws. (#59)

It may also simply be the case that, as GPP is under development, the IAB has not yet added profiling.

katehausladen added a commit that referenced this issue Oct 3, 2023
removing excess files etc
@katehausladen katehausladen mentioned this issue Oct 3, 2023
@OliverWang13
Collaborator

This has been merged and completed.
