restricted summary stats #7619 #7642

Merged
merged 2 commits into from
Mar 2, 2021

pdurbin
Member

@pdurbin pdurbin commented Feb 26, 2021

What this PR does / why we need it:

In the future we want to offer differentially private summary statistics for restricted data, which means that restricted data should not expose full summary statistics (and related information) publicly. This pull request makes that change.

Which issue(s) this PR closes:

Closes #7619

Special notes for your reviewer:

Not particularly. See the release note and the code. I'm weirded out by the existing BadRequestException error handling in Access.java and WebApplicationExceptionHandler.java. I just accepted the pattern of putting a string in the former and the real message in the latter.

Suggestions on how to test this:

Please see the release note, especially:

  • At the dataset level, DDI exports no longer show "dataDscr" information for restricted files. There is only one version of this export and it is the version that's suitable for public consumption with the "dataDscr" information hidden for restricted files.
  • Similarly, at the dataset level, the DDI HTML Codebook no longer shows "dataDscr" information for restricted files.
  • At the file level, "dataDscr" information is no longer publicly available for restricted files. In practice, it was only possible to get this publicly via API (the download/access button was hidden).
  • At the file level, "dataDscr" (variable metadata) information can still be downloaded for restricted files if you have access to download the file.
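The dataset-level behavior described above can be sketched as a filter over the export. This is a minimal illustration, not the code in this pull request: the function name `redact_restricted_vars` is hypothetical, the sample XML is simplified and omits the DDI namespace, and it assumes each `var` under `dataDscr` points at its file through a `location` element with a `fileid` attribute.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Simplified, namespace-free stand-in for a DDI codebook (an assumption for
# illustration; a real export is namespaced and much larger).
SAMPLE_DDI = """<codeBook>
  <fileDscr ID="f1"><fileTxt><fileName>pets.tab</fileName></fileTxt></fileDscr>
  <fileDscr ID="f2"><fileTxt><fileName>public.tab</fileName></fileTxt></fileDscr>
  <dataDscr>
    <var ID="v1" name="pounds"><location fileid="f1"/></var>
    <var ID="v2" name="year"><location fileid="f2"/></var>
  </dataDscr>
</codeBook>"""


def redact_restricted_vars(ddi_xml: str, restricted_file_ids: Set[str]) -> str:
    """Return the codebook with variable entries for restricted files removed.

    File-level descriptions (fileDscr) are left untouched; only the
    variable-level entries under dataDscr are dropped.
    """
    root = ET.fromstring(ddi_xml)
    data_dscr = root.find("dataDscr")
    if data_dscr is not None:
        # Iterate over a copy because we remove children while looping.
        for var in list(data_dscr.findall("var")):
            location = var.find("location")
            if location is not None and location.get("fileid") in restricted_file_ids:
                data_dscr.remove(var)
        # If nothing is left, drop the now-empty dataDscr section entirely.
        if len(data_dscr) == 0:
            root.remove(data_dscr)
    return ET.tostring(root, encoding="unicode")


public_export = redact_restricted_vars(SAMPLE_DDI, {"f1"})
```

Because there is only one version of the export, a filter like this would run once at export time rather than per request; whether the real implementation works this way is not stated in this thread.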

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

For restricted files, the DDI HTML Codebook will no longer show summary statistics, variable names, or variable descriptions. Instead, you will see only the row and column counts, the file type, and the UNF, like this:

Screen Shot 2021-02-26 at 10 56 36 AM

Is there a release notes update needed for this change?:

Yes, included.

Additional documentation:

Documentation was actually removed because we used to warn people (pull request #6620) about summary statistics being publicly available for restricted data. Now we don't need to.

Contributor

@scolapasta scolapasta left a comment


Overall looks good. One thing that is worth discussing: I had originally assumed we were hiding summary stats, but it looks like it's the entire data description section, including variable names/labels. However, we still index them and make them searchable. Should we be hiding those in the index similarly? Or should we consider leaving them here and just removing summary stats? (If the former, should we consider it in scope or out of scope?)

@pdurbin
Member Author

pdurbin commented Feb 26, 2021

I had originally assumed we were hiding summary stats, but it looks like it's the entire data description section, including variable names/labels. However we still index them and make them searchable.

Good catch. Here's the data I'm testing with (these are my pets, yes my cats are huge)...

name,pounds,species
Marshall,40,dog
Tiger,17,cat
Panther,21,cat

... and variable names like "species" are still searchable:

Screen Shot 2021-02-26 at 2 32 44 PM
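To make the distinction under discussion concrete, here is a small sketch using the same test data. It separates variable names (which, per this thread, remain in the search index) from summary statistics (which this PR hides for restricted files). The function `describe_column` is hypothetical and is not how Dataverse computes its statistics.

```python
import csv
import io
import statistics

# The test data quoted in the comment above.
PETS_CSV = """name,pounds,species
Marshall,40,dog
Tiger,17,cat
Panther,21,cat
"""


def describe_column(csv_text, column):
    """Return (variable_names, summary) for one numeric column of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Variable names come from the header row, independent of the values.
    variable_names = list(reader.fieldnames or [])
    values = [float(row[column]) for row in rows]
    summary = {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }
    return variable_names, summary


names, pounds_summary = describe_column(PETS_CSV, "pounds")
```

Here `names` holds the three column headers and `pounds_summary` holds the kind of value-derived information that can leak details about a small restricted dataset, which is why the two are treated differently.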

@djbrooke
Contributor

djbrooke commented Mar 1, 2021

Thanks @pdurbin and @scolapasta. It's OK if we keep these in the index for now. We'll possibly need to revisit this as we support more sensitive data, but for the OpenDP MVP this will support the use case of not spoiling privacy (and making the tool useful in the first place :)). I'll move this to QA since it looks like the other feedback in the issue (regarding the move to bundle) is done.

@kcondon kcondon self-assigned this Mar 1, 2021
@kcondon
Contributor

kcondon commented Mar 2, 2021

@pdurbin Things I've noticed:

  1. When I first attempted to download the DDI HTML for a restricted file via the UI, I saw a 404 error. After a couple more attempts it worked. I'm not sure whether that is a timing issue in the export. DDI worked.
  2. I am not able to download variable metadata for a restricted file even though I pass the API key for the creator (admin). I am using the view file metadata endpoint.

Otherwise it appears to work as described. I can try testing 1. again.
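For anyone reproducing item 2, an authenticated request for file-level variable metadata could be built roughly as below. This is a sketch under assumptions: it targets the Data Access API path `/api/access/datafile/{id}/metadata/ddi` and sends the token in the `X-Dataverse-key` header, but the comment above does not say exactly which URL was used. The server URL, file id, and token are placeholders, and `build_variable_metadata_request` is a hypothetical helper.

```python
from urllib.request import Request


def build_variable_metadata_request(server_url, file_id, api_token):
    """Build (but do not send) a GET request for a file's DDI variable metadata."""
    url = f"{server_url.rstrip('/')}/api/access/datafile/{file_id}/metadata/ddi"
    # The API token identifies the user, so the server can check whether
    # that user is allowed to download the restricted file.
    return Request(url, headers={"X-Dataverse-key": api_token}, method="GET")


# Placeholder values for illustration only.
request = build_variable_metadata_request(
    "https://dataverse.example.edu/", 42, "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
)
# To actually send it: urllib.request.urlopen(request), then inspect the
# status code and body to see whether the metadata or an error is returned.
```

Comparing the response with and without the header, and with a token for a user who lacks download permission, would exercise the last two release-note bullets.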

@kcondon kcondon merged commit a08e315 into develop Mar 2, 2021
@kcondon kcondon deleted the 7619-restricted-summary-starts branch March 2, 2021 19:57
@djbrooke djbrooke added this to the 5.4 milestone Mar 3, 2021
Development

Successfully merging this pull request may close these issues.

No Summary Stats and Variable Names/Labels for Restricted Files
4 participants