Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paper review: "pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity" #33

Open
MaxGhenis opened this issue Dec 29, 2018 · 0 comments

Comments

@MaxGhenis
Copy link
Collaborator

MaxGhenis commented Dec 29, 2018

In our recent call with Benedetto and Stinson from Census, they recommended reading Snoke and Slavković (2018): "pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity." This issue is to review relevant pieces of the paper regarding synthesis and evaluation.

Synthesis

The paper notes that a synthesis is differentially private if it's built from differentially private parameters (in this case regression coefficients IIUC), and proposes an adaptation of other methods that sample from the distribution, relaxing a boundedness assumption. It cites Bowen and Liu (2018) "Comparative Study of Differentially Private Data Synthesis Methods", which I think would help me follow their approach better.

Their synthesis approach appears to be limited to parametric models; in case that's true and Bowen and Liu are also limited to parametric models, these other papers could be useful for our current nonparametric approaches:

Evaluation

To evaluate the quality of the synthesis, they propose stacking the synthesis and training sets, building a model to predict whether a record is synthesized, and summarize those probabilities as distances from 0.5:
image

The idea of distinguishing synthesized data from real data is interesting, and they use a CART model to do so.

I'm not sure how necessary the novel metric is, compared to established classification metrics like log-loss. This in-sample approach could also overfit. If we wanted to apply this, I'd want to consider log-loss on a holdout set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant