In our recent call with Benedetto and Stinson from Census, they recommended reading Snoke and Slavković (2018): "pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity." This issue is to review relevant pieces of the paper regarding synthesis and evaluation.
Synthesis
The paper notes that a synthesis is differentially private if it is built from differentially private parameters (in this case, the regression coefficients, if I understand correctly), and proposes an adaptation of other methods that sample from the distribution, relaxing a boundedness assumption. It cites Bowen and Liu (2018), "Comparative Study of Differentially Private Data Synthesis Methods," which I think would help me follow their approach better.
Their synthesis approach appears to be limited to parametric models. If that's true, and Bowen and Liu's methods are also limited to parametric models, these other papers could be useful for our current nonparametric approaches:
Evaluation
To evaluate the quality of the synthesis, they propose stacking the synthesized and training sets, building a model to predict whether a record is synthesized, and summarizing those predicted probabilities as distances from 0.5:
The idea of training a classifier to distinguish synthesized data from real data is interesting; they use a CART model to do so.
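As a minimal sketch of this propensity-score idea (using `scikit-learn`'s `DecisionTreeClassifier` as the CART model; the toy data, tree depth, and equal set sizes are my assumptions, not from the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real training data and a synthesis of it.
real = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
synth = rng.normal(loc=0.1, scale=1.1, size=(500, 3))

# Stack the two sets and label each record by its origin (1 = synthesized).
X = np.vstack([real, synth])
y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])

# Fit a CART model to predict whether a record is synthesized.
# max_depth=4 is an arbitrary choice for illustration.
cart = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
propensities = cart.predict_proba(X)[:, 1]

# Summarize as mean squared distance of the propensities from c, where
# c is the share of synthesized records (0.5 when the sets are equal-sized).
c = len(synth) / len(X)
pmse = np.mean((propensities - c) ** 2)
print(round(pmse, 4))
```

A score near 0 suggests the classifier can't separate the two sets; larger values suggest the synthesis is distinguishable from the real data.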
I'm not sure the novel metric is necessary compared to established classification metrics like log-loss, and this in-sample approach could also overfit. If we wanted to apply this, I'd want to consider log-loss on a holdout set.
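The holdout alternative could look something like this (again with assumed toy data and an arbitrary tree depth; a log-loss near ln(2) ≈ 0.693 would mean the classifier does no better than chance at telling the sets apart):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)

# Hypothetical real and synthesized data, as before.
real = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
synth = rng.normal(loc=0.1, scale=1.1, size=(500, 3))

X = np.vstack([real, synth])
y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])

# Hold out a test set so the score reflects generalization, not memorization.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

cart = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)
p_te = cart.predict_proba(X_te)[:, 1]

# Evaluate the classifier on records it never saw during fitting.
ll = log_loss(y_te, p_te)
print(round(ll, 3))
```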