pategan

Codebase for "PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees"

Authors: James Jordon, Jinsung Yoon, Mihaela van der Schaar

Reference: James Jordon, Jinsung Yoon, Mihaela van der Schaar, "PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees," International Conference on Learning Representations (ICLR), 2019.

Paper Link: https://openreview.net/forum?id=S1zk9iRqF7

Contact: jsyoon0823@gmail.com

This directory contains an implementation of the PATEGAN framework for generating synthetic data.

To run the training and evaluation pipeline for the PATEGAN framework, run python3 main_pategan_experiment.py.

Note that hyper-parameter tuning is necessary for different datasets.

Code explanation

(1) data_generator.py

  • Generate training and test data to evaluate the PATEGAN framework

(2) utils.py

  • Define various supervised models such as logistic regression
  • Return AUC and APR as the metrics

(3) pate_gan.py

  • Main PATEGAN framework
  • Return the synthetically generated data

(4) main_pategan_experiment.py

  • Report the prediction performance on the original data and on the synthetic data generated by PATEGAN
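As a rough illustration of the two metrics that utils.py returns, the sketch below computes AUC (area under the ROC curve) and APR (average precision) in plain Python. It is not the repository's code: the function names auc_score and average_precision are hypothetical, the scores are assumed to have no ties when computing APR, and the actual utils.py may rely on a library instead (scikit-learn, for example, provides roc_auc_score and average_precision_score).

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """APR as the mean of precision measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    if hits == 0:
        raise ValueError("average precision needs at least one positive label")
    return total / hits


# Toy example: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))                    # 0.75
print(round(average_precision(labels, scores), 4))  # 0.8333
```

Comparing these two numbers for a model trained on the original data against one trained on the synthetic data gives the kind of comparison that main_pategan_experiment.py reports.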

Command inputs:

  • data_no: number of generated data samples
  • data_dim: number of data dimensions
  • noise_rate: noise ratio on data
  • iterations: number of iterations for handling initialization randomness
  • n_s: the number of student training iterations
  • batch_size: the batch size for training the student and generator
  • k: the number of teachers
  • epsilon: differential privacy parameter (epsilon)
  • delta: differential privacy parameter (delta)
  • lamda: PATE noise size
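The inputs above could be declared with argparse roughly as follows. This is a sketch, not the parser in main_pategan_experiment.py: the build_parser helper is hypothetical, the argument types are assumptions inferred from the descriptions, and the defaults are copied from the example command in this README, so they are not necessarily the script's own defaults.

```python
import argparse


def build_parser():
    """Build a parser for the command inputs listed in this README (sketch)."""
    parser = argparse.ArgumentParser(description="PATEGAN experiment (illustrative)")
    # Types are assumed; defaults mirror the README's example command.
    parser.add_argument("--data_no", type=int, default=10000,
                        help="number of generated data samples")
    parser.add_argument("--data_dim", type=int, default=10,
                        help="number of data dimensions")
    parser.add_argument("--noise_rate", type=float, default=1.0,
                        help="noise ratio on data")
    parser.add_argument("--iterations", type=int, default=50,
                        help="number of iterations for handling initialization randomness")
    parser.add_argument("--n_s", type=int, default=1,
                        help="number of student training iterations")
    parser.add_argument("--batch_size", type=int, default=64,
                        help="batch size for training the student and generator")
    parser.add_argument("--k", type=int, default=100,
                        help="number of teachers")
    parser.add_argument("--epsilon", type=float, default=100.0,
                        help="differential privacy parameter (epsilon)")
    parser.add_argument("--delta", type=float, default=0.0001,
                        help="differential privacy parameter (delta)")
    parser.add_argument("--lamda", type=float, default=1.0,
                        help="PATE noise size")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--k", "50", "--epsilon", "1.0"])
print(args.k, args.epsilon, args.data_no)  # 50 1.0 10000
```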

Note that hyper-parameters should be optimized for different datasets.
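To give some intuition for how k and lamda interact, the sketch below shows a generic PATE-style noisy vote: each of the k teachers casts a binary vote, Laplace noise is added to the two vote counts, and the noisy majority becomes the label used to train the student. This is a simplified illustration, not the code in pate_gan.py. The helper names laplace_noise and noisy_vote are hypothetical, and treating lamda as the scale of the Laplace noise is an assumption about the convention, so check pate_gan.py and the paper for the exact parameterization.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    # expovariate takes a rate, so rate 1/scale gives an exponential with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote(teacher_votes, lamda, rng):
    """Aggregate binary teacher votes (0 or 1) into one noisy label."""
    n1 = sum(teacher_votes)            # teachers voting 1
    n0 = len(teacher_votes) - n1       # teachers voting 0
    # ASSUMPTION: lamda is used here as the Laplace noise scale.
    noisy_n0 = n0 + laplace_noise(lamda, rng)
    noisy_n1 = n1 + laplace_noise(lamda, rng)
    return int(noisy_n1 > noisy_n0)


rng = random.Random(0)
votes = [1] * 90 + [0] * 10            # k = 100 teachers, strong agreement
print(noisy_vote(votes, lamda=1.0, rng=rng))
```

Under this assumed convention, a larger noise scale makes it more likely that the noisy label differs from the true majority vote, especially when the teachers are closely split, which is one reason these hyper-parameters need tuning per dataset.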

Example command

$ python3 main_pategan_experiment.py --data_no 10000 --data_dim 10 --noise_rate 1.0 \
    --iterations 50 --n_s 1 --batch_size 64 --k 100 --epsilon 100 --delta 0.0001 \
    --lamda 1.0

Outputs

  • results: prediction performance on the original and synthetic data
  • train_data: original data
  • synth_train_data: synthetically generated data