
A flexible CTGAN that can condition on user-specified discrete variable(s)


cgpan/CTGAN_Flexible


A flexible CTGAN that can be customized by conditioning on user-specified variable(s)

0.0 Introduction

0.1 Purpose

This repo modifies the official CTGAN source code so that CTGAN can be conditioned on any user-specified discrete column(s). It was built for research purposes.

To Do:

  • How can we verify that the synthetic data were actually generated by conditioning on the user-specified discrete variables?
  • The loss function pushes every column in the fake data to be as similar as possible to the real data. Will this affect the research purpose (i.e., conditioning on specific column(s))?
  • How can we control whether the synthetic data adhere to the min/max boundaries of the real data?
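
As a starting point for the first and third questions, here is a minimal sketch of two checks that can be run on the output. It is not part of this repo: the helper names joint_tvd and out_of_range_fraction are hypothetical, and the only assumption is that the real and synthetic data are pandas DataFrames with the same column names.

```python
import pandas as pd


def joint_tvd(real, synthetic, columns):
    """Total variation distance between the joint category frequencies
    of `columns` in the real and synthetic data (0 = identical, 1 = disjoint)."""
    p = real[columns].value_counts(normalize=True)
    q = synthetic[columns].value_counts(normalize=True)
    # Category combinations missing on one side are treated as frequency 0.
    return 0.5 * float(p.sub(q, fill_value=0.0).abs().sum())


def out_of_range_fraction(real, synthetic, column):
    """Fraction of synthetic values that fall outside the real min/max
    of a numeric column."""
    lo, hi = real[column].min(), real[column].max()
    outside = (synthetic[column] < lo) | (synthetic[column] > hi)
    return float(outside.mean())
```

A small distance on the conditioned columns (for example ["sex", "race"]) suggests their joint distribution was preserved, although it does not prove that the conditioning mechanism was the cause. If out-of-range values turn out to matter, clipping with pandas Series.clip is one possible post-processing step.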

0.2 How I modified the original CTGAN

To be updated. For now, see section 1 of 01_tracing_source_code.ipynb in the Tests_PAN folder for details.

0.3 Update memo

  • 2024.11.01: Modified data_sampler.py, data_transformer.py, and ctgan.py so that the code accepts user-specified discrete column(s), and ensured that the modified modules run correctly.

1.0 Usage

The usage is similar to the original version of CTGAN. The only difference is an extra parameter, user_specified_col, added to classes such as CTGAN and DataSampler; in the example below it is passed to CTGAN.fit.
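
To illustrate the idea behind the extra parameter, here is a minimal sketch of condition sampling restricted to a chosen subset of columns. It is an assumption about the general approach, not the code in this repo's data_sampler.py: the function sample_condition and its arguments are hypothetical, and the log-frequency weighting of categories follows the description in the CTGAN paper.

```python
import numpy as np


def sample_condition(category_counts, allowed_columns, rng):
    """Pick one (column, category index) pair to condition on.

    category_counts: dict mapping column name -> list of positive category counts.
    allowed_columns: the user-specified subset of discrete columns.
    """
    # Only the user-specified columns are candidates for the condition.
    column = str(rng.choice(allowed_columns))
    counts = np.asarray(category_counts[column], dtype=float)
    # Log-frequency weighting gives rare categories a fairer chance of being drawn.
    log_freq = np.log(counts + 1.0)
    probs = log_freq / log_freq.sum()
    category = int(rng.choice(len(counts), p=probs))
    return column, category
```

For example, with allowed_columns=["sex", "race"], a column such as income is never selected as the condition, even though it is still modeled as a discrete column.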

To avoid package conflicts, uninstall the ctgan library from your environment if it is already installed.

pip uninstall ctgan
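
After uninstalling, it can be worth confirming which copy of a module Python would actually import. The sketch below uses only the standard library; module_location is a hypothetical helper name, not something provided by ctgan.

```python
import importlib.util
from pathlib import Path


def module_location(name):
    """Return the file a top-level module would be imported from,
    or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()
```

Calling module_location("ctgan") from your project should return a path inside the downloaded repo rather than one inside site-packages.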

Download this repo into your project directory; you can then use it in the same way as the original ctgan. For example:

# import the modules (only the ones this example actually uses)
from ctgan import CTGAN
from ctgan import load_demo
import matplotlib.pyplot as plt

# load the data
real_data = load_demo()

# specify all discrete columns
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

# specify the discrete column(s) you want to condition on
# this must be a list of column-name strings
controlled_col = ["sex", "race"]

# instantiate a CTGAN object
ctgan = CTGAN(epochs=50, cuda=True, verbose=True)

# fit ctgan model
ctgan.fit(real_data, discrete_columns, user_specified_col=controlled_col)

# plot the training loss
loss_df = ctgan.loss_values  # Retrieve the loss DataFrame

plt.figure(figsize=(10, 6))
plt.plot(loss_df['Epoch'], loss_df['Generator Loss'], label='Generator Loss', color='blue')
plt.plot(loss_df['Epoch'], loss_df['Discriminator Loss'], label='Discriminator Loss', color='orange')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('CTGAN Training Loss Over Epochs')
plt.legend()
plt.grid(True)
plt.show()

# generate synthetic data
synthetic_data = ctgan.sample(32561)

By default, user_specified_col=None. If you do not specify any columns, the model behaves the same as the original ctgan.
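
The default behavior can be pictured with the following sketch. It is illustrative only: resolve_condition_columns is a hypothetical helper, and the requirement that the user-specified columns be a subset of the discrete columns is an assumption rather than documented behavior of this repo.

```python
def resolve_condition_columns(discrete_columns, user_specified_col=None):
    """Return the columns that conditioning may draw from."""
    if user_specified_col is None:
        # No restriction: fall back to all discrete columns, as in the original CTGAN.
        return list(discrete_columns)
    if isinstance(user_specified_col, str):
        raise TypeError("user_specified_col must be a list of column names, not a string")
    # Assumption: conditioning only makes sense on declared discrete columns.
    unknown = [c for c in user_specified_col if c not in discrete_columns]
    if unknown:
        raise ValueError(f"not listed in discrete_columns: {unknown}")
    return list(user_specified_col)
```

Checking the input up front like this surfaces a misspelled column name before a long training run starts.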

You can run 01_tracing_source_code.ipynb in the Tests_PAN folder to see how the model performs.

2.0 SDV/CTGAN Official Resources

CTGAN is a collection of Deep Learning based synthetic data generators for single table data, which are able to learn from real data and generate synthetic data with high fidelity.

Important Links

  • 💻 Website: Check out the SDV Website for more information about our overall synthetic data ecosystem.
  • 📙 Blog: A deeper look at open source, synthetic data creation and evaluation.
  • 📖 Documentation: Quickstarts, User and Development Guides, and API Reference.
  • Repository: The link to the GitHub repository of this library.
  • ⌨️ Development Status: This software is in its Pre-Alpha stage.
  • Community: Join our Slack Workspace for announcements and discussions.

Currently, this library implements the CTGAN and TVAE models described in the Modeling Tabular data using Conditional GAN paper, presented at the 2019 NeurIPS conference.
