A flexible CTGAN that can be customized by conditioning on user-specified variable(s)

0.0 Introduction

0.1 Purpose

This repo modifies the official CTGAN source code so that CTGAN can be conditioned on any user-specified discrete column(s), for research purposes.

To Do:

How can we verify that the synthetic data were actually generated by conditioning on the user-specified discrete variables?
The loss function pushes every column of the fake data to resemble the real data. Does this affect the research purpose (i.e., conditioning on specific column(s))?
How can we control whether the synthetic data must stay within the min/max bounds of the real data?
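One possible way to start probing the first and third questions is sketched below. It is not part of this repo: the helper names `total_variation_distance` and `out_of_range_fraction` are hypothetical, and the toy values only stand in for real and synthetic columns. The sketch assumes non-empty inputs and uses only the standard library, so it says nothing about how the model itself was trained.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute difference in category shares (0 = identical, 1 = disjoint)."""
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories)


def out_of_range_fraction(real_values, synthetic_values):
    """Share of synthetic values that fall outside the real data's [min, max] range."""
    real_values = list(real_values)
    synthetic_values = list(synthetic_values)
    low, high = min(real_values), max(real_values)
    outside = sum(1 for value in synthetic_values if value < low or value > high)
    return outside / len(synthetic_values)


# Toy stand-ins for one conditioned discrete column and one continuous column.
real_sex = ["Male", "Male", "Female", "Female"]
synthetic_sex = ["Male", "Male", "Male", "Female"]
real_age = [17, 30, 90]
synthetic_age = [16, 25, 95, 40]

tvd = total_variation_distance(real_sex, synthetic_sex)
outside = out_of_range_fraction(real_age, synthetic_age)
print(f"TVD on conditioned column: {tvd:.2f}")
print(f"Synthetic values outside real range: {outside:.0%}")
```

With pandas DataFrames, the same functions can be fed `real_data["sex"]` and `synthetic_data["sex"]`. To check several conditioned columns jointly, pass tuples such as `zip(real_data["sex"], real_data["race"])`. Comparing the distance for conditioned and non-conditioned columns is one rough signal of whether the conditioning had an effect.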

0.2 How did I modify the original CTGAN

TO BE UPDATED: see section 1 in 01_tracing_source_code.ipynb under the folder Tests_PAN for details.
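For orientation, here is a minimal sketch of what restricting CTGAN-style condition sampling to a chosen set of columns could look like. It is not the code in this repo: `sample_condition` is a hypothetical helper written with the standard library only, and the log-frequency weighting is an assumption modeled on the training-by-sampling idea from the CTGAN paper.

```python
import math
import random
from collections import Counter


def sample_condition(rows, conditioning_columns, rng):
    """Pick one (column, category) pair to condition a training sample on.

    rows: list of dicts mapping column name -> category value.
    conditioning_columns: the only columns that may be chosen.
    rng: a random.Random instance, so draws are reproducible.
    """
    # Choose uniformly among the allowed columns only.
    column = rng.choice(list(conditioning_columns))
    # Weight each category by log(count + 1) so rare categories are still drawn.
    counts = Counter(row[column] for row in rows)
    categories = list(counts)
    weights = [math.log(counts[category] + 1) for category in categories]
    category = rng.choices(categories, weights=weights, k=1)[0]
    return column, category


# Toy rows standing in for the discrete part of a table.
toy_rows = [
    {"sex": "Male", "race": "White", "income": "<=50K"},
    {"sex": "Female", "race": "Black", "income": ">50K"},
    {"sex": "Male", "race": "White", "income": "<=50K"},
]
rng = random.Random(0)
draws = [sample_condition(toy_rows, ["sex", "race"], rng) for _ in range(200)]
print(Counter(column for column, _ in draws))
```

In this sketch the column `income` is never selected as a condition, because it is not in the allowed list. The actual changes live in data_sampler.py, data_transformer.py, and ctgan.py and may differ from this illustration.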

0.3 Update memo

2024.11.01: Modified the source in data_sampler.py, data_transformer.py, and ctgan.py so that the code accepts user-specified discrete column(s); checked that the modified module runs.

1.0 Usage

Usage is similar to the original version of CTGAN. I added an extra parameter, user_specified_col, to classes such as CTGAN and DataSampler; in the example below it is passed to CTGAN.fit.

To avoid package conflicts, uninstall the ctgan library from your environment if it is already installed.

pip uninstall ctgan

Download this repo into your project directory; you can then use it like the original ctgan. For example:

# import the modules
from ctgan import CTGAN
from ctgan import load_demo
from ctgan.data_sampler import DataSampler
from ctgan.data_transformer import DataTransformer
import seaborn as sns
import matplotlib.pyplot as plt

# load the data
real_data = load_demo()

# specify all discrete columns
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

# specify the discrete column(s) you want to condition on
# this must be a list of column-name strings
controlled_col = ["sex", "race"]

# instantiate a CTGAN object
ctgan = CTGAN(epochs=50, cuda=True, verbose=True)

# fit ctgan model
ctgan.fit(real_data, discrete_columns, user_specified_col=controlled_col)

# plot the training loss
loss_df = ctgan.loss_values  # Retrieve the loss DataFrame

plt.figure(figsize=(10, 6))
plt.plot(loss_df['Epoch'], loss_df['Generator Loss'], label='Generator Loss', color='blue')
plt.plot(loss_df['Epoch'], loss_df['Discriminator Loss'], label='Discriminator Loss', color='orange')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('CTGAN Training Loss Over Epochs')
plt.legend()
plt.grid(True)
plt.show()

# generate synthetic data
synthetic_data = ctgan.sample(32561)

By default, user_specified_col=None. If you do not specify any columns, the model behaves the same as the original ctgan.
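The rules stated above (a list of column-name strings, with None as the default) can be checked before training with a small helper. The sketch below is only an illustration: `resolve_conditioning_columns` is a hypothetical function, not part of this repo or of ctgan, and it assumes that None means falling back to all discrete columns and that every conditioned column must also appear in `discrete_columns`.

```python
def resolve_conditioning_columns(discrete_columns, user_specified_col=None):
    """Return the list of discrete columns to condition on, or raise on bad input."""
    if user_specified_col is None:
        # Assumed default: no restriction, every discrete column is a candidate.
        return list(discrete_columns)
    if isinstance(user_specified_col, str):
        raise TypeError("user_specified_col must be a list of strings, not a single string")
    if not all(isinstance(column, str) for column in user_specified_col):
        raise TypeError("every entry of user_specified_col must be a string")
    unknown = [column for column in user_specified_col if column not in discrete_columns]
    if unknown:
        raise ValueError(f"not listed in discrete_columns: {unknown}")
    return list(user_specified_col)


demo_discrete_columns = ["workclass", "sex", "race", "income"]
print(resolve_conditioning_columns(demo_discrete_columns))
print(resolve_conditioning_columns(demo_discrete_columns, ["sex", "race"]))
```

Calling the helper on `discrete_columns` and `controlled_col` before `ctgan.fit(...)` would surface a misspelled column name early instead of partway through training.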

You can run 01_tracing_source_code.ipynb under the folder Tests_PAN to see the performance.

2.0 SDV/CTGAN Official Resources

CTGAN is a collection of Deep Learning based synthetic data generators for single table data, which are able to learn from real data and generate synthetic data with high fidelity.

Important Links

💻 Website: Check out the SDV Website for more information about our overall synthetic data ecosystem.
📙 Blog: A deeper look at open source, synthetic data creation and evaluation.
📖 Documentation: Quickstarts, User and Development Guides, and API Reference.
Repository: The link to the GitHub Repository of this library.
⌨️ Development Status: This software is in its Pre-Alpha stage.
Community: Join our Slack Workspace for announcements and discussions.
Currently, this library implements the CTGAN and TVAE models described in the Modeling Tabular data using Conditional GAN paper, presented at the 2019 NeurIPS conference.
