This repo modifies the official CTGAN source code so that CTGAN can be conditioned on any user-specified discrete columns, for research purposes.
To Do:
- How can we verify that the synthetic data were actually generated by conditioning on the user-specified discrete variables?
- The loss function pushes every column in the fake data to be as similar as possible to the real data. Will this interfere with the research purpose (i.e., conditioning on specific column(s))?
- How can we control whether the synthetic data should adhere to the same min/max boundaries as the real data?
TO BE UPDATED: see section 1 in 01_tracing_source_code.ipynb under the folder Tests_PAN for details.
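For the first To Do item, one possible sanity check (a minimal sketch, not part of this repo) is to compare the category frequencies of each conditioned column in the real and synthetic data, for example with the total variation distance. The helper names `category_shares` and `total_variation_distance` are hypothetical, and the toy lists stand in for columns such as `real_data["sex"]` and `synthetic_data["sex"]`.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category in an iterable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two categorical distributions.

    0 means identical category shares, 1 means no overlap at all.
    """
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories)


# Toy stand-ins (assumed values) for a conditioned column in real and synthetic data.
real_sex = ["Male"] * 6 + ["Female"] * 4
synthetic_sex = ["Male"] * 5 + ["Female"] * 5

tvd = total_variation_distance(real_sex, synthetic_sex)
print(f"TVD for 'sex': {tvd:.2f}")
```

A small distance only shows that the marginal distribution of the conditioned column is preserved; it does not by itself prove that the generator used the condition, so it is best read as a necessary check rather than a sufficient one.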
- 2024.11.01: Modified data_sampler.py, data_transformer.py, and ctgan.py so that the code accepts user-specified discrete column(s), and verified that the modified modules run correctly.
The usage is similar to the original version of CTGAN; I added an extra parameter, user_specified_col=, to class objects like CTGAN and DataSampler.
To avoid package conflicts, please uninstall the ctgan library from your environment if you have already installed it:
pip uninstall ctgan
Download this repo into your project directory; you can then use it in the same way as the original ctgan.
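To confirm that Python will pick up the downloaded copy rather than a leftover installed package, a quick check like the following may help. This is a sketch using only the standard library; `module_origin` is a hypothetical helper name, not something this repo provides.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Run this from your project directory. The printed path should point into the
# downloaded repo, not into site-packages. None means no ctgan module was found.
print(module_origin("ctgan"))
```

If the path points into site-packages, the installed library is still shadowing the local copy and should be uninstalled first.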
One example is:
# import the modules
from ctgan import CTGAN
from ctgan import load_demo
from ctgan.data_sampler import DataSampler
from ctgan.data_transformer import DataTransformer
import seaborn as sns
import matplotlib.pyplot as plt
# load the data
real_data = load_demo()
# specify all discrete columns
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]
# specify the discrete column(s) you want to condition on
# this must be a list of column-name strings
controlled_col = ["sex", "race"]
# instantiate a CTGAN object
ctgan = CTGAN(epochs=50, cuda=True, verbose=True)
# fit ctgan model
ctgan.fit(real_data, discrete_columns, user_specified_col=controlled_col)
# plot the training loss
loss_df = ctgan.loss_values # Retrieve the loss DataFrame
plt.figure(figsize=(10, 6))
plt.plot(loss_df['Epoch'], loss_df['Generator Loss'], label='Generator Loss', color='blue')
plt.plot(loss_df['Epoch'], loss_df['Discriminator Loss'], label='Discriminator Loss', color='orange')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('CTGAN Training Loss Over Epochs')
plt.legend()
plt.grid(True)
plt.show()
# generate synthetic data
synthetic_data = ctgan.sample(32561)
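Regarding the min/max question in the To Do list, one simple option is to clip numeric columns after sampling so they stay within the range observed in the real data. The sketch below assumes pandas DataFrames; `clip_to_real_range` is a hypothetical helper, not part of this repo, and the toy frames use made-up values in place of `real_data` and `synthetic_data`.

```python
import pandas as pd


def clip_to_real_range(real, synthetic, columns):
    """Return a copy of `synthetic` with the given numeric columns clipped
    to the min/max observed in `real`."""
    clipped = synthetic.copy()
    for col in columns:
        clipped[col] = clipped[col].clip(lower=real[col].min(), upper=real[col].max())
    return clipped


# Toy frames with assumed values, standing in for real_data and synthetic_data.
real = pd.DataFrame({"age": [17, 35, 90], "hours-per-week": [1, 40, 99]})
synthetic = pd.DataFrame({"age": [12, 40, 95], "hours-per-week": [-3, 38, 120]})

bounded = clip_to_real_range(real, synthetic, ["age", "hours-per-week"])
print(bounded)
```

Clipping is a post-processing step, so it does not change what the model learns, and it can pile values up at the boundaries. Whether that is acceptable depends on the research question.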
By default, user_specified_col=None. If you do not specify any columns, the model behaves the same as the original ctgan.
You can run 01_tracing_source_code.ipynb under the folder Tests_PAN to see the performance.
CTGAN is a collection of Deep Learning based synthetic data generators for single table data, which are able to learn from real data and generate synthetic data with high fidelity.
Important Links | |
---|---|
💻 Website | Check out the SDV Website for more information about our overall synthetic data ecosystem. |
📙 Blog | A deeper look at open source, synthetic data creation and evaluation. |
📖 Documentation | Quickstarts, User and Development Guides, and API Reference. |
Repository | The link to the GitHub Repository of this library. |
⌨️ Development Status | This software is in its Pre-Alpha stage. |
Slack | Join our Slack Workspace for announcements and discussions. |
Currently, this library implements the CTGAN and TVAE models described in the Modeling Tabular data using Conditional GAN paper, presented at the 2019 NeurIPS conference.