diff --git a/2.0/404.html b/2.0/404.html new file mode 100644 index 00000000..65901415 --- /dev/null +++ b/2.0/404.html @@ -0,0 +1,912 @@ + + + +
+ + + + + + + + + + + + + +ydata-sdk
is available through PyPI, allowing easy installation and integration with data science programming environments (Google Colab, Jupyter Notebooks, Visual Studio Code, PyCharm) and stack (pandas
, numpy
, scikit-learn
).
Currently, the package supports Python versions from 3.9 up to 3.12, and can be installed on Windows, Linux, or macOS.
+Before installing the package, we recommend creating a virtual or conda
environment:
The above command creates and activates a new environment called "synth-env" with Python version 3.12.X. In the new environment, you can then install ydata-sdk
:
+Installing ydata-synthetic – +5min – Step-by-step installation guide
+To install inside a Google Colab notebook, you can use the following:
+ +Make sure your Google Colab is running Python versions >=3.9, <=3.12
. Learn how to configure Python versions on Google Colab here.
YData-Synthetic
is an open-source package developed in 2020 with the primary goal of educating users about generative models for synthetic data generation.
+Designed as a collection of models, it was intended for exploratory studies and educational purposes.
+However, it was not optimized for the quality, performance, and scalability needs typically required by organizations.
We are now ydata-sdk!
+Even though the journey was fun and we have learned a lot from the community, it is now time to upgrade ydata-synthetic
.
Heading towards the future of synthetic data generation, we recommend that users transition to ydata-sdk
, which provides a superior experience with enhanced performance,
+precision, and ease of use, making it the preferred tool for synthetic data generation and a perfect introduction to Generative AI.
Tabular data does not have a temporal dependence, and can be structured and organized in a table-like format, where features are represented in columns, whereas observations correspond to the rows.
+Additionally, tabular data usually comprises both numeric and categorical features. Numeric features are those that encode quantitative values, whereas categorical features represent qualitative measurements. Categorical features can be further divided into ordinal, binary or boolean, and nominal features.
+Learn more about synthesizing tabular data in this article, or check the quickstart guide to get started with the synthesization of tabular datasets.
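The numeric/categorical split described above can be illustrated with a minimal sketch in plain Python. The helper `split_columns` and the sample rows are hypothetical and are not part of ydata-sdk; the sketch simply treats a column as numeric when all of its values are numbers and as categorical otherwise.

```python
def split_columns(rows):
    """Partition column names into numeric and categorical by inspecting their values."""
    num_cols, cat_cols = [], []
    for name in rows[0]:
        values = [row[name] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        (num_cols if is_numeric else cat_cols).append(name)
    return num_cols, cat_cols


# Illustrative rows only; the column names are assumptions for this example.
rows = [
    {"age": 39, "hours-per-week": 40.0, "workclass": "Private", "married": True},
    {"age": 50, "hours-per-week": 13.0, "workclass": "Self-emp", "married": False},
]
num_cols, cat_cols = split_columns(rows)
print(num_cols)
print(cat_cols)
```

A value-based heuristic like this will misclassify categorical features that are stored as integer codes, so in practice the two lists are usually reviewed by hand.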
+Time-series data exhibit a sequential, temporal dependency between records, and may present a wide range of patterns and trends, including seasonality (patterns that repeat at calendar periods -- days, weeks, months -- such as holiday sales, for instance) or periodicity (patterns that repeat over time).
+Read more about generating time-series data in this article and check this quickstart guide to get started with time-series data synthesization.
+Multi-Table data or databases exhibit referential behaviour between tables and a database schema that is expected to be replicated and respected by the generated synthetic data. +Read more about database synthetic data generation in this article and check this quickstart guide for Multi-Table synthetic data generation.
+Validating the quality of synthetic data is essential to ensure its usefulness and privacy. YData Fabric provides tools for comprehensive synthetic data evaluation through:
+Profile Comparison Visualization: +Fabric delivers side-by-side visual comparisons of key data properties (e.g., distributions, correlations, and outliers) between synthetic and original datasets, allowing users to assess fidelity at a glance.
+PDF Report with Metrics: +Fabric generates a PDF report that includes key metrics to evaluate:
+Fidelity: How closely synthetic data matches the original.
+These tools ensure a thorough validation of synthetic data quality, making it reliable for real-world use.
+With the upcoming update of ydata-synthetic
to ydata-sdk
, users will now have access to a single API that automatically selects and optimizes
+the best generative model for their data. This streamlined approach eliminates the need to choose between
+various models manually, as the API intelligently identifies the optimal model based on the specific dataset and use case.
Instead of having to manually select from models such as:
+The new API handles model selection automatically, optimizing for the best performance in fidelity, utility, and privacy. +This significantly simplifies the synthetic data generation process, ensuring that users get the highest quality output without +the need for manual intervention and tiring hyperparameter tuning.
+ + + + + + + +Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically unit tests for your data.
+Expectations are assertions about your data. In Great Expectations, those assertions are expressed in a declarative language in the form of simple, human-readable Python methods. For example, in order to assert that you want values in a column passenger_count
in your dataset to be integers between 1 and 6, you can say:
Great Expectations then uses this statement to validate whether the column passenger_count
in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides several dozen highly expressive built-in Expectations, and allows you to write custom Expectations.
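To make the idea of an Expectation concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API: `expect_values_between` is a hypothetical helper written for this illustration, and it only mimics the success/failure style of result described above.

```python
def expect_values_between(values, min_value, max_value):
    """Hypothetical helper: check that every value is an integer in [min_value, max_value]."""
    unexpected = [
        v for v in values
        if not (isinstance(v, int) and min_value <= v <= max_value)
    ]
    return {"success": not unexpected, "unexpected_values": unexpected}


# Illustrative passenger_count values; 0 and 7 fall outside the expected range.
passenger_count = [1, 2, 6, 3, 0, 7]
result = expect_values_between(passenger_count, min_value=1, max_value=6)
print(result)
```

The real library returns a much richer validation result, but the principle is the same: a declarative assertion is evaluated against a column and reports which values violate it.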
Great Expectations renders Expectations to clean, human-readable documentation called Data Docs. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run – think of it as a continuously updated data quality report.
+!!! note Outdated
+ From ydata-synthetic vx onwards this example will no longer work. Please check ydata-sdk
and synthetic data generation examples.
We recommend you create a virtual environment and install ydata-synthetic and great-expectations by running the following command on your terminal.
+ +In this example, we'll use CTGAN to synthesize samples from the Adult Census Income dataset:
+from pmlb import fetch_data
+
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+
+# Load data and define the data processor parameters
+data = fetch_data('adult')
+num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
+cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
+ 'native-country', 'target']
+
+# Defining the training parameters
+batch_size = 500
+epochs = 500+1
+learning_rate = 2e-4
+beta_1 = 0.5
+beta_2 = 0.9
+
+ctgan_args = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2))
+
+train_args = TrainParameters(epochs=epochs)
+synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
+synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)
+
+# Sample from the trained synthesizer and save the synthetic data
+synth_data = synth.sample(1000)
+synth_data.to_csv('data/adult_synthetic.csv', index=False)
+
Import the great_expectations
module, create a data context, and connect to your synthetic data:
import great_expectations as gx
+
+# Initialize data context
+context = gx.get_context()
+
+# Connect to the synthetic data
+validator = context.sources.pandas_default.read_csv(
+ "data/adult_synthetic.csv"
+)
+
You can create Expectation Suites by writing out individual statements, such as the ones below, by using Profilers and Data Assistants or even Custom Profilers.
+# Create expectations
+validator.expect_column_values_to_not_be_null("age")
+validator.expect_column_values_to_be_between("workclass", auto=True)
+validator.save_expectation_suite()
+
To validate your data, define a checkpoint and examine the data to determine if it matches the defined Expectations:
+# Validate the synthetic data
+checkpoint = context.add_or_update_checkpoint(
+ name="synthetic_data_checkpoint",
+ validator=validator,
+)
+
And use the following code to view an HTML representation of the Validation results:
+ + + + + + + + +Having a hard time installing ydata-sdk on your laptop? + +Installing and generating synthetic data with ydata-sdk –
+Still have questions about Python versions or how to get started? Check this blog post!
+If you are just starting, or you are dealing with something new, we are always eager to help!
+Join us in the Data-Centric AI community Discord, we have a space reserved for all your questions about ydata-synthetic! Don't be shy 😳
+ + + + + + + +Depending on your use case, the downstream application of your synthetic data, and the characteristics of your original data, you will need to adjust your synthesis process accordingly. That often involves performing thorough data preparation and fitting your generation models appropriately.
+Tip
+For a use-case oriented UI experience, try YData Fabric. From an interactive and complete data profiling to an efficient synthetization, your data preparation process will be seamlessly adjusted to your data characteristics.
+The most appropriate metrics to evaluate the quality of your synthetic data are also dependent on the goal for which synthetic data will be used. Nevertheless, we may define three essential pillars for synthetic data quality: privacy, fidelity, and utility:
+Privacy refers to the ability of synthetic data to withhold any personal, private, or sensitive information, avoiding connections being drawn to the original data and preventing data leakage;
+Fidelity concerns the ability of the new data to preserve the properties of the original data (in other words, it refers to "how faithful, how precise" is the synthetic data in comparison to real data);
+Finally, utility relates to the downstream application where the synthetic data will be used: if the synthetization process is successful, the same insights should be derived from the new data as from the original data.
+For each of these components, several specific statistical measures can be evaluated.
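As one illustration of such a statistical measure, the sketch below compares the mean and standard deviation of a single numeric column between real and synthetic data. This is only a toy fidelity check under assumed sample values; `fidelity_gap` is a hypothetical helper and is not the metric set used by YData Fabric.

```python
from statistics import mean, stdev


def fidelity_gap(real, synthetic):
    """Absolute differences in mean and standard deviation for one numeric column."""
    return {
        "mean_gap": abs(mean(real) - mean(synthetic)),
        "std_gap": abs(stdev(real) - stdev(synthetic)),
    }


# Illustrative values only.
real_age = [23, 35, 41, 52, 29, 60]
synthetic_age = [25, 33, 43, 50, 31, 58]
print(fidelity_gap(real_age, synthetic_age))
```

Small gaps suggest the marginal distribution is preserved for that column; they say nothing about correlations between columns, privacy, or downstream utility, which need their own measures.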
+Abstract
+To learn more about how to define specific trade-offs between privacy, fidelity, and utility, check out this white paper on Synthetic Data Quality Metrics.
+Most issues with installations are usually associated with unsupported Python versions or misalignment between python environments and package requirements.
+Let’s see how you can get both right:
+Note that ydata-sdk
currently requires Python >=3.9, < 3.13 so if you're trying to run our code in Google Colab, then you need to update your Google Colab’s Python version accordingly. The same goes for your development environment.
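A quick way to confirm that an interpreter falls inside this range is sketched below. The bounds are taken from the requirement stated above; the helper name `is_supported` is an assumption made for this example and is not part of ydata-sdk.

```python
import sys

# Supported range taken from the text above: Python >=3.9, <3.13.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 13)


def is_supported(version_info=sys.version_info):
    """Return True when the interpreter's (major, minor) is in the supported range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


current = ".".join(str(part) for part in sys.version_info[:3])
status = "supported" if is_supported() else "NOT supported"
print(f"Python {current} is {status}")
```

Running this in a Google Colab cell or in your local environment shows at a glance whether the Python version needs to be changed before installing the package.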
A lot of troubleshooting arises due to misalignments between environments and package requirements. +Virtual Environments isolate your installations from the "global" environment so that you don't have to worry about conflicts.
+Using conda, creating a new environment is as easy as running this on your shell:
+ +Now you can open up your Python editor or Jupyter Lab and use the synth-env as your development environment, without having to worry about conflicting versions or packages between projects!
+No. This is an unrealistic expectation because the TimeGAN architecture is not meant to replicate the long-term behavior of your data.
+TimeGAN works with the concept of "windows": it learns to map the data distribution of short-term frames of time, within the time windows you provide. It also considers that those windows are independent of each other, so it cannot return a temporal pattern most people expect.
+That's not supported by this architecture itself, but there are others that allow for both short-term and long-term synthesization, such as those available in YData Fabric.
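The windowing idea can be sketched in a few lines of plain Python. This is an illustration only: `make_windows` is a hypothetical helper, and the overlapping stride of 1 is an assumption rather than a statement about how any particular TimeGAN implementation prepares its data.

```python
def make_windows(series, window_size):
    """Split a sequence into overlapping windows of length window_size (stride 1)."""
    if window_size <= 0 or window_size > len(series):
        raise ValueError("window_size must be between 1 and len(series)")
    return [series[i:i + window_size] for i in range(len(series) - window_size + 1)]


# Illustrative series of seven daily values.
daily_sales = [10, 12, 15, 11, 9, 14, 18]
windows = make_windows(daily_sales, window_size=3)
print(len(windows))
print(windows[0], windows[-1])
```

Each window is then treated as an independent training example, which is why patterns longer than the window length cannot be learned or reproduced.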
+Abstract
+Learn more about how YData's Time-Series Synthetic Data Generation compares to TimeGAN in this dedicated post.
+Couldn't find what you need? Reach out to our dedicated team for a quick and syn-ple assistance!
+ + + + + + +Synthetic data is data that has been created artificially through computer simulation or that algorithms can generate to +take the place of real-world data. The data can be used as an alternative or supplement to real-world data when real-world +data is not readily available. It can also be used as a Machine Learning performance booster.
+The ydata-sdk package is a Python package developed by YData’s team that allows users to easily benefit from Generative AI +and generate synthetic data. The main goal of the package is to serve as a way for data +scientists to get familiar with synthetic data and its applications in real-world domains, as well as the potential of Generative AI.
+The ydata-sdk package provides different methods for generating synthetic tabular and time-series data, as well as databases.
+The package also aims to facilitate the exploration and understanding of synthetic data generation methods!
+** YData's Enterprise feature
+This feature is only available for users of YData Fabric.
+Sign up for the Fabric community and +try synthetic data generation from multiple tables, or contact us for more information.
+Multitable synthetic data enables the creation of large, diverse +datasets crucial for training robust machine learning models, algorithm testing, and addressing privacy concerns. It can be +crucial to enable proper data democratization within an organization.
+Nevertheless, generating a full database, or even several tables that share relations, can be particularly +challenging because referential integrity must be preserved across diverse tables and at scale. This involves maintaining +realistic relationships between entities to mirror real-world scenarios accurately while being able to process large volumes +of data.
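To show what preserving referential integrity means in practice, the sketch below checks whether every foreign key in a child table points to an existing row in its parent table. The table contents and the helper `orphaned_foreign_keys` are hypothetical and are not part of YData Fabric.

```python
def orphaned_foreign_keys(parent_rows, child_rows, parent_key, foreign_key):
    """Return child rows whose foreign key has no matching primary key in the parent table."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Illustrative tables only.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # no such customer: breaks referential integrity
]
orphans = orphaned_foreign_keys(customers, orders, "customer_id", "customer_id")
print(orphans)
```

A synthetic database that respects the original schema should produce an empty list for every parent/child relationship when checked this way.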
+YData Fabric offers a cutting-edge Synthetic data generation process that seamlessly integrates with your existing Relational databases. +By replicating the data's value and structure to a new target storage, Fabric delivers a wide range of benefits and use-cases. +These include reducing risk and improving compliance by substituting operational databases with synthetic databases for tests and development. It also enables QA teams to create comprehensive and more flexible testing scenarios.
+Explore Fabric multi-table synthesis capabilities:
+Outdated
+Note that this example won't work with the latest version of ydata-synthetic
.
Please check ydata-sdk
to see how to generate conditional synthetic data.
Using CGAN to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+CGAN is a deep learning model that combines GANs with conditional models to generate data samples based on specific conditions:
+Here’s an example of how to synthesize tabular data with CGAN using the Credit Card dataset:
+"""
+ CGAN architecture example file
+"""
+import pandas as pd
+from sklearn import cluster
+
+from ydata_synthetic.utils.cache import cache_file
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+
+#Read the original data and have it preprocessed
+data_path = cache_file('creditcard.csv', 'https://datahub.io/machine-learning/creditcard/r/creditcard.csv')
+data = pd.read_csv(data_path, index_col=[0])
+
+#Data processing and analysis
+num_cols = list(data.columns[ data.columns != 'Class' ])
+cat_cols = []
+
+print('Dataset columns: {}'.format(num_cols))
+sorted_cols = ['V14', 'V4', 'V10', 'V17', 'V12', 'V26', 'Amount', 'V21', 'V8', 'V11', 'V7', 'V28', 'V19',
+ 'V3', 'V22', 'V6', 'V20', 'V27', 'V16', 'V13', 'V25', 'V24', 'V18', 'V2', 'V1', 'V5', 'V15',
+ 'V9', 'V23', 'Class']
+processed_data = data[ sorted_cols ].copy()
+
+#For the purpose of this example we will only synthesize the minority class
+train_data = processed_data.loc[processed_data['Class'] == 1].copy()
+
+#Create a new class column using KMeans - This will mainly be useful if we want to leverage conditional GAN
+print("Dataset info: Number of records - {} Number of variables - {}".format(train_data.shape[0], train_data.shape[1]))
+algorithm = cluster.KMeans
+args, kwds = (), {'n_clusters':2, 'random_state':0}
+labels = algorithm(*args, **kwds).fit_predict(train_data[ num_cols ])
+
+fraud_w_classes = train_data.copy()
+fraud_w_classes['Class'] = labels
+
+#----------------------------
+# GAN Training
+#----------------------------
+
+#Define the Conditional GAN and training parameters
+noise_dim = 32
+dim = 128
+batch_size = 128
+beta_1 = 0.5
+beta_2 = 0.9
+
+log_step = 100
+epochs = 2 + 1
+learning_rate = 5e-4
+models_dir = '../cache'
+
+#Test here the new inputs
+gan_args = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2),
+ noise_dim=noise_dim,
+ layers_dim=dim)
+
+train_args = TrainParameters(epochs=epochs,
+ cache_prefix='',
+ sample_interval=log_step,
+ label_dim=-1,
+ labels=(0,1))
+
+#Bin the Amount column into 5 intervals and keep the integer bin codes
+fraud_w_classes['Amount'] = pd.cut(fraud_w_classes['Amount'], 5).cat.codes
+
+#Init the Conditional GAN providing the index of the label column as one of the arguments
+synth = RegularSynthesizer(modelname='cgan', model_parameters=gan_args)
+
+#Training the Conditional GAN
+synth.fit(data=fraud_w_classes, label_cols=["Class"], train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)
+
+#Saving the synthesizer
+synth.save('creditcard_cgan_model.pkl')
+
+#Loading the synthesizer
+synthesizer = RegularSynthesizer.load('creditcard_cgan_model.pkl')
+
+#Sampling from the synthesizer
+cond_array = pd.DataFrame(100*[1], columns=['Class'])
+# Synthesizer samples are returned in the original format (inverse_transform of internal processing already took place)
+sample = synthesizer.sample(cond_array)
+
+print(sample)
+
Outdated
+Note that this example won't work with the latest version of ydata-synthetic
.
Please check ydata-sdk
to see how to generate synthetic data.
Using CRAMER GAN to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+CRAMER GAN is a variant of GAN that employs the Cramer distance as a measure of similarity between real and generated data distributions to improve training stability and enhance sample quality:
+ +Here’s an example of how to synthesize tabular data with CRAMER GAN using the Credit Card dataset:
+"""
+ CramerGAN python file example
+"""
+#Install ydata-synthetic lib
+# pip install ydata-synthetic
+import sklearn.cluster as cluster
+import numpy as np
+import pandas as pd
+
+from ydata_synthetic.utils.cache import cache_file
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+
+#Read the original data and have it preprocessed
+data_path = cache_file('creditcard.csv', 'https://datahub.io/machine-learning/creditcard/r/creditcard.csv')
+data = pd.read_csv(data_path, index_col=[0])
+
+#Data processing and analysis
+num_cols = list(data.columns[ data.columns != 'Class' ])
+cat_cols = ['Class']
+
+print('Dataset columns: {}'.format(num_cols))
+sorted_cols = ['V14', 'V4', 'V10', 'V17', 'V12', 'V26', 'Amount', 'V21', 'V8', 'V11', 'V7', 'V28', 'V19', 'V3', 'V22', 'V6', 'V20', 'V27', 'V16', 'V13', 'V25', 'V24', 'V18', 'V2', 'V1', 'V5', 'V15', 'V9', 'V23', 'Class']
+processed_data = data[ sorted_cols ].copy()
+
+#For the purpose of this example we will only synthesize the minority class
+train_data = processed_data.loc[processed_data['Class'] == 1].copy()
+
+#Create a new class column using KMeans - This will mainly be useful if we want to leverage conditional GAN
+print("Dataset info: Number of records - {} Number of variables - {}".format(train_data.shape[0], train_data.shape[1]))
+algorithm = cluster.KMeans
+args, kwds = (), {'n_clusters':2, 'random_state':0}
+labels = algorithm(*args, **kwds).fit_predict(train_data[ num_cols ])
+
+print( pd.DataFrame( [ [np.sum(labels==i)] for i in np.unique(labels) ], columns=['count'], index=np.unique(labels) ) )
+
+fraud_w_classes = train_data.copy()
+fraud_w_classes['Class'] = labels
+
+# GAN training
+#Define the GAN and training parameters
+noise_dim = 32
+dim = 128
+batch_size = 128
+
+log_step = 100
+epochs = 500+1
+learning_rate = 5e-4
+beta_1 = 0.5
+beta_2 = 0.9
+models_dir = '../cache'
+
+model_parameters = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2),
+ noise_dim=noise_dim,
+ layers_dim=dim)
+
+train_args = TrainParameters(epochs=epochs,
+ sample_interval=log_step)
+
+#Training the CRAMERGAN model
+synth = RegularSynthesizer(modelname='cramer', model_parameters=model_parameters)
+synth.fit(data=train_data, train_arguments = train_args, num_cols = num_cols, cat_cols = cat_cols)
+
+#Saving the synthesizer to later generate new events
+synth.save(path='creditcard_cramergan_model.pkl')
+
+#########################################################
+# Loading and sampling from a trained synthesizer #
+#########################################################
+synth = RegularSynthesizer.load(path='creditcard_cramergan_model.pkl')
+#Sampling the data
+#Note that the returned data is not inverse processed.
+data_sample = synth.sample(100000)
+
Outdated
+Note that this example won't work with the latest version of ydata-synthetic
.
Please check ydata-sdk
to see how to generate synthetic data.
Using CTGAN to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+Additionally, real-world data usually comprises both numeric and categorical features. Numeric features are those that encode quantitative values, whereas categorical represent qualitative measurements.
+CTGAN was specifically designed to deal with the challenges posed by tabular datasets, handling mixed (numeric and categorical) data:
+Here’s an example of how to synthesize tabular data with CTGAN using the Adult Census Income dataset:
+from pmlb import fetch_data
+
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+
+# Load data and define the data processor parameters
+data = fetch_data('adult')
+num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
+cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
+ 'native-country', 'target']
+
+# Defining the training parameters
+batch_size = 500
+epochs = 500+1
+learning_rate = 2e-4
+beta_1 = 0.5
+beta_2 = 0.9
+
+ctgan_args = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2))
+
+train_args = TrainParameters(epochs=epochs)
+synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
+synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)
+
+synth.save('adult_ctgan_model.pkl')
+
+#########################################################
+# Loading and sampling from a trained synthesizer #
+#########################################################
+synth = RegularSynthesizer.load('adult_ctgan_model.pkl')
+synth_data = synth.sample(1000)
+print(synth_data)
+
Generate the best synthetic data quality
+If you are having a hard time ensuring that CTGAN delivers the synthetic data quality that you need for your use case, +give YData Fabric Synthetic Data a try. +Fabric Synthetic Data generation is considered the best in terms of quality. +Read more about it in this benchmark.
+CTGAN, as any other Machine Learning model, requires optimization at the level of the data preparation as well as +hyperparameter tuning. Here follows a list of best-practices and tips to improve your synthetic data quality:
+Understand Your Data: +Thoroughly understand the characteristics and distribution of your original dataset before using CTGAN. +Identify important features, correlations, and patterns in the data. +Use ydata-profiling to automate the process of understanding your data.
+Data Preprocessing: +Clean and preprocess your data to handle missing values, outliers, and other anomalies before training CTGAN. +Standardize or normalize numerical features to ensure consistent scales.
+Feature Engineering: +Create additional meaningful features that could improve the quality of the synthetic data.
+Optimize Model Parameters: +Experiment with CTGAN hyperparameters such as epochs, batch_size, and gen_dim to find the values that work best +for your specific dataset. +Fine-tune the learning rate for better convergence.
+Conditional Generation: +Leverage the conditional generation capabilities of CTGAN by specifying conditions for certain features if applicable. +Adjust the conditioning mechanism to enhance the relevance of generated samples.
+Handle Imbalanced Data: +If your original dataset is imbalanced, ensure that CTGAN captures the distribution of minority classes effectively. +Adjust sampling strategies if needed.
+Use Larger Datasets: +Train CTGAN on larger datasets when possible to capture a more comprehensive representation of the underlying data distribution.
+Outdated
+Note that this example won't work with the latest version of ydata-synthetic
.
Please check ydata-sdk
to see how to generate conditional synthetic data.
Using CWGAN-GP to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+CWGAN-GP is a variant of GAN that incorporates conditional information to generate data samples, while leveraging the Wasserstein distance to improve training stability and sample quality:
+ +Here’s an example of how to synthesize tabular data with CWGAN-GP using the Credit Card dataset:
+"""
+ CramerGAN python file example
+"""
+#Install ydata-synthetic lib
+# pip install ydata-synthetic
+import sklearn.cluster as cluster
+import numpy as np
+import pandas as pd
+
+from ydata_synthetic.utils.cache import cache_file
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+
+#Read the original data and have it preprocessed
+data_path = cache_file('creditcard.csv', 'https://datahub.io/machine-learning/creditcard/r/creditcard.csv')
+data = pd.read_csv(data_path, index_col=[0])
+
+#Data processing and analysis
+num_cols = list(data.columns[ data.columns != 'Class' ])
+cat_cols = ['Class']
+
+print('Dataset columns: {}'.format(num_cols))
+sorted_cols = ['V14', 'V4', 'V10', 'V17', 'V12', 'V26', 'Amount', 'V21', 'V8', 'V11', 'V7', 'V28', 'V19', 'V3', 'V22', 'V6', 'V20', 'V27', 'V16', 'V13', 'V25', 'V24', 'V18', 'V2', 'V1', 'V5', 'V15', 'V9', 'V23', 'Class']
+processed_data = data[ sorted_cols ].copy()
+
+#For the purpose of this example we will only synthesize the minority class
+train_data = processed_data.loc[processed_data['Class'] == 1].copy()
+
+#Create a new class column using KMeans - This will mainly be useful if we want to leverage conditional GAN
+print("Dataset info: Number of records - {} Number of variables - {}".format(train_data.shape[0], train_data.shape[1]))
+algorithm = cluster.KMeans
+args, kwds = (), {'n_clusters':2, 'random_state':0}
+labels = algorithm(*args, **kwds).fit_predict(train_data[ num_cols ])
+
+print( pd.DataFrame( [ [np.sum(labels==i)] for i in np.unique(labels) ], columns=['count'], index=np.unique(labels) ) )
+
+fraud_w_classes = train_data.copy()
+fraud_w_classes['Class'] = labels
+
+# GAN training
+#Define the GAN and training parameters
+noise_dim = 32
+dim = 128
+batch_size = 128
+
+log_step = 100
+epochs = 500+1
+learning_rate = 5e-4
+beta_1 = 0.5
+beta_2 = 0.9
+models_dir = '../cache'
+
+model_parameters = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2),
+ noise_dim=noise_dim,
+ layers_dim=dim)
+
+train_args = TrainParameters(epochs=epochs,
+ sample_interval=log_step)
+
+#Training the CRAMERGAN model
+synth = RegularSynthesizer(modelname='cramer', model_parameters=model_parameters)
+synth.fit(data=train_data, train_arguments = train_args, num_cols = num_cols, cat_cols = cat_cols)
+
+#Saving the synthesizer to later generate new events
+synth.save(path='creditcard_cramergan_model.pkl')
+
+#########################################################
+# Loading and sampling from a trained synthesizer #
+#########################################################
+synth = RegularSynthesizer.load(path='creditcard_cramergan_model.pkl')
+#Sampling the data
+#Note that the returned data is not inverse processed.
+data_sample = synth.sample(100000)
+
Outdated
+Note that this example won't work with the latest version of ydata-synthetic
.
Please check ydata-sdk
to see how to generate synthetic data.
Using DRAGAN to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+DRAGAN is a GAN variant that uses a gradient penalty to improve training stability and mitigate mode collapse:
+Here’s an example of how to synthesize tabular data with DRAGAN using the Adult Census Income dataset:
+from pmlb import fetch_data
+
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+
+#Load data and define the data processor parameters
+data = fetch_data('adult')
+num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
+cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
+ 'native-country', 'target']
+
+# DRAGAN training
+#Defining the training parameters of DRAGAN
+noise_dim = 128
+dim = 128
+batch_size = 500
+
+log_step = 100
+epochs = 500+1
+learning_rate = 1e-5
+beta_1 = 0.5
+beta_2 = 0.9
+models_dir = '../cache'
+
+gan_args = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2),
+ noise_dim=noise_dim,
+ layers_dim=dim)
+
+train_args = TrainParameters(epochs=epochs,
+ sample_interval=log_step)
+
+synth = RegularSynthesizer(modelname='dragan', model_parameters=gan_args, n_discriminator=3)
+synth.fit(data = data, train_arguments = train_args, num_cols = num_cols, cat_cols = cat_cols)
+
+synth.save('adult_dragan_model.pkl')
+
+#########################################################
+# Loading and sampling from a trained synthesizer #
+#########################################################
+synthesizer = RegularSynthesizer.load('adult_dragan_model.pkl')
+synthesizer.sample(1000)
+
Outdated
+Note that this example won't work with the latest version of ydata-synthetic. Please check ydata-sdk to see how to generate synthetic data.
Using GMMs to generate tabular synthetic data:
+Real-world domains are often described by tabular data, i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+Gaussian Mixture Models (GMMs) are a type of probabilistic model, and probabilistic models can also be leveraged to generate synthetic data. GMMs do so by learning the original data distribution as a mixture of Gaussian distributions and then sampling new records from the fitted mixture.
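The fit-then-sample idea described above can be sketched with scikit-learn's `GaussianMixture`. This is a minimal illustration on toy data, not the ydata-synthetic API: the helper name `gmm_synthesize`, the toy two-cluster dataset, and the choice of two components are all assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_synthesize(real, n_components=2, n_samples=1000, seed=0):
    """Fit a Gaussian mixture to `real` and draw synthetic rows from it.

    Hypothetical helper for illustration only.
    """
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(real)
    # sample() returns the generated rows and the component each row came from
    synthetic, component_ids = gmm.sample(n_samples)
    return synthetic, component_ids


# Toy "real" data: two well-separated 2D clusters (assumed, not from the docs)
rng = np.random.default_rng(0)
real = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(300, 2)),
])

synthetic, component_ids = gmm_synthesize(real, n_components=2, n_samples=1000)
print(synthetic.shape)
```

The number of components is the main knob in this sketch: too few underfit multimodal data, while too many can start to memorize individual records.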
+Outdated
+Note that this example won't work with the latest version of ydata-synthetic. Please check ydata-sdk to see how to generate synthetic data.
Using WGAN to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+WGAN is a variant of GAN that utilizes the Wasserstein distance to improve training stability and generate higher quality samples:
+Here’s an example of how to synthesize tabular data with WGAN using the Credit Card dataset:
+#Install ydata-synthetic lib
+# pip install ydata-synthetic
+import sklearn.cluster as cluster
+import pandas as pd
+import numpy as np
+
+from ydata_synthetic.utils.cache import cache_file
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+
+#Read the original data and have it preprocessed
+data_path = cache_file('creditcard.csv', 'https://datahub.io/machine-learning/creditcard/r/creditcard.csv')
+data = pd.read_csv(data_path, index_col=[0])
+
+#Data processing and analysis
+num_cols = list(data.columns[ data.columns != 'Class' ])
+cat_cols = ['Class']
+
+print('Dataset columns: {}'.format(num_cols))
+sorted_cols = ['V14', 'V4', 'V10', 'V17', 'V12', 'V26', 'Amount', 'V21', 'V8', 'V11', 'V7', 'V28', 'V19', 'V3', 'V22', 'V6', 'V20', 'V27', 'V16', 'V13', 'V25', 'V24', 'V18', 'V2', 'V1', 'V5', 'V15', 'V9', 'V23', 'Class']
+processed_data = data[ sorted_cols ].copy()
+
+#For the purpose of this example we will only synthesize the minority class
+train_data = processed_data.loc[processed_data['Class'] == 1].copy()
+
+print("Dataset info: Number of records - {} Number of variables - {}".format(train_data.shape[0], train_data.shape[1]))
+algorithm = cluster.KMeans
+args, kwds = (), {'n_clusters':2, 'random_state':0}
+labels = algorithm(*args, **kwds).fit_predict(train_data[ num_cols ])
+
+print( pd.DataFrame( [ [np.sum(labels==i)] for i in np.unique(labels) ], columns=['count'], index=np.unique(labels) ) )
+
+fraud_w_classes = train_data.copy()
+fraud_w_classes['Class'] = labels
+
+# GAN training
+#Define the GAN and training parameters
+noise_dim = 32
+dim = 128
+batch_size = 128
+
+log_step = 100
+epochs = 500+1
+learning_rate = 5e-4
+beta_1 = 0.5
+beta_2 = 0.9
+models_dir = '../cache'
+
+model_parameters = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2),
+ noise_dim=noise_dim,
+ layers_dim=dim)
+
+train_args = TrainParameters(epochs=epochs,
+ sample_interval=log_step)
+
+test_size = 492 # number of fraud cases
+
+#Training the WGAN model
+synth = RegularSynthesizer(modelname='wgan', model_parameters=model_parameters, n_critic=10)
+synth.fit(data=train_data, train_arguments = train_args, num_cols = num_cols, cat_cols = cat_cols)
+
+#Saving the synthesizer to later generate new events
+synth.save(path='creditcard_wgan_model.pkl')
+
+#########################################################
+# Loading and sampling from a trained synthesizer #
+#########################################################
+synth = RegularSynthesizer.load(path='creditcard_wgan_model.pkl')
+
+#Sampling the data
+data_sample = synth.sample(100000)
+
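To build intuition for the Wasserstein distance that WGAN relies on, the sketch below computes it for one-dimensional samples, where it reduces to the mean absolute difference between sorted values. This is only an illustration: WGAN itself estimates the distance with a critic network rather than by sorting, and the helper name `wasserstein_1d` and the toy Gaussian samples are assumptions.

```python
import numpy as np


def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    # With equal sample sizes, the optimal transport plan pairs sorted values
    return float(np.mean(np.abs(x - y)))


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5000)
close = rng.normal(0.1, 1.0, size=5000)   # nearly the same distribution
far = rng.normal(3.0, 1.0, size=5000)     # clearly shifted distribution

print(wasserstein_1d(real, close), wasserstein_1d(real, far))
```

The distance grows smoothly as the two distributions move apart, which is the property that gives the generator a useful training signal even when real and synthetic samples barely overlap.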
Outdated
+Note that this example won't work with the latest version of ydata-synthetic. Please check ydata-sdk to see how to generate synthetic data.
Using WGAN-GP to generate tabular synthetic data:
+Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
+WGAN-GP is a variant of GAN that incorporates a gradient penalty term to enhance training stability and improve the diversity of generated samples:
+Here’s an example of how to synthesize tabular data with WGAN-GP using the Adult Census Income dataset:
+from pmlb import fetch_data
+
+from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+
+#Load data and define the data processor parameters
+data = fetch_data('adult')
+num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
+cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
+ 'native-country', 'target']
+
+#Defining the training parameters
+noise_dim = 128
+dim = 128
+batch_size = 500
+
+log_step = 100
+epochs = 500+1
+learning_rate = [5e-4, 3e-3]
+beta_1 = 0.5
+beta_2 = 0.9
+models_dir = '../cache'
+
+gan_args = ModelParameters(batch_size=batch_size,
+ lr=learning_rate,
+ betas=(beta_1, beta_2),
+ noise_dim=noise_dim,
+ layers_dim=dim)
+
+train_args = TrainParameters(epochs=epochs,
+ sample_interval=log_step)
+
+synth = RegularSynthesizer(modelname='wgangp', model_parameters=gan_args, n_critic=2)
+synth.fit(data, train_args, num_cols, cat_cols)
+
+synth.save('adult_wgangp_model.pkl')
+
+#########################################################
+# Loading and sampling from a trained synthesizer #
+#########################################################
+synth = RegularSynthesizer.load('adult_wgangp_model.pkl')
+synth_data = synth.sample(1000)
+
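The gradient penalty term mentioned for WGAN-GP can be illustrated in plain NumPy. The sketch below is not how ydata-synthetic implements it: it assumes a tiny hand-written critic `f(x) = v . tanh(W x)` whose input gradient is known analytically, and the names `critic`, `critic_input_grad`, `gradient_penalty` and the weight `gp_lambda=10.0` are illustrative choices. The penalty is evaluated on random interpolations between real and generated rows and pushes the critic's gradient norm towards 1.

```python
import numpy as np


def critic(x, W, v):
    """Tiny critic f(x) = v . tanh(W x), evaluated row-wise on a batch."""
    return np.tanh(x @ W.T) @ v


def critic_input_grad(x, W, v):
    """Analytic gradient of the critic with respect to each input row."""
    h = np.tanh(x @ W.T)              # shape: (batch, hidden)
    return ((1.0 - h ** 2) * v) @ W   # shape: (batch, features)


def gradient_penalty(real, fake, W, v, rng, gp_lambda=10.0):
    """Penalty on interpolated points: lambda * mean((||grad|| - 1)^2)."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake   # random points between real and fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, W, v), axis=1)
    return gp_lambda * float(np.mean((grad_norm - 1.0) ** 2))


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))      # hypothetical critic weights
v = rng.normal(size=8)
real = rng.normal(size=(16, 3))  # stand-in for a batch of real rows
fake = rng.normal(size=(16, 3))  # stand-in for a batch of generated rows

penalty = gradient_penalty(real, fake, W, v, rng)
print(penalty)
```

In an actual training loop the gradient would come from automatic differentiation, and the penalty would be added to the critic loss at every critic update.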
Outdated
+Note that this example won't work with the latest version of ydata-synthetic. Please check ydata-sdk to see how to generate synthetic time-series data.
Using DoppelGANger to generate synthetic time-series data:
+Although tabular data may be the most frequently discussed type of data, a great number of real-world domains — from traffic and daily trajectories to stock prices and energy consumption patterns — produce time-series data which introduces several aspects of complexity to synthetic data generation.
+Time-series data is structured sequentially, with observations ordered chronologically based on their associated timestamps or time intervals. It explicitly incorporates the temporal aspect, allowing for the analysis of trends, seasonality, and other dependencies over time.
+DoppelGANger is a model that uses a Generative Adversarial Network (GAN) framework to generate synthetic time series data by learning the underlying temporal dependencies and characteristics of the original data:
+Here’s an example of how to synthesize time-series data with DoppelGANger using the Measuring Broadband America dataset:
+"""
+ DoppelGANger architecture example file
+"""
+
+# Importing necessary libraries
+import pandas as pd
+from os import path
+import matplotlib.pyplot as plt
+from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+
+# Read the data
+mba_data = pd.read_csv("../../data/fcc_mba.csv")
+numerical_cols = ["traffic_byte_counter", "ping_loss_rate"]
+categorical_cols = [col for col in mba_data.columns if col not in numerical_cols]
+
+# Define model parameters
+model_args = ModelParameters(batch_size=100,
+ lr=0.001,
+ betas=(0.2, 0.9),
+ latent_dim=20,
+ gp_lambda=2,
+ pac=1)
+
+train_args = TrainParameters(epochs=400, sequence_length=56,
+ sample_length=8, rounds=1,
+ measurement_cols=["traffic_byte_counter", "ping_loss_rate"])
+
+# Training the DoppelGANger synthesizer
+if path.exists('doppelganger_mba'):
+    model_dop_gan = TimeSeriesSynthesizer.load('doppelganger_mba')
+else:
+    model_dop_gan = TimeSeriesSynthesizer(modelname='doppelganger', model_parameters=model_args)
+    model_dop_gan.fit(mba_data, train_args, num_cols=numerical_cols, cat_cols=categorical_cols)
+    model_dop_gan.save('doppelganger_mba')
+
+# Generate synthetic data
+synth_data = model_dop_gan.sample(n_samples=600)
+synth_df = pd.concat(synth_data, axis=0)
+
+# Create a plot for each measurement column
+plt.figure(figsize=(10, 6))
+
+plt.subplot(2, 1, 1)
+plt.plot(mba_data['traffic_byte_counter'].reset_index(drop=True), label='Real Traffic')
+plt.plot(synth_df['traffic_byte_counter'].reset_index(drop=True), label='Synthetic Traffic', alpha=0.7)
+plt.xlabel('Index')
+plt.ylabel('Value')
+plt.title('Traffic Comparison')
+plt.legend()
+plt.grid(True)
+
+plt.subplot(2, 1, 2)
+plt.plot(mba_data['ping_loss_rate'].reset_index(drop=True), label='Real Ping')
+plt.plot(synth_df['ping_loss_rate'].reset_index(drop=True), label='Synthetic Ping', alpha=0.7)
+plt.xlabel('Index')
+plt.ylabel('Value')
+plt.title('Ping Comparison')
+plt.legend()
+plt.grid(True)
+
+plt.tight_layout()
+plt.show()
+
Outdated
+Note that this example won't work with the latest version of ydata-synthetic. Please check ydata-sdk to see how to generate synthetic time-series data.
YData Fabric offers advanced capabilities for time-series synthetic data generation, surpassing TimeGAN in terms of flexibility, scalability, and ease of use. With YData Fabric, users can generate high-quality synthetic time-series data while benefiting from built-in data profiling tools that ensure the integrity and consistency of the data. Unlike TimeGAN, which is a single model for time-series, YData Fabric offers a solution that is suitable for different types of datasets and behaviours. Additionally, YData Fabric is designed for scalability, enabling seamless handling of large, complex time-series datasets. Its guided UI makes it easy to adapt to different time-series scenarios, from healthcare to financial data, making it a more comprehensive and flexible solution for time-series data generation.
+For more on YData Fabric vs Synthetic data generation with TimeGAN read this blogpost.
+Although tabular data may be the most frequently discussed type of data, a great number of real-world domains — from traffic and daily trajectories to stock prices and energy consumption patterns — produce time-series data which introduces several aspects of complexity to synthetic data generation.
+Time-series data is structured sequentially, with observations ordered chronologically based on their associated timestamps or time intervals. It explicitly incorporates the temporal aspect, allowing for the analysis of trends, seasonality, and other dependencies over time.
+TimeGAN is a model that uses a Generative Adversarial Network (GAN) framework to generate synthetic time series data by learning the underlying temporal dependencies and characteristics of the original data:
+Here’s an example of how to synthesize time-series data with TimeGAN using the Yahoo Stock Price dataset:
+"""
+ TimeGAN architecture example file
+"""
+
+# Importing necessary libraries
+from os import path
+from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
+from ydata_synthetic.preprocessing.timeseries import processed_stock
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+
+# Define model parameters
+gan_args = ModelParameters(batch_size=128,
+ lr=5e-4,
+ noise_dim=32,
+ layers_dim=128,
+ latent_dim=24,
+ gamma=1)
+
+train_args = TrainParameters(epochs=50000,
+ sequence_length=24,
+ number_sequences=6)
+
+# Read the data
+stock_data = pd.read_csv("../../data/stock_data.csv")
+cols = list(stock_data.columns)
+
+# Training the TimeGAN synthesizer
+if path.exists('synthesizer_stock.pkl'):
+ synth = TimeSeriesSynthesizer.load('synthesizer_stock.pkl')
+else:
+ synth = TimeSeriesSynthesizer(modelname='timegan', model_parameters=gan_args)
+ synth.fit(stock_data, train_args, num_cols=cols)
+ synth.save('synthesizer_stock.pkl')
+
+# Generating new synthetic samples
+stock_data_blocks = processed_stock(path='../../data/stock_data.csv', seq_len=24)
+synth_data = synth.sample(n_samples=len(stock_data_blocks))
+print(synth_data[0].shape)
+
+# Plotting some generated samples. Both synthetic and original data are still standardized, with values in [0, 1]
+fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 10))
+axes=axes.flatten()
+
+time = list(range(1,25))
+obs = np.random.randint(len(stock_data_blocks))
+
+for j, col in enumerate(cols):
+    df = pd.DataFrame({'Real': stock_data_blocks[obs][:, j],
+                       'Synthetic': synth_data[obs].iloc[:, j]})
+    # secondary_y must match the column name defined above ('Synthetic')
+    df.plot(ax=axes[j],
+            title = col,
+            secondary_y='Synthetic', style=['-', '--'])
+fig.tight_layout()
+
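The example above works with fixed-length sequences (`sequence_length=24`, `seq_len=24`). A common way to turn one long multivariate series into such sequences is a sliding window. The sketch below shows the idea in NumPy. How `processed_stock` builds its blocks is not shown here, so treat `sliding_windows`, the stride of one step, and the toy 30-row series as assumptions rather than the library's behaviour.

```python
import numpy as np


def sliding_windows(values, seq_len):
    """Cut a (time, features) array into overlapping windows of `seq_len` steps."""
    values = np.asarray(values)
    n_windows = len(values) - seq_len + 1
    if n_windows <= 0:
        raise ValueError("series is shorter than seq_len")
    # Each window starts one time step after the previous one
    return np.stack([values[i:i + seq_len] for i in range(n_windows)])


# Toy series: 30 time steps, 6 features (assumed shape for illustration)
series = np.arange(30 * 6, dtype=float).reshape(30, 6)
windows = sliding_windows(series, seq_len=24)
print(windows.shape)
```

Each window keeps the chronological order of its rows, so the temporal dependencies inside a sequence are preserved even though the windows themselves overlap.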
YData Fabric provides a robust, guided user interface (UI) specifically designed to streamline synthetic data generation. This interface is tailored to support users at every level, ensuring that both novice users and experienced data scientists can efficiently generate synthetic datasets while adhering to best practices.
+The YData Fabric UI organizes the synthetic data generation process into a structured, step-by-step workflow. Each stage of the process is clearly defined and supported by guidance within the interface, helping users navigate tasks like data profiling, metadata and synthesizer configuration, and synthetic data quality evaluation.
+YData Fabric’s Community Version offers users a free, accessible entry point to explore synthetic data generation. To get started, users can sign up for the Community Version and access the guided UI directly. Once registered, users are provided with a range of features, including data profiling, synthetic data generation, pipelines and access to YData’s proprietary models for data quality!