Releases · sdv-dev/SDV

23 Aug 19:44

v1.4.0

2d11113

v1.4.0 - 2023-08-23

This release makes multiple improvements to the metadata. Both the single and multi table metadata classes now have a validate_data method, which checks the data against the current specifications in the metadata. SingleTableMetadata.visualize is also improved: it now shows all key and index information (e.g. sequence key, primary key, sequence index) in one section, so the sequence index appears alongside the sequence key.
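To make the idea of validating data against metadata concrete, here is a minimal standard-library sketch. It is not SDV's implementation: the function name check_rows_against_metadata, the plain-dictionary layout, and the two checks it performs (missing columns, null or repeated primary keys) are assumptions chosen for illustration.

```python
def check_rows_against_metadata(metadata, rows):
    """Return a list of readable problems; an empty list means the rows pass."""
    problems = []
    expected = set(metadata["columns"])
    primary_key = metadata.get("primary_key")
    seen_keys = set()

    for index, row in enumerate(rows):
        # Every column declared in the metadata should be present in the row.
        missing = expected - set(row)
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")

        # A primary key must be present and unique across rows.
        if primary_key is not None:
            key = row.get(primary_key)
            if key is None:
                problems.append(f"row {index}: primary key is null")
            elif key in seen_keys:
                problems.append(f"row {index}: duplicate primary key {key!r}")
            else:
                seen_keys.add(key)
    return problems


# Hypothetical metadata and data, for illustration only.
metadata = {
    "columns": {"user_id": {"sdtype": "id"}, "age": {"sdtype": "numerical"}},
    "primary_key": "user_id",
}
rows = [
    {"user_id": "U1", "age": 31},
    {"user_id": "U1", "age": 45},  # repeats the primary key
    {"user_id": "U2"},             # lacks the 'age' column
]
problems = check_rows_against_metadata(metadata, rows)
for problem in problems:
    print(problem)
```

Collecting every problem before reporting, rather than stopping at the first one, lets all issues be fixed in a single pass.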

The CTGANSynthesizer has been made more efficient in the following ways:

Boolean columns are now skipped during preprocessing, like categorical columns are.
It is possible to apply other transformations to categorical columns and have CTGAN skip the one-hot encoding step.
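The two points above can be pictured as a small planning step that decides, per column, whether a transformer runs before the model sees the data. This is only a sketch under stated assumptions: plan_transformers, MODEL_ENCODED_SDTYPES, and the string labels are hypothetical stand-ins, not SDV or RDT names.

```python
# Assumption: the model encodes these sdtypes itself, so no transformer is assigned.
MODEL_ENCODED_SDTYPES = {"categorical", "boolean"}


def plan_transformers(column_sdtypes, user_transformers=None):
    """Map each column to a transformer label, or None to leave the column untouched."""
    user_transformers = user_transformers or {}
    plan = {}
    for column, sdtype in column_sdtypes.items():
        if column in user_transformers:
            # A transformer chosen by the user takes precedence.
            plan[column] = user_transformers[column]
        elif sdtype in MODEL_ENCODED_SDTYPES:
            # Skipped during preprocessing; the model handles the encoding.
            plan[column] = None
        else:
            plan[column] = "default_transformer"  # placeholder label
    return plan


column_sdtypes = {"has_rewards": "boolean", "room_type": "categorical", "amount": "numerical"}
default_plan = plan_transformers(column_sdtypes)
custom_plan = plan_transformers(column_sdtypes, {"room_type": "label_encoder"})
print(default_plan)
print(custom_plan)
```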

Additionally, columns labeled with the sdtype id now go through the IDGenerator transformer by default, and constraint transformations that were being overwritten during sampling are now respected.
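As a rough picture of what generating ID columns can look like, the sketch below produces sequential, unique identifiers. The class SequentialIDGenerator and its prefix, starting_value, and suffix parameters are assumptions made for this illustration; it does not reproduce the IDGenerator transformer itself.

```python
import itertools


class SequentialIDGenerator:
    """Produce unique IDs of the form <prefix><counter><suffix>."""

    def __init__(self, prefix="", starting_value=0, suffix=""):
        self._prefix = prefix
        self._suffix = suffix
        self._counter = itertools.count(starting_value)

    def generate(self, num_rows):
        # The counter keeps advancing across calls, so IDs never repeat.
        return [
            f"{self._prefix}{next(self._counter)}{self._suffix}"
            for _ in range(num_rows)
        ]


generator = SequentialIDGenerator(prefix="ID_", starting_value=100)
first_batch = generator.generate(3)
second_batch = generator.generate(2)
print(first_batch)
print(second_batch)
```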

New Features

Add validate_data method to Metadata - Issue #1518 by @fealho
Use IDGenerator for ID columns - Issue #1519 by @frances-h
Metadata visualization for sequential data: Only create 2 sections - Issue #1543 by @frances-h

Bugs Fixed

Inefficient CTGAN modeling when adding categorical transformers - Issue #1450 by @fealho
CTGANSynthesizer is assigning LabelEncoder to boolean columns (instead of None) - Issue #1530 by @fealho
Metadata visualization for sequential data: Missing sequence index - Issue #1542 by @frances-h
Constraint outputs are being overwritten in DataProcessor.reverse_transform - Issue #1551 by @amontanez24

Contributors

amontanez24, frances-h, and fealho

14 Aug 17:52

amontanez24

v1.3.0

57a5e29

v1.3.0 - 2023-08-14

This release adds two new methods to the MultiTableMetadata: detect_from_csvs and detect_from_dataframes. These methods allow you to detect metadata for a whole dataset at once, either by loading the tables from a folder of CSV files or by passing a dictionary that maps table names to pandas.DataFrames. The SingleTableMetadata can now be visualized! Additionally, there is now a summarized option in the show_table_details parameter of the visualize methods. This will print each sdtype in the table and the number of columns that have that sdtype.
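The summarized view boils down to counting columns per sdtype. A minimal sketch of that count, assuming the table metadata is available as a plain dictionary with a "columns" entry (the helper name summarize_sdtypes and the sample columns are hypothetical):

```python
from collections import Counter


def summarize_sdtypes(table_metadata):
    """Count how many columns carry each sdtype."""
    return Counter(column["sdtype"] for column in table_metadata["columns"].values())


# Hypothetical table metadata, for illustration only.
table_metadata = {
    "columns": {
        "guest_id": {"sdtype": "id"},
        "room_type": {"sdtype": "categorical"},
        "has_rewards": {"sdtype": "categorical"},
        "amenities_fee": {"sdtype": "numerical"},
    }
}
summary = summarize_sdtypes(table_metadata)
for sdtype, count in sorted(summary.items()):
    print(f"{sdtype}: {count}")
```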

Additionally, this release patches a bug that prevented custom constraints from working on columns that were primary or alternate keys. It also adds support for Python 3.11!

New Features

Align default transformers between SDV and RDT - Issue #1484 by @R-Palazzo
Add visualize method to SingleTableMetadata - Issue #1517 by @pvk-developer
Add detect_from_csvs and detect_from_dataframes methods to MultiTableMetadata - Issue #1520 by @R-Palazzo
Allow empty tables to be fitted using fit_processed_data - Issue #1524 by @fealho
Summarized metadata visualization - Issue #1525 by @pvk-developer

Bugs Fixed

Cannot use custom constraint transforms for certain columns (inconsistent ordering in forward vs. reverse) - Issue #1476 by @fealho
Cannot create custom constraint with primary key - Issue #1514 by @amontanez24

Maintenance

Add support for Python 3.11 - Issue #1459 by @fealho

Contributors

amontanez24, fealho, and 2 other contributors

13 Jul 20:01

amontanez24

v1.2.1

47e9dcf

v1.2.1 - 2023-07-13

This release fixes a bug that caused the Inequality constraint and others to fail if there were None values in a datetime column.
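In plain Python, comparing None with a datetime raises a TypeError, which is the kind of failure such a fix has to guard against. The sketch below shows one way to make the check tolerant of missing values. It assumes that a pair with a missing endpoint counts as valid; the helper inequality_holds is hypothetical and is not the constraint's actual code.

```python
from datetime import datetime


def inequality_holds(low, high):
    """Check low <= high, treating a pair with a missing value as valid."""
    if low is None or high is None:
        # Assumption: nothing can be violated when one side is missing.
        return True
    return low <= high


pairs = [
    (datetime(2023, 7, 1), datetime(2023, 7, 13)),   # ordered correctly
    (None, datetime(2023, 7, 13)),                   # missing low value
    (datetime(2023, 7, 13), datetime(2023, 7, 1)),   # out of order
]
results = [inequality_holds(low, high) for low, high in pairs]
print(results)
```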

Bugs Fixed

Inequality fails with None and datetime - Issue #1471 by @pvk-developer

Maintenance

Drop support for Python 3.7 - Issue #1487 by @pvk-developer

Internal

Make HMA use hierarchical sampling mixin - Issue #1428 by @frances-h
Move progress bar out of base multi table synthesizer - Issue #1486 by @R-Palazzo

Contributors

frances-h, pvk-developer, and R-Palazzo

07 Jun 20:48

amontanez24

v1.2.0

0665bf8

v1.2.0 - 2023-06-07

This release adds a parameter called verbose to the HMASynthesizer's initialization. Setting it to True will show progress bars during the fitting steps. Additionally, performance optimizations were made to the modeling and initialization of the HMASynthesizer.

Multiple changes were made to enhance constraints. The Range constraint was improved to be able to generate more accurate data when null values are provided. Constraints are also now validated against the data when running validate() on any synthesizer.

Finally, some warnings were resolved.

New Features

Report fitting progress for the HMASynthesizer - Issue #1440 by @pvk-developer

Bugs Fixed

Range constraint does not produce cases of missing values & may create violative data - Issue #1393 by @R-Palazzo
Synthesizers don't validate constraints during validate() - Issue #1402 by @pvk-developer
Confusing error during metadata validation - Issue #1417 by @frances-h
SettingWithCopyWarning when conditional sampling - #1436 by @pvk-developer
HMASynthesizer is modeling child tables - Issue #1442 by @pvk-developer
ValueError when sampling PII columns - Issue #1445 by @pvk-developer

Internal

Add BaseHierarchicalSampler Mixin - Issue #1394 by @frances-h
Add BaseIndependentSampler Mixin - Issue #1395 by @frances-h
Synthesizers created twice during HMA init - Issue #1418 by @frances-h
Get rid of unnecessary methods for single table sampling - Issue #1430 by @amontanez24
Detect all addons from top level init - PR #1453 by @frances-h

Maintenance

Upgrade to torch 2.0 - Issue #1365 by @fealho
During fit, there is a FutureWarning (due to RDT 1.5.0) - Issue #1456 by @amontanez24

Contributors

amontanez24, frances-h, and 3 other contributors

10 May 21:12

amontanez24

v1.1.0

280317e

v1.1.0 - 2023-05-10

This release adds a new initialization parameter to synthesizers called locales that allows users to set the locales to use for all columns that have a locale-based sdtype (e.g. address or phone_number). Additionally, it adds support for Pandas 2.0!

Multiple enhancements were made to improve the performance of data and metadata validation in synthesizers. The Inequality constraint was improved to be able to generate more scenarios of data concerning the presence of NaNs. Finally, many warnings have been resolved.

New Features

Add add-on detection for new constraints - Issue #1397 by @frances-h
Add add-on detection for multi and single table synthesizers - Issue #1385 by @frances-h
Setting a locale for all my anonymized (PII) columns - Issue #1371 by @frances-h

Bugs Fixed

Skip checking for Faker function if its a default sdtype - PR #1410 by @pvk-developer
Inequality constraint does not produce all possibilities of missing values - Issue #1392 by @R-Palazzo
Deprecated locale warning - Issue #1400 by @frances-h

Maintenance

Use cached Faker instance to discover if an sdtype is a Faker function. - Issue #1405 by @pvk-developer
Upgrade to pandas 2.0 - Issue #1366 by @pvk-developer

Internal

Refactor Multi Table Modeling - Issue #1387 by @pvk-developer
PytestConfigWarning: Unknown config option: collect_ignore - Issue #1376 by @amontanez24
Pandas FutureWarning: Could not cast to int64 - Issue #1357 by @R-Palazzo
RuntimeWarning: invalid value encountered in cast. - Issue #1369 by @amontanez24

Contributors

amontanez24, frances-h, and 2 other contributors

20 Apr 19:04

amontanez24

v1.0.1

7fd3ad4

v1.0.1 - 2023-04-20

This release improves the load_custom_constraint_classes method by removing the table_name parameter and loading the constraint for all tables instead. It also improves some error messages and removes some of the warnings that have been surfacing.

Support for sdtypes is enhanced by resolving a bug that was incorrectly specifying Faker functions for some of them. Support for datetime formats has also been improved. Finally, the path argument in some save and load methods was changed to filepath for consistency.
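For readers unfamiliar with datetime format validation, the check amounts to asking whether a value can be parsed with the declared format string. A minimal standard-library sketch follows; the helper matches_datetime_format is hypothetical and is not how SDV performs the check internally.

```python
from datetime import datetime


def matches_datetime_format(value, datetime_format):
    """Return True if the string parses with the given strftime-style format."""
    try:
        datetime.strptime(value, datetime_format)
    except ValueError:
        return False
    return True


checks = [
    matches_datetime_format("2023-04-20", "%Y-%m-%d"),
    matches_datetime_format("20/04/2023", "%Y-%m-%d"),
    matches_datetime_format("04/20/2023 19:04", "%m/%d/%Y %H:%M"),
]
print(checks)
```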

New Features

Method load_custom_constraint_classes does not need table_name parameter - Issue #1354 by @R-Palazzo
Improve error message for invalid primary keys - Issue #1341 by @R-Palazzo
Add functionality to find version add-on - Issue #1309 by @frances-h

Bugs Fixed

Certain sdtypes cause Faker to raise error - Issue #1346 by @frances-h
Change path to filepath for load and save methods - Issue #1352 by @fealho
Some datetime formats cause InvalidDataError, even if the datetime matches the format - Issue #1136 by @amontanez24

Internal

Inequality constraint raises RuntimeWarning (invalid value encountered in log) - Issue #1275 by @frances-h
Pandas FutureWarning: Default dtype for Empty Series will be 'object' - Issue #1355 by @R-Palazzo
Pandas FutureWarning: Length 1 tuple will be returned - Issue #1356 by @R-Palazzo

Contributors

amontanez24, frances-h, and 2 other contributors

28 Mar 20:39

amontanez24

v1.0.0

9d61166

v1.0.0 - 2023-03-28

This is a major release that introduces a new API to the SDV aimed at streamlining the process of synthetic data generation! To achieve this, this release includes the addition of several large features.

Metadata

Some of the most notable additions are the new SingleTableMetadata and MultiTableMetadata classes. These classes enable a number of features that make it easier to synthesize your data correctly such as:

Automatic data detection - Calling metadata.detect_from_dataframe() or metadata.detect_from_csv() will automatically populate the metadata with the values it infers from the data.
Easy updating - Once an instance of the metadata is created, values can be easily updated using a number of methods defined in the API. For more info, view the docs.
Metadata validation - Calling metadata.validate() will return a report of any invalid definitions in the metadata specification.
Upgrading - Users with the previous metadata format can easily update to the new specification using the upgrade_metadata() method.
Saving and loading - The metadata itself can easily be saved to a json file and loaded back up later.
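As an illustration of the saving and loading point, the sketch below round-trips a metadata-like dictionary through a JSON file using only the standard library. The dictionary contents, including the "SINGLE_TABLE_V1" value and the column names, are assumptions for this example, and the code deliberately does not call the metadata classes' own save and load methods.

```python
import json
import os
import tempfile

# Hypothetical single table metadata expressed as a plain dictionary.
metadata = {
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "primary_key": "guest_email",
    "columns": {
        "guest_email": {"sdtype": "email", "pii": True},
        "room_type": {"sdtype": "categorical"},
        "checkin_date": {"sdtype": "datetime", "datetime_format": "%d %b %Y"},
    },
}

with tempfile.TemporaryDirectory() as folder:
    filepath = os.path.join(folder, "metadata.json")

    # Save the specification to disk.
    with open(filepath, "w") as handle:
        json.dump(metadata, handle, indent=4)

    # Load it back later.
    with open(filepath) as handle:
        restored = json.load(handle)

print(restored == metadata)
```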

Class and Module Names

Another major change is the renaming of our core modeling classes and modules. The name changes are meant to highlight the difference between the underlying machine learning models, and the objects responsible for the end-to-end workflow of generating synthetic data. The main name changes are as follows:

tabular -> single_table
relational -> multi_table
timeseries -> sequential
BaseTabularModel -> BaseSingleTableSynthesizer
GaussianCopula -> GaussianCopulaSynthesizer
CTGAN -> CTGANSynthesizer
TVAE -> TVAESynthesizer
CopulaGAN -> CopulaGANSynthesizer
PAR -> PARSynthesizer
HMA1 -> HMASynthesizer

In SDV 1.0, synthesizers are classes that take in metadata and handle data preprocessing, model training and model sampling. This is similar to the previous BaseTabularModel in SDV <1.0.
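When migrating older code, the renames can be applied mechanically. The sketch below encodes a subset of the list above as lookup tables and builds the new import line; the helper new_import_line is hypothetical and exists only for this illustration.

```python
# A subset of the module and class renames listed above.
MODULE_RENAMES = {
    "tabular": "single_table",
    "relational": "multi_table",
    "timeseries": "sequential",
}
CLASS_RENAMES = {
    "GaussianCopula": "GaussianCopulaSynthesizer",
    "CTGAN": "CTGANSynthesizer",
    "TVAE": "TVAESynthesizer",
    "PAR": "PARSynthesizer",
    "HMA1": "HMASynthesizer",
}


def new_import_line(old_module, old_class):
    """Translate an old (module, class) pair into the SDV 1.0 import statement."""
    return f"from sdv.{MODULE_RENAMES[old_module]} import {CLASS_RENAMES[old_class]}"


print(new_import_line("tabular", "CTGAN"))
print(new_import_line("relational", "HMA1"))
print(new_import_line("timeseries", "PAR"))
```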

Synthetic Data Workflow

Synthesizers in SDV 1.0 define a clear workflow for generating synthetic data.

Synthesizers are initialized with a metadata class.
They can then be used to transform the data and apply constraints using the synthesizer.preprocess() method. This step also validates that the data matches the provided metadata to avoid errors in fitting or sampling.
The processed data can then be fed into the underlying machine learning model using synthesizer.fit_processed_data(). (Alternatively, data can be preprocessed and fit to the model using synthesizer.fit().)
Data can then be sampled using synthesizer.sample().

Each synthesizer class also provides a series of methods to help users customize the transformations their data goes through. Read more about that here.

Notice that the preprocessing and model fitting steps can now be separated. This can be helpful if preprocessing is time consuming or if the data has been processed externally.
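The workflow described above can be sketched as a small function. It assumes SDV 1.x is installed, that data is a pandas DataFrame, and that metadata is a SingleTableMetadata describing it. The wrapper synthesize_rows and its fit_separately flag are hypothetical conveniences, and GaussianCopulaSynthesizer is just one possible choice of synthesizer.

```python
def synthesize_rows(data, metadata, num_rows=10, fit_separately=True):
    """Run the SDV 1.0 workflow: initialize, preprocess, fit, then sample."""
    # Imported lazily so that defining this function does not itself require SDV.
    from sdv.single_table import GaussianCopulaSynthesizer

    # Step 1: initialize the synthesizer with the metadata.
    synthesizer = GaussianCopulaSynthesizer(metadata)

    if fit_separately:
        # Steps 2 and 3: preprocess, then fit the model on the processed data.
        processed = synthesizer.preprocess(data)
        synthesizer.fit_processed_data(processed)
    else:
        # Alternative: preprocess and fit in a single call.
        synthesizer.fit(data)

    # Step 4: sample synthetic rows.
    return synthesizer.sample(num_rows=num_rows)
```

Keeping the two fitting paths behind one flag makes it easy to cache the processed data when preprocessing is the slow part.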

Other Highly Requested Features

Another major addition is control over randomization. In SDV <1.0, users could set a seed to control the randomization for only some columns. In SDV 1.0, randomization is controlled for all columns. Every new call to sample generates new data, but the synthesizer's seed can be reset to the original state using synthesizer.reset_sampling(), enabling reproducibility.
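The behavior described here, new data on each call but a resettable starting state, can be illustrated with a toy sampler built on the standard library. ResettableSampler and its reset method are hypothetical names for this sketch; it mimics the idea only and is not how SDV synthesizers manage their random state.

```python
import random


class ResettableSampler:
    """Toy sampler whose random state can be returned to its starting point."""

    def __init__(self, seed=42):
        self._seed = seed
        self._rng = random.Random(seed)

    def sample(self, num_rows):
        # Each call continues from the current random state, so output differs.
        return [self._rng.randrange(10**9) for _ in range(num_rows)]

    def reset(self):
        # Rebuild the generator from the original seed.
        self._rng = random.Random(self._seed)


sampler = ResettableSampler()
first = sampler.sample(5)
second = sampler.sample(5)   # continues the stream, so it differs from `first`
sampler.reset()
replayed = sampler.sample(5)  # same values as `first`
print(first == replayed)
```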

SDV 1.0 adds accessibility and transparency into the transformers used for preprocessing and underlying machine learning models.

Using the synthesizer.get_transformers() method, you can access the transformers used to preprocess each column and view their properties. This can be useful for debugging and accessing privacy information like mappings used to mask data.
Distribution parameters learned by copula models can be accessed using the synthesizer.get_learned_distributions() method.

PII handling is improved by the following features:

Primary keys can be set to natural sdtypes (e.g. SSN, email, name). Previously they could only be numerical or text.
The PseudoAnonymizedFaker can be used to provide consistent mapping to PII columns. As mentioned before, the mapping itself can be accessed by viewing the transformers for the column using synthesizer.get_transformers().
A bug causing PII columns to slow down modeling is patched.
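The consistent mapping idea can be shown with a small sketch: every distinct real value always maps to the same stand-in, and the mapping can be inspected afterwards. The class ConsistentPseudonymizer is hypothetical, and a simple counter stands in for the Faker-generated values that a real anonymizer would produce.

```python
import itertools


class ConsistentPseudonymizer:
    """Replace each distinct value with a stable placeholder."""

    def __init__(self, prefix="person_"):
        self._prefix = prefix
        self._mapping = {}
        self._counter = itertools.count(1)

    def transform(self, values):
        result = []
        for value in values:
            if value not in self._mapping:
                # First time this value is seen: assign a new placeholder.
                self._mapping[value] = f"{self._prefix}{next(self._counter)}"
            result.append(self._mapping[value])
        return result

    def get_mapping(self):
        # Return a copy so callers cannot alter the internal state.
        return dict(self._mapping)


pseudonymizer = ConsistentPseudonymizer()
masked = pseudonymizer.transform(["alice@example.com", "bob@example.com", "alice@example.com"])
mapping = pseudonymizer.get_mapping()
print(masked)
print(mapping)
```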

Finally, the synthetic data can now be easily evaluated using the evaluate_quality() and run_diagnostic() methods. The data can be compared visually to the actual data using the get_column_plot() and get_column_pair_plot() methods. For more info on how to visualize or interpret the synthetic data evaluation, read the docs here.

Issues Resolved

New Features

Change auto_assign_transformers to handle id types - Issue #1325 by @pvk-developer
Change 'text' sdtype to 'id' - Issue #1324 by @frances-h
In upgrade_metadata, return the object instead of writing it to a JSON file - Issue #1319 by @frances-h
In upgrade_metadata index primary keys should be converted to text - Issue #1318 by @amontanez24
Add load_from_dict to SingleTableMetadata and MultiTableMetadata - Issue #1314 by @amontanez24
Throw a SynthesizerInputError if FixedCombinations constraint is applied to a column that is not boolean or categorical - Issue #1306 by @frances-h
Missing save and load methods for HMASynthesizer - Issue #1262 by @amontanez24
Better input validation when creating single and multi table synthesizers - Issue #1242 by @fealho
Better input validation on HMASynthesizer.sample - Issue #1241 by @R-Palazzo
Validate that relationship must be between a primary key and foreign key - Issue #1236 by @fealho
Improve update_column validation for pii attribute - Issue #1226 by @pvk-developer
Order the output of get_transformers() based on the metadata - Issue #1222 by @pvk-developer
Log if any numerical_distributions will not be applied - Issue #1212 by @fealho
Improve error handling for GaussianCopulaSynthesizer: numerical_distributions - Issue #1211 by @fealho
Improve error handling when validating constraints - Issue #1210 by @fealho
Add fake_companies demo - Issue #1209 by @amontanez24
Allow me to create a custom constraint class and use it in the same file - Issue #1205 by @amontanez24
Sampling should reset after retraining the model - Issue #1201 by @pvk-developer
Change function name HMASynthesizer.update_table_parameters --> set_table_parameters - Issue #1200 by @pvk-developer
Add get_info method to synthesizers - Issue #1199 by @fealho
Add evaluation methods to synthesizer - Issue #1190 by @fealho
Update evaluate.py to work with the new metadata - Issue #1186 by @fealho
Remove old code - Issue #1181 by @pvk-developer
Drop support for python 3.6 and add support for 3.10 - Issue #1176 by @fealho
Add constraint methods to MultiTableSynthesizers - Issue #1171 by @fealho
Update custom constraint workflow - Issue #1169 by @pvk-developer
Add get_constraints method to synthesizers - Issue #1168 by @pvk-developer
Migrate adding and validating constraints to BaseSynthesizer - Issue #1163 by @pvk-developer
Change metadata "SCHEMA_VERSION" --> "METADATA_SPEC_VERSION" - Issue #1139 by @amontanez24
Add ability to reset random sampling - Issue #1130 by @pvk-developer
Add get_available_demos - Issue #1129 by @fealho
Add demo loading functionality - Issue #1128 by @fealho
Use logging instead of printing in detect methods - Issue #1107 by @fealho
Add save and load methods to synthesizers - Issue #1106 by @pvk-developer
Add sampling methods to PARSynthesizer - Issue #1083 by @amontanez24
Add transformer methods to PARSynthesizer - Issue #1082 by @fealho
Add validate to PARSynthesizer - Issue #1081 by @amontanez24
Add preprocess and fit methods to PARSynthesizer - I...

Contributors

amontanez24, frances-h, and 3 other contributors

24 Jan 21:08

amontanez24

v0.18.0

b503075

v0.18.0 - 2023-01-24

This release adds support for Python 3.10 and drops support for 3.6.

Maintenance

Drop support for python 3.6 - Issue #1177 by @amontanez24
Support for python 3.10 - Issue #939 by @amontanez24
Support Python >=3.10,<4 - Issue #1000 by @amontanez24

Contributors

amontanez24

08 Dec 23:42

amontanez24

v0.17.2

b35f288

v0.17.2 - 2022-12-08

This release fixes a bug in the demo module related to loading the demo data with constraints. It also adds a name to the demo datasets. Finally, it bumps the version of SDMetrics used.

Maintenance

Upgrade SDMetrics requirement to 0.8.0 - Issue #1125 by @katxiao

New Features

Provide a name for the default demo datasets - Issue #1124 by @amontanez24

Bugs Fixed

Cannot load_tabular_demo with metadata - Issue #1123 by @amontanez24

Contributors

katxiao and amontanez24

29 Sep 23:07

amontanez24

v0.17.1

b215cc1

v0.17.1 - 2022-09-29

This release bumps the dependency requirements to use the latest version of SDMetrics.

Maintenance

Patch release: Bump required version for SDMetrics - Issue #1010 by @katxiao

Contributors

katxiao
