New feature support for DataFrameConnector, NormalizedFrequencyEncoder & NormalizedLabelEncoder; CTGAN Optimization and Performance Enhancements. #247

cyantangerine · 2024-11-26T14:27:36Z

Description

1. Performance improvement: Disk_cache is faster.
   With pd.concat, concatenating iteratively is slower than concatenating once, because every call has to recompute the index.
2. New feature: DataFrameConnector lets smaller datasets be loaded entirely in memory, avoiding the performance loss of storing them on disk. It also reduces disk wear when handling large numbers of small datasets.
   DataFrameConnector is a proxy that connects directly to a DataFrame.
   2.1 The NoCache logic of DataLoader is updated to match DataFrameConnector, which removes the time spent handling disk files.
   2.2 NDArrayLoader now supports choosing whether save_to_file is allowed.
3. New feature: NormalizedFrequencyEncoder and NormalizedLabelEncoder encode categorical data as a single float, which significantly reduces the training dimensions, training time, and training memory for large datasets. Data quality, as evaluated with SDV, shows only a minimal reduction.
   For NormalizedLabelEncoder, fit takes the sorted unique values and builds a map between keys uniformly distributed in [-1, 1] and the values. transform maps each value to its key. reverse_transform finds the nearest key and maps it back to its value.
   3.1 Metadata is updated to support specifying encoders for table columns. It also supports checking the number of distinct values against categorical_threshold and automatically selecting encoders for table columns.
   The encoder types are currently 'label' and 'onehot'. If a column's unique count exceeds the threshold, or its encoder has been specified as 'label', the CTGAN model uses NormalizedLabelEncoder to transform the column.
   3.2 A linear activation function is added to CTGAN to accommodate label encoding.
4. Code structure optimization: A BatchedSynthesizerModel class is provided to support models that generate in batches. batch_size in TVAE is renamed to match CTGAN's _batch_size.
5. Generation performance optimization: CTGAN.sample accepts a drop_more parameter so that it can provide more generated rows to the Synthesizer's sample in one call, reducing the waste caused by over-generation. To keep the computed over-generation amount from being too small, the maximum of that amount and batch_size is taken, using BatchedSynthesizer (description 4).
6. New feature: When loading a model from disk, the device can now be specified instead of defaulting to CUDA when available.
   **kwargs is added to the load function so that model_kwargs (an existing argument) can be used in model.__init__. The device parameter is included.
7. Fix: the numeric inspector raised an error with int32 or float32 data.
   Types are now checked with the pandas type-checking API instead of by testing whether the type is in an array.
8. Performance: DatetimeFormatter now uses the mean instead of zero for timestamps.
   Previously, a value that failed to format was replaced with zero. Now it is replaced with NaN, and once all values have been replaced, NaN and NULL entries are filled with the mean of all values.
9. Performance: When DatetimeFormatter is used, successfully formatted datetime columns are removed from metadata.discrete_columns and their type is changed to float.
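
The pd.concat observation in the Disk_cache item can be illustrated with a minimal sketch. The chunk contents and the helper names `concat_iteratively` and `concat_once` are made up for illustration and are not taken from the sdgx code; both helpers return the same frame, but the second one builds the result and its index in a single call.

```python
import pandas as pd


def concat_iteratively(chunks):
    """Grow the result one chunk at a time; every step builds a new frame and index."""
    result = chunks[0]
    for chunk in chunks[1:]:
        result = pd.concat([result, chunk], ignore_index=True)
    return result


def concat_once(chunks):
    """Collect all chunks first, then concatenate them in a single call."""
    return pd.concat(chunks, ignore_index=True)


# Three small hypothetical chunks holding the values 0..8.
chunks = [pd.DataFrame({"a": range(i, i + 3)}) for i in range(0, 9, 3)]
slow = concat_iteratively(chunks)
fast = concat_once(chunks)
```

With many chunks, the one-shot form avoids rebuilding the accumulated frame on every iteration.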
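
The NormalizedLabelEncoder behaviour described above (fit on sorted unique values, transform a value to its key, reverse_transform through the nearest key) could look roughly like the sketch below. It is not the sdgx implementation: the class name `SketchNormalizedLabelEncoder` is hypothetical, and it reads "uniform distribution" as evenly spaced keys over [-1, 1], which is an assumption.

```python
class SketchNormalizedLabelEncoder:
    """Illustrative only: maps sorted unique categories to evenly spaced keys in [-1, 1]."""

    def fit(self, values):
        self.categories = sorted(set(values))
        n = len(self.categories)
        # Evenly spaced keys over [-1, 1]; a single category maps to 0.0 (assumption).
        self.keys = [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
        self.value_to_key = dict(zip(self.categories, self.keys))
        return self

    def transform(self, values):
        # Each category becomes one float instead of a one-hot vector.
        return [self.value_to_key[v] for v in values]

    def reverse_transform(self, encoded):
        # Snap each float to the nearest key, then return that key's category.
        def nearest(x):
            idx = min(range(len(self.keys)), key=lambda i: abs(self.keys[i] - x))
            return self.categories[idx]

        return [nearest(x) for x in encoded]


encoder = SketchNormalizedLabelEncoder().fit(["red", "green", "blue", "green"])
encoded = encoder.transform(["blue", "green", "red"])
decoded = encoder.reverse_transform([-0.9, 0.1, 0.8])
```

Because reverse_transform snaps to the nearest key, a generated float that does not land exactly on a key still decodes to a valid category.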
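
For the int32/float32 fix, a dtype check through the pandas type-checking API might look like the sketch below. The helper names `is_int_column` and `is_float_column` and the sample frame are hypothetical; only `pd.api.types.is_integer_dtype` and `pd.api.types.is_float_dtype` are real pandas functions, and how the sdgx numeric inspector calls them is not shown in this PR description.

```python
import numpy as np
import pandas as pd


def is_int_column(series):
    """True for any integer dtype (for example int32 or int64)."""
    return pd.api.types.is_integer_dtype(series)


def is_float_column(series):
    """True for any floating dtype, including float32."""
    return pd.api.types.is_float_dtype(series)


# Hypothetical frame with narrower numeric dtypes than the pandas defaults.
df = pd.DataFrame({
    "i32": np.array([1, 2, 3], dtype=np.int32),
    "f32": np.array([0.5, 1.5, 2.5], dtype=np.float32),
    "txt": ["a", "b", "c"],
})
int_columns = [c for c in df.columns if is_int_column(df[c])]
float_columns = [c for c in df.columns if is_float_column(df[c])]
```

Checking by dtype family rather than against a fixed list of types means narrower widths such as int32 and float32 are recognised without being enumerated.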
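
The DatetimeFormatter change (NaN for failed values, then a mean fill) can be sketched with plain pandas as follows. The function name `to_timestamp_with_mean_fill` and the sample data are hypothetical, and this is an assumption about the general approach rather than the formatter's actual code.

```python
import pandas as pd


def to_timestamp_with_mean_fill(column, fmt):
    """Parse strings to Unix seconds; unparseable or missing entries get the column mean."""
    # errors="coerce" turns values that fail to parse into NaT instead of raising.
    parsed = pd.to_datetime(column, format=fmt, errors="coerce")
    # Convert to float seconds since the epoch; NaT becomes NaN.
    seconds = (parsed - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
    # Series.mean() skips NaN, so the fill value comes from the valid entries only.
    return seconds.fillna(seconds.mean())


dates = pd.Series(["2024-01-01", "not a date", None, "2024-01-03"])
timestamps = to_timestamp_with_mean_fill(dates, "%Y-%m-%d")
```

Filling with the mean keeps failed values inside the range of the valid timestamps, whereas a zero fill would place them at the Unix epoch.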

Motivation and Context

Description 2 solves the problem mentioned in #246.
Description 3 partially solves the problem mentioned in #77.

How has this been tested?

Some tests have been added.

Types of changes

Maintenance (no change in code, maintain the project's CI, docs, etc.)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

Wh1isper · 2024-11-29T07:14:54Z

Thank you for your hard work on this! This is a large Pull Request, and it would be much easier to review if it were split into several smaller PRs. Specifically, combining performance improvements with new features may lead us to spend more time discussing the implementation of the new features, which could delay the merging of the performance improvements.

I may find some time this weekend to review it!

codecov-commenter · 2024-11-30T04:44:53Z

Codecov Report

Attention: Patch coverage is 93.20755% with 18 lines in your changes missing coverage. Please review.

Project coverage is 83.82%. Comparing base (00685a3) to head (d1263e9).
Report is 22 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sdgx/data_models/metadata.py | 91.35% | 7 Missing ⚠️ |
| sdgx/data_processors/formatters/datetime.py | 68.75% | 5 Missing ⚠️ |
| sdgx/models/components/optimize/ndarray_loader.py | 91.17% | 3 Missing ⚠️ |
| sdgx/data_connectors/dataframe_connector.py | 96.96% | 1 Missing ⚠️ |
| sdgx/data_processors/formatters/int.py | 66.66% | 1 Missing ⚠️ |
| .../components/optimize/sdv_ctgan/data_transformer.py | 97.67% | 1 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
+ Coverage   82.17%   83.82%   +1.65%     
==========================================
  Files          84       89       +5     
  Lines        4146     4656     +510     
==========================================
+ Hits         3407     3903     +496     
- Misses        739      753      +14

Wh1isper · 2024-11-30T16:44:08Z

I just reviewed this PR and I'm OK with most of the bug fixes and improvements. Thanks, @cyantangerine!

Co-authored-by: Zhongsheng Ji <9573586@qq.com>

cyantangerine · 2024-12-01T10:15:15Z

All the questions raised above have been resolved. Thanks for reviewing, @jalr4ever @Wh1isper.

Wh1isper

LGTM

jalr4ever

🚀 Approved!

cyantangerine added 30 commits September 25, 2024 16:37

1

99f6080

q

f8b070b

1

570f9f0

1

bae8da9

test

245176f

Optional encoder

7cc70bb

Progress notes

03af873

Fix bug, normalization

d98f948

100k

b76ed1b

100k

db2cc72

1ktest

fe3b912

1ktest

69f2bcc

1ktest

5eb9252

1ktest

9dc32e3

test

486bbbf

1

e54a969

1ktest

a97c839

1ktest

31489ad

1

817b4a3

test

3b004fb

1ktest

aecefb5

Rfecv

db01439

Rfecv

7abc859

Rfecv

877afa4

Rfecv

32324c8

1

cc99035

Merge branch 'refs/heads/ref'

2902d63

# Conflicts:
# sdgx/data_processors/formatters/datetime.py
# sdgx/models/components/optimize/sdv_ctgan/data_sampler.py
# sdgx/models/components/optimize/sdv_ctgan/data_transformer.py
# sdgx/models/ml/single_table/ctgan.py

param

9003162

test

ed4f937

test

6dedad7

jalr4ever requested changes Nov 29, 2024

sdgx/models/components/sdv_rdt/transformers/categorical.py

doc

2c7a565

cyantangerine and others added 2 commits November 29, 2024 22:59

fix

b780806

[pre-commit.ci] auto fixes from pre-commit.com hooks

bdc49b9

for more information, see https://pre-commit.ci

cyantangerine requested a review from jalr4ever November 29, 2024 15:04

cyantangerine added 2 commits November 30, 2024 12:28

Update .gitignore

ff4dd28

Merge branch 'main' of https://github.com/cyantangerine/synthetic-dat…

d1263e9

…a-generator