New feature support for DataFrameConnector, NormalizedFrequencyEncoder & NormalizedLabelEncoder; CTGAN Optimization and Performance Enhancements. #247

Merged
merged 67 commits into from
Dec 2, 2024

Conversation

cyantangerine
Contributor

@cyantangerine cyantangerine commented Nov 26, 2024

Description

  1. Performance improvement: Speed up Disk_cache.
    With pd.concat, concatenating chunks iteratively is slower than concatenating them all at once, because each concatenation has to recalculate the index.
  2. New feature: The DataFrameConnector lets smaller datasets be loaded entirely in memory, avoiding the performance loss of storing them on disk. It can also reduce disk wear when handling large numbers of small datasets.
    The DataFrameConnector is a proxy that connects directly to a DataFrame.
    2.1 The NoCache logic of DataLoader is updated to work with DataFrameConnector, removing the time spent processing disk files.
    2.2 NDArrayLoader now supports choosing whether save_to_file is allowed.
  3. New feature: NormalizedFrequencyEncoder and NormalizedLabelEncoder encode categorical data as a single float. This can significantly reduce the number of training dimensions, greatly improve training speed, and significantly reduce training memory for large datasets. Data quality, as evaluated with SDV, shows only a minimal reduction.
    For NormalizedLabelEncoder: fit takes the sorted unique values and builds a map between keys uniformly distributed over [-1, 1] and the values. transform maps each value to its key. reverse_transform finds the nearest key and maps it back to its value.
    3.1 Metadata is updated to support specifying encoders for table columns. It also supports checking the number of distinct values against categorical_threshold and automatically selecting encoders for table columns.
    The encoder types are currently 'label' and 'onehot'. If a column's unique count is greater than the threshold, or its encoder has been specified as 'label', the CTGAN model chooses NormalizedLabelEncoder to transform the column.
    3.2 A linear activation function is added to CTGAN to accommodate label encoding.
  4. Code structure optimization: Provide a BatchedSynthesizerModel class to support models that generate in batches. Rename batch_size in TVAE to match CTGAN's _batch_size.
  5. Generation performance optimization: CTGAN.sample accepts a drop_more parameter so that it can provide more generated rows to the Synthesizer's sample in a single call, reducing the waste caused by excessive generation. To keep the computed number of extra generated rows from being too small, the maximum with batch_size is taken, using BatchedSynthesizer (see item 4).
  6. New feature: Loading a model from disk now allows the device to be specified, instead of defaulting to CUDA when available.
    This is done by adding **kwargs to the load function so that model_kwargs (an existing argument) can be used in model.init. The device parameter is included.
  7. Fix: Numeric inspector error when using int32 or float32.
    Types are now checked with the pandas type-checking API instead of by checking the type within the array.
  8. Performance: DatetimeFormatter now uses the mean instead of zero for timestamps.
    Previously, if the formatter hit an error, it replaced the value with zero. Now the value is replaced with NaN, and after all values have been processed, NaN and NULL entries are filled with the mean of all values.
  9. Performance: When DatetimeFormatter is used, the successfully formatted datetime columns are removed from metadata.discrete_columns and their type is changed to float.
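
The pd.concat point in item 1 can be illustrated with a minimal sketch. The helper names concat_iteratively and concat_once are hypothetical and are not taken from sdgx, and the chunk data is made up for illustration.

```python
import pandas as pd


def concat_iteratively(chunks):
    """Grow the result one chunk at a time (the slower pattern).

    Every pd.concat call copies the accumulated rows and rebuilds the index.
    """
    result = None
    for chunk in chunks:
        if result is None:
            result = chunk
        else:
            result = pd.concat([result, chunk], ignore_index=True)
    return result


def concat_once(chunks):
    """Collect all chunks first, then concatenate them in a single call."""
    return pd.concat(list(chunks), ignore_index=True)


# Illustrative data only: three small chunks of one integer column.
example_chunks = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [3]}),
    pd.DataFrame({"a": [4, 5]}),
]
combined = concat_once(example_chunks)
```

Both helpers return the same frame. The second makes a single pd.concat call rather than one call per chunk, so the index is built once.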

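The fit, transform, and reverse_transform behaviour described for NormalizedLabelEncoder in item 3 can be sketched as follows. This is not the sdgx implementation: the class name SketchNormalizedLabelEncoder and its attributes are hypothetical, "uniform distribution" is assumed to mean evenly spaced keys over [-1, 1], and mapping a single-category column to 0.0 is also an assumption.

```python
from bisect import bisect_left


class SketchNormalizedLabelEncoder:
    """Illustrative only: maps sorted categories to evenly spaced keys in [-1, 1]."""

    def fit(self, values):
        self.categories = sorted(set(values))
        n = len(self.categories)
        if n == 1:
            # Assumption: a single category is mapped to the midpoint.
            self.keys = [0.0]
        else:
            self.keys = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
        self._key_of = dict(zip(self.categories, self.keys))
        return self

    def transform(self, values):
        # Map each category to its float key.
        return [self._key_of[v] for v in values]

    def reverse_transform(self, keys):
        # Map each float (for example, a generator output) to the nearest category.
        return [self.categories[self._nearest_index(k)] for k in keys]

    def _nearest_index(self, key):
        pos = bisect_left(self.keys, key)
        if pos == 0:
            return 0
        if pos == len(self.keys):
            return len(self.keys) - 1
        before, after = self.keys[pos - 1], self.keys[pos]
        return pos - 1 if key - before <= after - key else pos


encoder = SketchNormalizedLabelEncoder().fit(["b", "a", "c", "a"])
encoded = encoder.transform(["a", "b", "c"])          # [-1.0, 0.0, 1.0]
decoded = encoder.reverse_transform([-0.9, 0.2, 0.8])  # ['a', 'b', 'c']
```

Each categorical column becomes one float column instead of one column per category, which is where the reduction in training dimensions comes from.
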
Motivation and Context

Description 2 solves the problem mentioned in #246.
Description 3 partially solves the problem mentioned in #77.

How has this been tested?

Some tests have been provided.

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@cyantangerine cyantangerine changed the title from "New feature support for DataFrameConnector & NormalizedLabelEncoder, CTGAN Optimization, and Performance Enhancements." to "New feature support for DataFrameConnector, NormalizedFrequencyEncoder & NormalizedLabelEncoder; CTGAN Optimization and Performance Enhancements." on Nov 29, 2024
@Wh1isper
Collaborator

Thank you for your hard work on this! This is a large Pull Request, and it would be much easier to review if it were split into several smaller PRs. Specifically, combining performance improvements with new features may lead us to spend more time discussing the implementation of the new features, which could delay the merging of the performance improvements.

I may find some time this weekend to review it!

@codecov-commenter
Codecov Report

Attention: Patch coverage is 93.20755% with 18 lines in your changes missing coverage. Please review.

Project coverage is 83.82%. Comparing base (00685a3) to head (d1263e9).
Report is 22 commits behind head on main.

Files with missing lines                                Patch %   Lines
sdgx/data_models/metadata.py                            91.35%    7 Missing ⚠️
sdgx/data_processors/formatters/datetime.py             68.75%    5 Missing ⚠️
sdgx/models/components/optimize/ndarray_loader.py       91.17%    3 Missing ⚠️
sdgx/data_connectors/dataframe_connector.py             96.96%    1 Missing ⚠️
sdgx/data_processors/formatters/int.py                  66.66%    1 Missing ⚠️
.../components/optimize/sdv_ctgan/data_transformer.py   97.67%    1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
+ Coverage   82.17%   83.82%   +1.65%     
==========================================
  Files          84       89       +5     
  Lines        4146     4656     +510     
==========================================
+ Hits         3407     3903     +496     
- Misses        739      753      +14     


@Wh1isper
Collaborator

I just reviewed this PR and I'm OK with most of the bug fixes and improvements. Thanks, @cyantangerine!

@cyantangerine
Contributor Author

All of the earlier questions have been resolved. Thanks for reviewing, @jalr4ever @Wh1isper.

Collaborator

@Wh1isper Wh1isper left a comment

LGTM

Collaborator

@jalr4ever jalr4ever left a comment

🚀 Approved!

@jalr4ever jalr4ever merged commit 6c9ecd1 into hitsz-ids:main Dec 2, 2024
11 checks passed