-
Notifications
You must be signed in to change notification settings - Fork 551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New feature support for DataFrameConnector, NormalizedFrequencyEncoder & NormalizedLabelEncoder; CTGAN Optimization and Performance Enhancements. #247
Conversation
# Conflicts: # sdgx/data_processors/formatters/datetime.py # sdgx/models/components/optimize/sdv_ctgan/data_sampler.py # sdgx/models/components/optimize/sdv_ctgan/data_transformer.py # sdgx/models/ml/single_table/ctgan.py
Thank you for your hard work on this! This is a large Pull Request, and it would be much easier to review if it were split into several smaller PRs. Specifically, combining performance improvements with new features may lead us to spend more time discussing the implementation of the new features, which could delay the merging of the performance improvements. I may find some time this weekend to review it! |
for more information, see https://pre-commit.ci
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #247 +/- ##
==========================================
+ Coverage 82.17% 83.82% +1.65%
==========================================
Files 84 89 +5
Lines 4146 4656 +510
==========================================
+ Hits 3407 3903 +496
- Misses 739 753 +14 ☔ View full report in Codecov by Sentry. |
I just reviewed this PR and I'm ok for most of the bug fixes and improvements, Thanks @cyantangerine ! |
Co-authored-by: Zhongsheng Ji <9573586@qq.com>
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
All the questions behind has been solved. Thanks for reviewing. @jalr4ever @Wh1isper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 Approved!
Description
For pd.concat, Iterative connections are slower than one-time connections because each connection requires calculating the index.
The DataFrameConnector is a proxy direct to connect DataFrame.
2.1 Simultaneously update the NoCache logic of DataLoader corresponding to DataFrameConnector, removed disk file solving time.
2.2 Simultaneously supports whether NDArrayLoader allows save_to_file.
For NormalizedLabelEncoder, in fit, using sorted unique values, it will create a [-1,1] key to value a map by uniform distribution. In transform, it maps the value to key. In reverse_transform, it find a nearest key to reverse-map the key to value.
3.1 Update Metadata simultaneously to support specifying encoders for table columns. Furthermore, it supports checking independent values using categorical_threshold and automatically selecting encoders for table columns.
The type of encoders are 'label' and 'onehot' now. If the column's unique count > threshold or it's encoder has been specified as 'label', the CTGAN model will choose the 'NormalizedLabelEncoder' to transform the column.
3.2 Add a linear activation function in CTGAN to adapt to label encoding.
By added **kwargs in load function to allow using model_kwargs (exist arg) in model.init. Device param is included.
Instead checking type in array by using pandas type-checking api.
If the formatter go error, it replace the value as zero in the past. Now, we replace it as NaN, after all the value replaced, we fill the NaN and NULL as the mean of all values.
Motivation and Context
Description 2 solve the problem mentioned in #246 .
Description 3 partially solve the problem mentioned in #77 .
How has this been tested?
Some tests are been given.
Types of changes
Checklist: