How to regulate the range of Synthetic Data #189

Closed
GlenLeee opened this issue Jun 13, 2024 · 19 comments · Fixed by #217
Labels: difficulty-medium, enhancement (New feature or request), question (Further information is requested)

Comments

@GlenLeee

❓Search before asking

I have searched for issues similar to this one.

❓Description

I noticed that my original dataset contains only positive values, but the generated data includes negative values. How can I constrain the range of each column in the generated data?

GlenLeee added the question (Further information is requested) label on Jun 13, 2024
@MooooCat
Contributor

@GlenLeee Good question!

To address your request, we plan to add a Rule Manager, which is already on our version roadmap (see Issue #149). This module is under development and will be released in a later version.

In addition, if possible, could you briefly describe your data and say which feature is most likely to cause this issue? That would help us understand the application scenario. (If this requirement is common, we could technically also handle it automatically using metadata and a data processor. That may be another solution.)

Thank you again for your question; we look forward to your reply!

@GlenLeee
Author

GlenLeee commented Jun 13, 2024

Is this an auto-reply, haha?
In my dataset, each column represents a physical property of soil. Some columns have relatively small values, within the range 0-1. The generated data includes negative values, which clearly violate physical laws. This is quite troubling for me.
It would be nice if I could restrict the columns of the generated dataset to a certain range.
Uploading 245eec5292e751ea0916e143837aa1d3.png…
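
One way to enforce such a range outside the library is to post-process the sampled table with plain pandas. The sketch below is a minimal example under that assumption, not an SDG feature: the helper name `clip_to_observed_range`, the column names, and the values are all made up for illustration. It clips every numeric column of the synthetic data to the minimum and maximum observed in the original data.

```python
import pandas as pd


def clip_to_observed_range(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Clip each numeric column of `synthetic` to the [min, max] seen in `real`."""
    clipped = synthetic.copy()
    for col in real.select_dtypes(include="number").columns:
        if col in clipped.columns:
            clipped[col] = clipped[col].clip(lower=real[col].min(), upper=real[col].max())
    return clipped


# Toy values standing in for soil properties in the 0-1 range (invented data).
real = pd.DataFrame({"porosity": [0.12, 0.40, 0.33], "moisture": [0.05, 0.21, 0.18]})
synthetic = pd.DataFrame({"porosity": [-0.03, 0.25, 0.55], "moisture": [0.10, -0.01, 0.19]})

bounded = clip_to_observed_range(real, synthetic)
print(bounded)
```

Clipping piles out-of-range values up at the bounds, which distorts the distribution near 0; dropping or resampling the offending rows is an alternative if that matters.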

@MooooCat
Contributor

MooooCat commented Jun 13, 2024

Hahahaha I'm a real person, the avatar is my cat. @GlenLeee

@GlenLeee
Author

Can you see the image I've attached? I've taken a portion of my dataset.

@MooooCat
Contributor

> Can you see the image I've attached? I've taken a portion of my dataset.

I can't see it, all I can see is this ⬇️

Uploading 245eec5292e751ea0916e143837aa1d3.png…

In principle, pictures can be uploaded in an issue. It shows as uploading; is it still uploading?

@GlenLeee
Author

I don't know, I can see the same thing you can see.
[Image: 245eec5292e751ea0916e143837aa1d3]

@GlenLeee
Author

I've posted the screenshot again. As you can see, the dataset is all positive, but the data I generated using SDG has negative values. Interesting question!

@MooooCat
Contributor

> I've posted the screenshot again. As you can see, the dataset is all positive, but the data I generated using SDG has negative values. Interesting question!

I can see the picture now. I'll try to fix this soon, let's keep in touch :)

@GlenLeee
Author

> I can see the picture now. I'll try to fix this soon, let's keep in touch :)

Thank you so much :)

@walkovernamtso

Same problem when using the iris dataset.
Has this issue been solved?

@MooooCat
Contributor

> Same problem when using the iris dataset. Has this issue been solved?

Hi @daydayuphere,

This issue has been resolved.

We have introduced the PositiveNegativeFilter to address this in the current main branch.

Please see PR: #217

However, the current main branch has not yet been released.

In the meantime, you can install it from the main branch (the leading ! is for notebook cells): !pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

I will release a version soon.
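
The thread does not show how PositiveNegativeFilter is implemented, so the following is only a rough sketch of the general idea such a filter suggests: learn from the real data which numeric columns never change sign, then drop synthetic rows that violate that. `SignFilterSketch` and its methods are hypothetical names, not the sdgx API, and the toy values are invented.

```python
import pandas as pd


class SignFilterSketch:
    """Hypothetical stand-in; NOT the sdgx PositiveNegativeFilter implementation."""

    def fit(self, real: pd.DataFrame) -> "SignFilterSketch":
        numeric = real.select_dtypes(include="number")
        # Remember columns whose observed values are all >= 0, or all <= 0.
        self.non_negative_cols = [c for c in numeric if (numeric[c] >= 0).all()]
        self.non_positive_cols = [c for c in numeric if (numeric[c] <= 0).all()]
        return self

    def filter(self, synthetic: pd.DataFrame) -> pd.DataFrame:
        # Start by keeping every row, then rule out rows that break a sign rule.
        keep = pd.Series(True, index=synthetic.index)
        for col in self.non_negative_cols:
            if col in synthetic.columns:
                keep &= (synthetic[col] >= 0)
        for col in self.non_positive_cols:
            if col in synthetic.columns:
                keep &= (synthetic[col] <= 0)
        return synthetic[keep].reset_index(drop=True)


# Column "a" is always positive, "b" always negative, "c" has mixed signs.
real = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [-1.0, -2.0, -0.5], "c": [-1.0, 0.5, 2.0]})
synthetic = pd.DataFrame({"a": [0.5, -0.2, 2.0], "b": [-0.1, -3.0, 0.4], "c": [-5.0, 1.0, 3.0]})

filtered = SignFilterSketch().fit(real).filter(synthetic)
print(filtered)
```

Because rows are dropped rather than clipped, the output can contain fewer rows than were sampled; a caller that needs an exact row count would have to sample more and filter again.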

@walkovernamtso

Thanks for your reply. Looking forward to the new release.

@MooooCat
Contributor

> Thanks for your reply. Looking forward to the new release.

We have released version 0.2.1, which should cover the feature mentioned above.
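
To confirm that an environment actually has a release at or above 0.2.1, the installed version can be checked with the standard library. This sketch assumes the distribution is named `sdgx` (inferred from the import name used in this thread, not confirmed here); `parse_release` and `installed_at_least` are illustrative helpers.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text: str) -> tuple:
    """Turn '0.2.1' into (0, 2, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_at_least(package: str, minimum: tuple) -> bool:
    """Return True if `package` is installed with a release >= `minimum`."""
    try:
        return parse_release(version(package)) >= minimum
    except PackageNotFoundError:
        return False


# "sdgx" is an assumed distribution name; prints False if it is not installed.
print(installed_at_least("sdgx", (0, 2, 1)))
```

The parser is deliberately simple and ignores pre-release suffixes such as `rc1`, so it is only a rough check.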

@Bhargav-Ravinuthala

How can I test this? I'm looking for a code example...
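
One way to test for this, independent of the model used, is to compare the signs of the real and sampled tables directly. The sketch below uses only pandas; the function name `columns_with_new_negatives` and the toy values are illustrative assumptions.

```python
import pandas as pd


def columns_with_new_negatives(real: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Map column name -> number of negative synthetic values, restricted to
    numeric columns that contain no negative values in the real data."""
    report = {}
    for col in real.select_dtypes(include="number").columns:
        if col not in synthetic.columns or (real[col] < 0).any():
            continue  # column missing, or negatives are legitimate here
        negatives = int((synthetic[col] < 0).sum())
        if negatives:
            report[col] = negatives
    return report


# Invented values: "sepal_length" is never negative in the real data, "delta" can be.
real = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3], "delta": [-0.2, 0.1, 0.3]})
synthetic = pd.DataFrame({"sepal_length": [5.0, -0.4, -1.2], "delta": [-0.5, 0.2, 0.0]})

report = columns_with_new_negatives(real, synthetic)
print(report)
```

With SDG, `real` would be the training CSV loaded with `pd.read_csv`, and `synthetic` the object returned by `synthesizer.sample(...)` as in the script posted in this thread; an empty report means no column gained negative values.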

@Bhargav-Ravinuthala

I am still getting negative values when I run the code with the latest updates:

import pandas as pd
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data

# This will download demo data to ./dataset
dataset_csv = download_demo_data()

# Create data connector for csv file
data_connector = CsvConnector(path=r"C:\Users\Bhargav\Downloads\Book1.csv")

# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)

# Fit the model
synthesizer.fit()

# Sample synthetic data
sampled_data = synthesizer.sample(1000)

# Save sampled data to CSV
output_path = r"C:\Users\Bhargav\Downloads\synthetic_data.csv"
sampled_data.to_csv(output_path, index=False)

print(f"Synthetic data saved to {output_path}")

@jalr4ever
Collaborator

> > Same problem when using the iris dataset. Has this issue been solved?
>
> Hi @daydayuphere,
>
> This issue has been resolved.
>
> We have introduced the PositiveNegativeFilter to address this in the current main branch.
>
> Please see PR: #217
>
> However, the current main branch has not yet been released.
>
> You can install it from the main branch with: !pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git
>
> I will release a version soon.

I glanced at the PR, and the issue may be that the filter was not added to the processor handling chain; this probably calls for a functional test in addition to the unit tests. We will add the relevant content later.

jalr4ever added the enhancement (New feature or request) label on Nov 1, 2024
@jalr4ever
Collaborator

> I am still getting negative values when I run the code with the latest updates

Could you provide us with a small CSV dataset to help us reproduce the bug?
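
When the original data cannot be shared, a small synthetic stand-in with the same characteristic (only non-negative numeric columns) is often enough for a bug report. A minimal sketch using only the standard library; the file name, column names, and value ranges are invented for illustration.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_toy_csv(path: Path, rows: int = 50, seed: int = 0) -> Path:
    """Write a small CSV whose numeric columns are all non-negative."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "ratio", "length_cm"])
        for i in range(rows):
            ratio = round(rng.uniform(0.0, 1.0), 4)
            length_cm = round(rng.uniform(1.0, 8.0), 2)
            writer.writerow([i, ratio, length_cm])
    return path


# Hypothetical output location in the system temp directory.
out = write_toy_csv(Path(tempfile.gettempdir()) / "toy_positive.csv")

# Read the file back to confirm its shape before attaching it to a report.
with out.open(newline="") as handle:
    rows = list(csv.DictReader(handle))
print(len(rows), "rows written to", out)
```

If a file like this still produces negative samples, it can be attached to the report in place of the private data.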

@Wh1isper
Collaborator

Wh1isper commented Nov 6, 2024

If there are any bug reports about this feature, feel free to open a new issue to discuss them!

@Bhargav-Ravinuthala

Raised a bug report: #231
