Fixed combinations Constraint #2253

Closed
Pavan-Kalyan1432 opened this issue Oct 4, 2024 · 6 comments
Assignees
Labels
question General question about the software resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@Pavan-Kalyan1432

Pavan-Kalyan1432 commented Oct 4, 2024

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version:
  • Python version:
  • Operating System:

Problem description

What I already tried

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer, CTGANSynthesizer, TVAESynthesizer
import pandas as pd
import os

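real_data = pd.read_csv('data//BILLING.csv')
# Drop columns that are entirely empty first; once fillna("") has run,
# no column is all-NaN any more and dropna(how='all') would do nothing.
real_data = real_data.dropna(axis=1, how='all').fillna("")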
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)
metadata.update_columns_metadata(
    {
        "First Name":{"sdtype":"categorical"},
        "Last Name":{"sdtype":"categorical"},
        "Middle Name":{"sdtype":"categorical"},
        "Full Name":{"sdtype":"categorical"},
        "Date of Birth":{"sdtype":"date"},
        "National ID":{"sdtype":"categorical"}
    }
)

metadata.update_column("Phone Number", pii=False)

metadata.remove_primary_key()

path = 'output//metadata.json'
if os.path.exists(path):
    os.remove(path)
metadata.save_to_json(path)

my_constraint = {
    'constraint_class' : "FixedCombinations",
    'constraint_parameters' : {
        'column_names' : ['First Name', 'Middle Name', 'Last Name', 'Full Name']
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints(constraints=[my_constraint])
synthesizer.fit(real_data)

for col in real_data.columns:
    null_count = real_data[col].isnull().sum()
    empty_string_count = (real_data[col] == "").sum()
    total_nulls = null_count + empty_string_count
    total_cells = real_data.shape[0]
    null_percentage = (total_nulls / total_cells) * 100 if total_cells > 0 else 0
    # Built-in round() also works when the fallback value is the plain int 0.
    null_percent = round(float(null_percentage), 2)
    print(f"{col} - {null_percent}%")

s = []

while True:
    column = input("Enter the column name to fix (or 'exit' to stop): ")
    if column == "exit":
        break
    if column not in real_data.columns:
        print("Column not found")
        continue
    s.append(column)

if s:
    fixed_columns = real_data[s]
    synthetic_data = synthesizer.sample_remaining_columns(fixed_columns, max_tries_per_batch=200)
else:
    synthetic_data = synthesizer.sample(num_rows=50)

synthetic_data.to_csv('output//synthetic_data_1.csv', index=False)

Here, FixedCombinations repeats some combinations but does not cover all of them. How can I make it use every combination of first name, middle name, last name, and full name that appears in the real data?

@Pavan-Kalyan1432 Pavan-Kalyan1432 added new Automatic label applied to new issues question General question about the software labels Oct 4, 2024
@srinify
Contributor

srinify commented Oct 8, 2024

Hi @Pavan-Kalyan1432 can you clarify what you mean by "repeating the combinations but it is not considering all the combinations"?

  • Is the synthesizer re-using the same combinations of values from your real data?
  • Is it only re-using some of the combinations?

When generating synthetic data, this constraint ensures that the synthesizer only uses combinations of values in these 4 columns that exist in your real data. For example, if the only combination in your real data is "Jack", "John", "Jay", and "Jack John Jay" for your 4 columns, then this is the only combination that will show up in the synthetic data.
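
One way to check this behaviour on your own output is to compare the distinct combinations in the real and synthetic tables. The sketch below uses only pandas. The helper name `combination_report` and the toy rows are hypothetical and are not part of SDV; it also assumes missing values have already been replaced with empty strings, as in the script above.

```python
import pandas as pd

NAME_COLUMNS = ["First Name", "Middle Name", "Last Name", "Full Name"]


def combination_report(real, synthetic, columns):
    """Compare the distinct value combinations of `columns` in two tables."""
    real_combos = set(real[columns].itertuples(index=False, name=None))
    synth_combos = set(synthetic[columns].itertuples(index=False, name=None))
    return {
        # Combinations in the synthetic data that never occur in the real data.
        "new": synth_combos - real_combos,
        # Real combinations that the synthetic sample did not happen to include.
        "missing": real_combos - synth_combos,
    }


# Toy stand-ins for the real and synthetic tables (illustrative values only).
real = pd.DataFrame(
    [
        ("Jack", "John", "Jay", "Jack John Jay"),
        ("Ann", "", "Lee", "Ann Lee"),
        ("Bo", "K", "Yu", "Bo K Yu"),
    ],
    columns=NAME_COLUMNS,
)
synthetic = pd.DataFrame(
    [
        ("Jack", "John", "Jay", "Jack John Jay"),
        ("Jack", "John", "Jay", "Jack John Jay"),
        ("Ann", "", "Lee", "Ann Lee"),
    ],
    columns=NAME_COLUMNS,
)

report = combination_report(real, synthetic, NAME_COLUMNS)
print("new combinations:", report["new"])
print("missing combinations:", report["missing"])
```

An empty "new" set is what the constraint is meant to guarantee. A non-empty "missing" set is not a violation, since the constraint does not promise that every real combination is used.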

@srinify srinify self-assigned this Oct 8, 2024
@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Oct 8, 2024
@Pavan-Kalyan1432
Author

For example, it repeats the same combination multiple times, and it also does not include all the combinations that are in the real data.

@npatki
Contributor

npatki commented Oct 8, 2024

Hi @Pavan-Kalyan1432, if I may jump in here: The purpose of the FixedCombinations constraint is only to fix the combinations that are created. Adding this constraint will prevent new permutations from being synthesized in the columns you specify.

If you sample many more rows, then I think that, by random chance, you will eventually end up generating all the combinations that were in the original data.
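
The effect of sample size can be illustrated with a small simulation that does not use SDV at all. It assumes, purely for illustration, that each combination is drawn independently with a probability proportional to how often it appears in the real data; the weights and the function name `coverage_by_sample_size` are made up for this sketch and say nothing about how the synthesizer samples internally.

```python
import random


def coverage_by_sample_size(weights, sizes, seed=0):
    """Return the fraction of distinct combinations seen after n draws, for each n."""
    rng = random.Random(seed)
    combos = list(range(len(weights)))
    # Draw once up to the largest size, then look at growing prefixes of that sample.
    draws = rng.choices(combos, weights=weights, k=max(sizes))
    return {n: len(set(draws[:n])) / len(combos) for n in sizes}


# Hypothetical frequencies: two common combinations and several rare ones.
weights = [50, 30, 10, 5, 3, 1, 1]
coverage = coverage_by_sample_size(weights, sizes=[10, 50, 500, 5000])

for n, fraction in coverage.items():
    print(f"{n:>5} rows -> {fraction:.0%} of combinations seen")
```

Under this assumption, rare combinations are the ones that go missing in a 50-row sample, so raising `num_rows` in `synthesizer.sample(...)` is the simplest thing to try.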

However, preventing repetition is not the purpose of this constraint. May I ask why you want to prevent repetition in your data? This suggests to me that, in your synthetic data, you want the exact same names to appear in the exact same rows as in your real data. Is that correct? If you could provide more information on your use case (what you are trying to accomplish with synthetic data), we can better guide you to a solution. Thanks.

@srinify
Contributor

srinify commented Oct 23, 2024

Hi @Pavan-Kalyan1432 we hope our responses cleared things up! Since we haven't heard from you in a while, I'm going to move forward with closing this issue out. Please don't hesitate to open a new issue or ask in our Slack for new questions!

@srinify srinify closed this as completed Oct 23, 2024
@srinify srinify added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Oct 23, 2024
@Pavan-Kalyan1432
Author

How do I manage inter-column dependencies?
For example, we have 3 columns: date of birth, date of death, and age. In the synthetic data, these columns are not consistent with each other. Please give an answer for both single-table and multi-table data.

@npatki
Contributor

npatki commented Dec 10, 2024

Hi @Pavan-Kalyan1432, the original issue you filed was for FixedCombinations for first name and last name. Are you still having problems with this?

Your most recent question is on a different topic, so I have filed a new issue here: #2318

We can continue the discussion about your inter-column dependency (date of birth, date of death, and age) in the new issue.


3 participants