Using FixedCombinations constraint with an integer constraint column causes sampling to fail #2183

Closed
srinify opened this issue Aug 12, 2024 · 0 comments · Fixed by #2185
Labels
bug Something isn't working

srinify commented Aug 12, 2024

Environment Details

  • SDV version: 1.15.0 (Latest)

Problem Description

When using FixedCombinations, if the underlying dtype of a constraint column is int, then an error is thrown during sampling.
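
To check whether a dataset is affected, it can help to look at the dtypes of the constraint columns before fitting. The snippet below is only a sketch: the toy DataFrame and its values are made up for illustration and are not taken from stock_missingcol.csv.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
toy_data = pd.DataFrame({
    'Quantity': [1.0, None, 3.0],   # float column (contains a missing value)
    'Total Price': [10, 20, 30],    # integer column
})

constraint_columns = ['Quantity', 'Total Price']

# Collect the constraint columns whose underlying dtype is an integer type.
int_columns = [
    name for name in constraint_columns
    if pd.api.types.is_integer_dtype(toy_data[name])
]
print(int_columns)
```

Any column listed in `int_columns` is a candidate for the failure described here.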

Error Description

Error seems to start with this line of code: https://github.com/sdv-dev/SDV/blob/9301b964504ec53df977f1db5eab28b5b2e2c352/sdv/data_processing/data_processor.py

[Screenshot 2024-08-12 at 4:50:57 PM]

Steps to reproduce

Dataset: stock_missingcol.csv

Code:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Columns whose value combinations must be preserved
constraint_columns = ['Quantity', 'Total Price']

real_data = pd.read_csv('stock_missingcol.csv')
real_data.head()

# Detect the metadata, then mark the constraint columns as categorical
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_columns(column_names=constraint_columns, sdtype='categorical')

synthesizer = CTGANSynthesizer(metadata, epochs=2, verbose=True)
my_constraint = {
    'constraint_class': 'FixedCombinations',
    'constraint_parameters': {
        'column_names': constraint_columns
    }
}
synthesizer.add_constraints(constraints=[my_constraint])

synthesizer.fit(real_data)
# The error is raised during sampling
synthetic_data = synthesizer.sample(num_rows=1000)

Internal Colab Notebook: https://colab.research.google.com/drive/1XUL42Wa13NQ2t0qewCoyCDYHHxmRO6ku?authuser=1#scrollTo=UGo3e-QJFk0a

Workaround

I was able to avoid this error by casting all int columns to float, then re-fitting, then sampling, then finally casting back to int in the sampled data.

# In real data, cast from int -> float
real_data['Total Price'] = real_data['Total Price'].astype(float)

# In synthetic data, cast from float -> int
synthetic_data['Total Price'] = synthetic_data['Total Price'].astype(int) 
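
The same workaround could be applied to every integer constraint column at once. This is a minimal sketch, not part of SDV: the helper names `cast_int_columns_to_float` and `restore_int_columns` are hypothetical, and the toy DataFrame stands in for the real and synthetic data.

```python
import pandas as pd


def cast_int_columns_to_float(data, column_names):
    """Return a copy with integer columns cast to float, plus their original dtypes."""
    data = data.copy()
    original_dtypes = {}
    for name in column_names:
        if pd.api.types.is_integer_dtype(data[name]):
            original_dtypes[name] = data[name].dtype
            data[name] = data[name].astype(float)
    return data, original_dtypes


def restore_int_columns(data, original_dtypes):
    """Return a copy with the recorded columns cast back to their integer dtypes.

    Assumes the columns contain no missing values; casting NaN to int raises.
    """
    data = data.copy()
    for name, dtype in original_dtypes.items():
        data[name] = data[name].round().astype(dtype)
    return data


# Hypothetical toy data; values are illustrative only.
toy_data = pd.DataFrame({'Quantity': [1, 2, 3], 'Total Price': [10, 20, 30]})

float_data, original_dtypes = cast_int_columns_to_float(
    toy_data, ['Quantity', 'Total Price']
)
# Fit the synthesizer on float_data and sample here, then restore the dtypes
# on the sampled frame. The float frame is reused below only to show the round trip.
restored = restore_int_columns(float_data, original_dtypes)
```

Recording the original dtypes up front avoids hard-coding `int` when casting back, so columns that were already float are left untouched.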