Cannot use any scalar constraint (ScalarRange, ScalarInequality) with numerical columns that can be confused as datetimes #2328

Open
npatki opened this issue Dec 20, 2024 · 0 comments
Labels
bug Something isn't working feature:constraints Related to inputting rules or business logic


npatki commented Dec 20, 2024

Environment Details

  • SDV version: 1.17.3 (latest)

Error Description

I may have a numerical column (listed as sdtype numerical in my metadata) whose values are easily mistaken for datetimes. For example, it may contain integers such as 2024, 2023, 2022, etc.

In such cases, I can generally fit and sample synthetic data. However, if I add either of the scalar constraints (ScalarRange, ScalarInequality), I get a ValueError when fitting.

Steps to reproduce

import pandas as pd

from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'x': [2020, 2021, 2024, 2023, 2022, 2023, 2021, 2022],
})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'columns': {
                'x': { 'sdtype': 'numerical' },
            }
        }
    }
})

synth = GaussianCopulaSynthesizer(metadata)

my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'x',
        'low_value': 2019,
        'high_value': 2025,
    }
}

synth.add_constraints([my_constraint])
synth.fit(data)

Output:

/usr/local/lib/python3.10/dist-packages/sdv/constraints/tabular.py in __init__(self, column_name, low_value, high_value, strict_boundaries)
   1133         self.constraint_columns = (column_name,)
   1134         self._column_name = column_name
-> 1135         self._validate_init_inputs(low_value, high_value)
   1136         self._is_datetime = None
   1137         self._datetime_format = None

/usr/local/lib/python3.10/dist-packages/sdv/constraints/tabular.py in _validate_init_inputs(low_value, high_value)
   1087         values_are_strings = isinstance(low_value, str) and isinstance(high_value, str)
   1088         if values_are_datetimes and not values_are_strings:
-> 1089             raise ValueError('Datetime must be represented as a string.')
   1090 
   1091         values_are_numerical = bool(_is_numerical(low_value) and _is_numerical(high_value))

ValueError: Datetime must be represented as a string.

Workaround

In the meantime, a workaround is to add a constant to each value in this column so that it is no longer confused for a datetime. For example, adding 2000 produces values such as 4020, 4021, ..., which won't be assumed to be datetimes. After sampling synthetic data, you can subtract the constant to get values back in the original range. Note that if the same constant is added everywhere, it should not have any effect on the synthetic data quality.

data_copy = data.copy()

# add a constant value to the column so that it won't be confused with a datetime
CONST_VAL = 2000

data_copy['x'] = data_copy['x'] + CONST_VAL

synth = GaussianCopulaSynthesizer(metadata)

# add the same constant value to the constraint
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'x',
        'low_value': 2019 + CONST_VAL,
        'high_value': 2025 + CONST_VAL,
    }
}

synth.add_constraints([my_constraint])
synth.fit(data_copy)
synthetic_data = synth.sample(num_rows=5)

# subtract it to get data back in the original range
synthetic_data['x'] = synthetic_data['x'] - CONST_VAL
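
The workaround above shifts the ScalarRange bounds by hand. Since the issue affects ScalarInequality as well, the same shift could be applied with a small helper. This is only a sketch: `shift_constraint` and `BOUND_KEYS` are hypothetical names, not part of SDV, and it assumes the bounds live under `low_value`/`high_value` (ScalarRange) or `value` (ScalarInequality). Whether shifting avoids the error for ScalarInequality in the same way has not been verified here.

```python
import copy

# Hypothetical helper, not part of SDV.
# Assumed parameter names that hold scalar bounds:
# ScalarRange uses low_value/high_value, ScalarInequality uses value.
BOUND_KEYS = ('low_value', 'high_value', 'value')


def shift_constraint(constraint, offset):
    """Return a copy of a constraint dict with its numeric bounds shifted by offset."""
    shifted = copy.deepcopy(constraint)  # leave the original dict untouched
    params = shifted['constraint_parameters']
    for key in BOUND_KEYS:
        if key in params:
            params[key] = params[key] + offset
    return shifted


CONST_VAL = 2000

range_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'x',
        'low_value': 2019,
        'high_value': 2025,
    }
}

inequality_constraint = {
    'constraint_class': 'ScalarInequality',
    'constraint_parameters': {
        'column_name': 'x',
        'relation': '>=',
        'value': 2019,
    }
}

shifted_range = shift_constraint(range_constraint, CONST_VAL)
shifted_inequality = shift_constraint(inequality_constraint, CONST_VAL)
```

The shifted dicts would then be passed to `synth.add_constraints([...])` in place of the originals, with the data column shifted by the same `CONST_VAL` before fitting and shifted back after sampling, as in the workaround above.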