Support passing tabular constraints to the HMA1 model #296

Closed
myrthewouters opened this issue Jan 13, 2021 · 6 comments · Fixed by #660
Labels
data:multi-table Related to multi-table, relational datasets feature:constraints Related to inputting rules or business logic feature request Request for a new feature

Comments

@myrthewouters

myrthewouters commented Jan 13, 2021

Environment Details

  • SDV version: 0.6.1
  • Python version: 3.7.0
  • Operating System: Windows 10

Error Description

Hi,

I am running the HMA1 model on my dataset. The data consists of two tables. I am trying to add constraints to the HMA1 model that ensure the sampled values of some columns are positive. However, I get an error when adding the constraints as model kwargs.

Steps to reproduce

My dataset is private, so I reproduced the issue with your relational demo dataset with the code below.

import pandas as pd
import numpy as np
from sdv import load_demo

tables = load_demo()

# Delete transactions table as my data also only has 2 tables (but not really crucial to do so)
del tables['transactions']

# 2 random columns to put constraints on
# As these columns contain some zeros, there is a considerable chance to sample negative values after training
tables['users']['random_user_column'] = np.random.choice(range(0, 81), 10, p=[0.5]+80*[0.5/80])
tables['sessions']['random_session_column'] = np.random.choice(range(0, 31), 10, p=[0.5]+30*[0.5/30])

# Metadata
from sdv import Metadata
metadata = Metadata()

metadata.add_table(
    name='users',
    data=tables['users'],
    primary_key='user_id'
)

metadata.add_table(
    name='sessions',
    data=tables['sessions'],
    primary_key='session_id',
    parent='users',
    foreign_key='user_id'
)

# Constraints
from sdv.constraints import CustomConstraint

def is_positive_random_users(data):
    column = data['random_user_column']
    return column >= 0

def is_positive_random_sessions(data):
    column = data['random_session_column']
    return column >= 0

positive_random_users_constraint = CustomConstraint(is_valid=is_positive_random_users)
positive_random_sessions_constraint = CustomConstraint(is_valid=is_positive_random_sessions)
constraints = [positive_random_users_constraint, positive_random_sessions_constraint]

# HMA1 model
from sdv.relational import HMA1

# Set model kwargs, including constraints
model_kwargs = {'default_distribution': 'gaussian',
                'categorical_transformer': 'categorical_fuzzy',
                'constraints': constraints}

model = HMA1(metadata, model_kwargs=model_kwargs)
model.fit(tables)

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-4e6e9d6cc392> in <module>
     55 
     56 model = HMA1(metadata, model_kwargs=model_kwargs)
---> 57 model.fit(tables)

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\relational\base.py in fit(self, tables)
     60                 indicated in ``metadata``. Defaults to ``None``.
     61         """
---> 62         self._fit(tables)
     63         self.fitted = True
     64 

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\relational\hma.py in _fit(self, tables)
    197         for table_name in self.metadata.get_tables():
    198             if not self.metadata.get_parents(table_name):
--> 199                 self._model_table(table_name, tables)
    200 
    201         LOGGER.info('Modeling Complete')

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\relational\hma.py in _model_table(self, table_name, tables, foreign_key)
    162         if primary_key:
    163             table = table.set_index(primary_key)
--> 164             table = self._extend_table(table, tables, table_name)
    165 
    166         table_meta = self._prepare_for_modeling(table, table_name, primary_key)

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\relational\hma.py in _extend_table(self, table, tables, table_name)
    100         for child_name in self.metadata.get_children(table_name):
    101             child_key = self.metadata.get_foreign_key(table_name, child_name)
--> 102             child_table = self._model_table(child_name, tables, child_key)
    103             extension = self._get_extension(child_name, child_table, child_key)
    104             table = table.merge(extension, how='left', right_index=True, left_index=True)

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\relational\hma.py in _model_table(self, table_name, tables, foreign_key)
    172         LOGGER.info('Fitting %s for table %s; shape: %s', self._model.__name__,
    173                     table_name, table.shape)
--> 174         model = self._model(**self._model_kwargs, table_metadata=table_meta)
    175         model.fit(table)
    176         self._models[table_name] = model

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\tabular\copulas.py in __init__(self, field_names, field_types, field_transformers, anonymize_fields, primary_key, constraints, table_metadata, field_distributions, default_distribution, categorical_transformer)
    223             primary_key=primary_key,
    224             constraints=constraints,
--> 225             table_metadata=table_metadata,
    226         )
    227 

~\AppData\Local\conda\conda\envs\sdvault\lib\site-packages\sdv\tabular\base.py in __init__(self, field_names, field_types, field_transformers, anonymize_fields, primary_key, constraints, table_metadata)
     79                 if arg:
     80                     raise ValueError(
---> 81                         'If table_metadata is given {} must be None'.format(arg.__name__))
     82 
     83             if isinstance(table_metadata, dict):

AttributeError: 'list' object has no attribute '__name__'
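The last frame of the traceback suggests where this goes wrong: when table_metadata is given, the check in sdv/tabular/base.py builds its error message from arg.__name__, and a list of constraints has no such attribute, so the intended ValueError is never raised. The snippet below is a minimal stand-alone sketch of that behaviour, not SDV's actual code. The names check_no_extra_args and try_check are hypothetical, and the loop around the two lines visible in the traceback is an assumption.

```python
def check_no_extra_args(table_metadata=None, **kwargs):
    """Stand-in for the check visible in the last traceback frame.

    The loop over keyword arguments is an assumption; only the
    ``if arg:`` test and the ``raise`` come from the traceback.
    """
    if table_metadata is not None:
        for arg in kwargs.values():
            if arg:
                # Formatting the message reads arg.__name__, which a list
                # does not have, so this line fails before ValueError exists.
                raise ValueError(
                    'If table_metadata is given {} must be None'.format(arg.__name__))


def try_check(constraints):
    """Run the check and report which exception type, if any, was raised."""
    try:
        check_no_extra_args(table_metadata={'fields': {}}, constraints=constraints)
    except Exception as error:
        return type(error).__name__, str(error)
    return None


# A non-empty list of constraints triggers AttributeError, not ValueError.
print(try_check([object()]))

# An empty list or None is falsy, so the check passes silently.
print(try_check([]))
print(try_check(None))
```

Under this reading, any non-empty constraints list passed through model_kwargs would hit the same failure, whatever the constraints contain.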
@myrthewouters myrthewouters added bug Something isn't working pending review labels Jan 13, 2021
@csala
Contributor

csala commented Feb 5, 2021

Thanks for reporting this @myrthewouters

There seems to be a problem in how the GaussianCopula models capture the arguments inside HMA1.

I suspect that you should be able to work around this if you dump the metadata as a JSON or dict and define the constraints there, but this is a functionality that still needs to be fully reviewed and documented.

If you want to explore it, try creating a Table instance with constraints in it and then calling its to_dict method. You should be able to set the output you obtain as the corresponding table entry within the dataset metadata.

@csala
Contributor

csala commented Feb 5, 2021

Changing the issue title and its type a bit to reflect that this is a new feature that still needs to be implemented.

@csala csala changed the title Adding constraints in HMA1 model Support passing tabular constraints to the HMA1 model Feb 5, 2021
@csala csala added feature request Request for a new feature and removed bug Something isn't working labels Feb 5, 2021
@npatki
Contributor

npatki commented May 20, 2021

Some more details about the workaround described in the prior comment (@csala please correct if this is wrong)

Create your own multi-table metadata JSON file that includes constraints. Rough outline:

  1. Create a single table instance using all the constraints you need (user guide for ref)
  2. Instantiate and fit the single table model. Then print out its metadata
from sdv.tabular import GaussianCopula

data = None # TODO load single table data
constraints = [] # TODO create all the constraints

model = GaussianCopula(constraints=constraints)
model.fit(data)
print(model.get_metadata())
  3. The printed metadata will be for the entire table and it will include all the constraints you've added. Now you can copy-paste the JSON for each individual table when creating a multi-table schema (e.g. in this case, the "users" table JSON, etc.)
  4. Create the entire multi-table metadata schema that way, then pass it into HMA1 as the metadata.
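The last two steps of the outline amount to placing each single-table dictionary under its table name in one multi-table dictionary. Below is a rough sketch under stated assumptions. The helper build_dataset_metadata is hypothetical, and the per-table dictionaries are hand-written stand-ins for the metadata printed from each fitted model. The field layout, the top-level "tables" key and the shape of the constraint entry (a "constraint" key holding the class path, plus its arguments) are all assumed here, so check them against your SDV version. The columns low_column and high_column are made up for illustration.

```python
import copy
import json


def build_dataset_metadata(table_dicts):
    """Place each single-table metadata dict under its table name.

    The top-level 'tables' key is an assumed layout, not a confirmed schema.
    """
    return {'tables': copy.deepcopy(table_dicts)}


# Hand-written stand-ins for the per-table metadata printed from each model.
# Field layout and constraint entry format are assumptions.
users_meta = {
    'primary_key': 'user_id',
    'fields': {
        'user_id': {'type': 'id', 'subtype': 'integer'},
        'low_column': {'type': 'numerical', 'subtype': 'integer'},
        'high_column': {'type': 'numerical', 'subtype': 'integer'},
    },
    'constraints': [
        {
            'constraint': 'sdv.constraints.tabular.GreaterThan',
            'low': 'low_column',
            'high': 'high_column',
        },
    ],
}

sessions_meta = {
    'primary_key': 'session_id',
    'fields': {
        'session_id': {'type': 'id', 'subtype': 'integer'},
        # Assumed way for the child table to point back at its parent.
        'user_id': {
            'type': 'id',
            'subtype': 'integer',
            'ref': {'table': 'users', 'field': 'user_id'},
        },
    },
    'constraints': [],
}

dataset_metadata = build_dataset_metadata({
    'users': users_meta,
    'sessions': sessions_meta,
})

# Serialize to JSON to confirm the result can be saved to a file.
metadata_json = json.dumps(dataset_metadata, indent=4)
print(metadata_json)
```

The resulting dictionary, or a JSON file written from it, would then be what gets passed into HMA1 as the metadata.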

I understand this is a very manual workaround, and not a desired end-state. We should keep this bug open to track a proper fix.

@kvrameshreddy

Hi @npatki,
When I use the above method and check the metadata, there is no difference between the metadata of the model with constraints and that of the model without constraints.
from sdv.demo import load_tabular_demo
from sdv.constraints import GreaterThan
from sdv.tabular import GaussianCopula

# With constraints
employees = load_tabular_demo()

age_gt_age_when_joined_constraint = GreaterThan(
    low='age_when_joined',
    high='age',
    handling_strategy='reject_sampling'
)
constraints = [age_gt_age_when_joined_constraint]

gc = GaussianCopula(constraints=constraints)
gc.fit(employees)
gc.get_metadata()

# Without constraints
employees = load_tabular_demo()

gc1 = GaussianCopula()
gc1.fit(employees)
gc1.get_metadata()

metadata

Can you please provide an example of metadata with constraints added?

Thank you.

@tim5go

tim5go commented Jul 14, 2021

@kvrameshreddy
I found this would work, and it is more elegant.

from sdv import Metadata
from sdv.relational import HMA1

metadata = Metadata()

metadata.add_table(
     name='YOUR_TABLE_NAME',
     ....
)

constraints = [
   .....
]

# Add the following line to hack it !!!
metadata._metadata['tables']['YOUR_TABLE_NAME']['constraints'] = constraints

model = HMA1(metadata)

The problem with get_metadata() is that its output doesn't show the constraints.
The fact that they aren't printed doesn't mean the model doesn't have them.

@kvrameshreddy

Hi @tim5go,

Thanks for the reply. You are absolutely correct.
My issue is this: I want to use the HMA1 model with custom metadata (a metadata JSON that I construct on my own) without using the "Metadata" class. In that case, how should I include a constraint in the metadata?

@katxiao katxiao added feature:constraints Related to inputting rules or business logic data:multi-table Related to multi-table, relational datasets and removed pending review labels Nov 18, 2021
@katxiao katxiao added this to the 0.13.1 milestone Dec 22, 2021
@katxiao katxiao self-assigned this Dec 22, 2021