Update custom constraints documentation #857

Merged
merged 6 commits into from
Jul 6, 2022
Changes from 3 commits
2 changes: 2 additions & 0 deletions docs/conf.py
@@ -33,6 +33,7 @@
'nbsphinx',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.autosectionlabel',
'sphinx.ext.githubpages',
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
@@ -42,6 +43,7 @@
]

ipython_execlines = [
"from utils import is_valid, transform, reverse_transform",
"import pandas as pd",
"pd.set_option('display.width', 1000000)",
"pd.set_option('display.max_columns', 1000)",
Binary file added docs/images/custom_constraint.png
292 changes: 116 additions & 176 deletions docs/user_guides/single_table/custom_constraints.rst
@@ -1,247 +1,187 @@
.. _custom_constraints:

Defining Custom Constraints
===========================
Custom Constraints
==================

In some cases, the predefined constraints do not cover all your needs.
In such scenarios, you can use ``CustomConstraint`` to define your own
logic on how to constrain your data. There are three main functions that
you can create:
If you have business logic that cannot be represented using
:ref:`Predefined Constraints <Predefined Constraints>`,
you can define custom logic. In this guide, we'll walk through the process for defining a custom
constraint and using it.

- ``transform`` which is responsible for the forward pass when using ``transform`` strategy.
Its main function is to change your data in a way that enforces the constraint.
- ``reverse_transform`` which defines how to reverse the transformation of the ``transform``.
- ``is_valid`` which indicates which rows satisfy the constraint and which ones do not.
Defining your custom constraint
-------------------------------
To define your custom constraint you need to write some functionality in a separate Python file.
This includes:

Let's look at a demo dataset:
* **Validity Check**: A test that determines whether a row in the data meets the rule, and
* (optional) **Transformation Functions**: Functions to modify the data before & after modeling

.. ipython:: python
:okwarning:

from sdv.demo import load_tabular_demo

employees = load_tabular_demo()
employees
The SDV then uses the functionality you provided, as shown in the diagram below.

The dataset defined in :ref:`handling_constraints` contains basic details about employees.
We will use this dataset to demonstrate how you can create your own constraint.
.. image:: /images/custom_constraint.png

Each function (validity, transform and reverse transform) must accept the same inputs:

Using the ``CustomConstraint``
------------------------------
- column_names: The names of the columns involved in the constraints
- data: The full dataset, represented as a pandas.DataFrame
- <other parameters>: Any other parameters that are necessary for your logic

We wish to generate synthetic data from the ``employees`` records. If you look at the data
above, you will notice that the ``salary`` column is a multiple of a *base* value, in
this case the base unit is 500. In other words, the ``salary`` increments by 500.
We will define ``transform`` and ``reverse_transform`` methods to make sure our
data satisfy our constraint.

We can achieve our goal by performing transformations in a 2 step process:

- Divide ``salary`` by the base unit (500). This transformation makes it easier for the model
to learn the data since it would now learn regular integer values without any explicit constraint on the data.
- Reversing the effect by multiplying ``salary`` back with the base unit. Now that the model has
learned regular integer values, we multiply it with the base (500) such that it now conforms to our original data range.
Example
~~~~~~~

Let's demonstrate this using our demo dataset.

.. ipython:: python
:okwarning:

def transform(table_data):
base = 500.
table_data['salary'] = table_data['salary'] / base
return table_data
from sdv.demo import load_tabular_demo

employees = load_tabular_demo()
employees

After defining ``transform`` we create ``reverse_transform`` that reverses the operations made.

.. ipython:: python
:okwarning:
The dataset contains basic details about employees in some fictional companies. Many of the rules
in the dataset can be described using predefined constraints. However, there is one complex rule
that needs a custom constraint:

def reverse_transform(table_data):
base = 500.
table_data['salary'] = table_data['salary'].round() * base
return table_data
- If the employee is not a contractor (contractor == 0), then the salary must be divisible by 500
- Otherwise if the employee is a contractor (contractor == 1), then this rule does not apply
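
To make the rule concrete, the following minimal sketch checks it on a few hand-written rows using plain Python. The helper name ``follows_salary_rule`` and the sample values are hypothetical and are not part of the SDV.

```python
def follows_salary_rule(salary, contractor, increment=500):
    """Return True if a single row satisfies the rule described above."""
    if contractor == 1:
        # Contractors are exempt from the divisibility requirement.
        return True
    return salary % increment == 0


# Hypothetical rows: (salary, contractor)
rows = [(45500, 0), (45250, 0), (1234.56, 1)]
results = [follows_salary_rule(salary, contractor) for salary, contractor in rows]
print(results)  # [True, False, True]
```

The second row fails because 45250 is not a multiple of 500 and the employee is not a contractor.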

.. note::
    This is similar to the predefined :ref:`FixedIncrements <FixedIncrements>` constraint
    with the addition of an exclusion criterion (the constraint check is skipped if the employee
    is a contractor).

Then, we pack everything together in ``CustomConstraint``.
Validity Check
^^^^^^^^^^^^^^

.. ipython:: python
:okwarning:
The validity check should return a NumPy array (``numpy.ndarray``) of ``True``/``False`` values
that indicate whether each row is valid.

from sdv.constraints import CustomConstraint
Let's code the logic up using parameters:

constraint = CustomConstraint(
transform=transform,
reverse_transform=reverse_transform
)
- **column_names** will be a single-item list containing the column that must be divisible
  (e.g. salary)
- **data** will be the full dataset
- Custom parameter: **increment** describes the numerical increment (e.g. 500)
- Custom parameter: **exclusion_column** describes the column with the exclusion criterion
  (e.g. contractor)

.. code-block:: python

Can I apply the same function to multiple columns?
--------------------------------------------------
import numpy as np

def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]

In the example above we fixed the ``salary`` format, but if we continue observing the data
we will see that ``annual_bonus`` is also constrained by the same logic. Rather than
defining two constraints, or editing the code of our functions for each new column that we want
to constrain, we provide another style of writing functions in which the function accepts
the data of a single column as input.
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)

The ``transform`` function takes ``column_data`` as input and returns the transformed column.
    return np.array(is_divisible | is_excluded)
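
As a quick sanity check, the function can be called directly on a small DataFrame before it is handed to the SDV. The sketch below repeats the ``is_valid`` definition so that it runs on its own; the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return np.array(is_divisible | is_excluded)


# Made-up rows: two regular employees and one contractor.
sample = pd.DataFrame({
    'salary': [45500.0, 45250.0, 1234.56],
    'contractor': [0, 0, 1],
})

valid = is_valid(['salary'], sample, increment=500, exclusion_column='contractor')
print(valid.tolist())  # [True, False, True]
```

Only the second row is flagged as invalid: its salary is not a multiple of 500 and the employee is not a contractor.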


.. ipython:: python
:okwarning:
Transformations
^^^^^^^^^^^^^^^

def transform(column_data):
base = 500.
return column_data / base
The transformations must return the full dataset with particular columns transformed. We can
modify, delete or add columns as long as we can reverse the transformation later.

Similarly we defined ``reverse_transform`` in a way that it operates on the data of a
single column.
In our case, the transformation can just divide each of the values in the column by the increment.

.. ipython:: python
:okwarning:
.. code-block:: python

def reverse_transform(column_data):
base = 500.
return column_data.round() * base
def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data

Now that we have our functions, we initialize ``CustomConstraint`` and we
specify which column(s) are the desired ones.

.. ipython:: python
:okwarning:
Reversing the transformation is trickier. If we simply multiply every value by the increment, the
salaries won't necessarily be divisible by 500, because the modeled values are not guaranteed to
be whole numbers. Instead, we should:

constraint = CustomConstraint(
columns=['salary', 'annual_bonus'],
transform=transform,
reverse_transform=reverse_transform
)
- First, round values to whole numbers for employees who are not contractors, and then
- Multiply every value by the increment (500)

.. code-block:: python

def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]

Can I access the rest of the table from my column functions?
------------------------------------------------------------
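    # Round only the rows where the rule applies (non-contractors) and write them back.
    is_included = (transformed_data[exclusion_column] == 0)
    transformed_data.loc[is_included, column_name] = transformed_data.loc[is_included, column_name].round()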

If we look closely at the data, we notice that ``salary`` and ``annual_bonus`` are only a
multiple of 500 when the employee is not a "contractor". To take this requirement into
consideration, we refer to a "fixed" column ``contractor`` in order to know whether we
should apply this constraint or not. The access to ``contractor`` column will allow us
to properly transform and reverse transform the data.
    transformed_data[column_name] = transformed_data[column_name].multiply(increment).round(2)
    return transformed_data
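
The reverse step can be tried on its own with values that stand in for model output. The sketch below is a self-contained illustration, not the SDV's internal behavior: it writes the rounded values back into the table with ``.loc``, as the two steps above describe, and the ``modeled`` values are made up.

```python
import pandas as pd


def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]
    is_included = transformed_data[exclusion_column] == 0
    # Round only the non-contractor rows, writing the result back into the table.
    transformed_data.loc[is_included, column_name] = (
        transformed_data.loc[is_included, column_name].round()
    )
    # Scale every row back to the original range.
    transformed_data[column_name] = transformed_data[column_name].multiply(increment).round(2)
    return transformed_data


# Made-up values in the transformed space (salary divided by 500).
modeled = pd.DataFrame({
    'salary': [91.2, 90.7, 2.25],
    'contractor': [0, 0, 1],
})

restored = reverse_transform(['salary'], modeled.copy(), 500, 'contractor')
print(restored['salary'].tolist())  # [45500.0, 45500.0, 1125.0]
```

The two non-contractor salaries land on a multiple of 500, while the contractor's value is only scaled back and keeps its non-multiple amount.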

We write our functions to take as input:

- ``table_data`` which contains all the information.
- ``column`` which is an argument to represent the columns of interest.
Creating your class
~~~~~~~~~~~~~~~~~~~

Now we can construct our functions freely, we write our methods
with said arguments and be able to access ``'contractor'``.
Finally, we can put all the functionality together to create a class that describes our
constraint. Use the **create_custom_constraint** factory method to do this. It accepts your
functions as inputs and returns a class that's ready to use.

We first write our ``transform`` function as we have done previously:
You can name this class whatever you'd like. Since our constraint is similar to
``FixedIncrements``, let's call it ``FixedIncrementsWithExclusion``.

.. ipython:: python
:okwarning:

def transform(table_data, column):
base = 500.
table_data[column] = table_data[column] / base
return table_data
from sdv.constraints import create_custom_constraint

When it comes to defining ``reverse_transform``, we need to distinguish between
contractors and non contractors, the operations are as follows:
FixedIncrementsWithExclusion = create_custom_constraint(
is_valid_fn=is_valid,
transform_fn=transform, # optional
reverse_transform_fn=reverse_transform # optional
)

1. round values to four decimal points for contractors such that the end result will
be two decimal points after multiplying the result with 500.
2. round values to zero for employees that are not contractors such that the end
result will be a multiple of 500.

.. ipython:: python
:okwarning:
Using your custom constraint
----------------------------

def reverse_transform(table_data, column):
base = 500.
is_not_contractor = table_data.contractor == 0.
table_data[column] = table_data[column].round(4)
table_data[column].loc[is_not_contractor] = table_data[column].loc[is_not_contractor].round()
table_data[column] *= base
return table_data
Now that you have a class, you can use it like any other predefined constraint. Create an object
by passing in the parameters you defined. Note that you do not need to pass in the data.

We now stitch everything together and pass it to the model.
You can apply the same constraint to other columns by creating a different object. In our case
the **annual_bonus** column also follows the same logic.

.. ipython:: python
:okwarning:

from sdv.tabular import GaussianCopula

constraint = CustomConstraint(
columns=['salary', 'annual_bonus'],
transform=transform,
reverse_transform=reverse_transform
salary_divis_500 = FixedIncrementsWithExclusion(
column_names=['salary'],
increment=500,
exclusion_column='contractor'
)

gc = GaussianCopula(constraints=[constraint])

gc.fit(employees)

sampled = gc.sample(10)
bonus_divis_500 = FixedIncrementsWithExclusion(
column_names=['annual_bonus'],
increment=500,
exclusion_column='contractor'
)


When we view the ``sampled`` data, we should find that all the rows in the sampled
data have a salary that is a multiple of the base value with the exception
of "contractor" records.
Finally, pass these constraints to your model using the ``constraints`` parameter, just like you
would for predefined constraints.

.. ipython:: python
:okwarning:

sampled

This style gives flexibility to access any column in the table while still operating on
a column basis.


Can I write a ``CustomConstraint`` based on reject sampling?
------------------------------------------------------------

In the previous section, we defined our ``CustomConstraint`` using ``transform`` and
``reverse_transform`` functions. Sometimes, our constraints are not possible to implement
using these methods, that is when we rely on the ``reject_sampling`` strategy.
In ``reject_sampling`` we need to implement an ``is_valid`` function that identifies
which rows do not follow the said constraint, in our case, which rows are not a multiple
of the *base* unit.

We can define ``is_valid`` according to the three styles mentioned in the previous section:

1. function with ``table_data`` argument.
2. function with ``column_data`` argument.
3. function with ``table_data`` and ``column`` argument.

``is_valid`` should return a ``pd.Series`` where every valid row corresponds to *True*,
otherwise it should contain *False*. Here is an example of how you would define
``is_valid`` for each one of the mentioned styles:

.. code-block:: python

def is_valid(table_data):
base = 500.
return table_data['salary'] % base == 0
from sdv.tabular import GaussianCopula

def is_valid(column_data):
base = 500.
return column_data % base == 0
constraints = [
# you can add predefined constraints here too
salary_divis_500,
bonus_divis_500
]

def is_valid(table_data, column):
base = 500.
is_contractor = table_data.contractor == 1
valid = table_data[column] % base == 0
contractor_salary = table_data[column].loc[is_contractor]
valid.loc[is_contractor] = contractor_salary == contractor_salary.round(2)
return valid
model = GaussianCopula(constraints=constraints, min_value=None, max_value=None)

Then we construct ``CustomConstraint`` to take ``is_valid`` on its own.
model.fit(employees)

.. code-block:: python
Now, when you sample from the model, all rows of the synthetic data will follow the custom
constraint.

constraint = CustomConstraint(
columns=['salary', 'annual_bonus'],
is_valid=is_valid
)
.. ipython:: python
:okwarning:

synthetic_data = model.sample(num_rows=10)
synthetic_data