Add method to sample remaining columns (3/3) #708

Merged
Changes from 4 commits
37 changes: 22 additions & 15 deletions docs/user_guides/single_table/copulagan.rst
@@ -688,19 +688,23 @@ Conditional Sampling

 As the name implies, conditional sampling allows us to sample from a conditional
 distribution using the ``CopulaGAN`` model, which means we can generate only values that
-satisfy certain conditions. These conditional values can be passed to the ``conditions``
-parameter in the ``sample`` method either as a dataframe or a dictionary.
+satisfy certain conditions. These conditional values can be passed to the ``sample_conditions``
+method as a list of ``sdv.sampling.Condition`` objects or to the ``sample_remaining_columns`` method
+as a dataframe.

-In case a dictionary is passed, the model will generate as many rows as requested,
-all of which will satisfy the specified conditions, such as ``gender = M``.
+When specifying a ``sdv.sampling.Condition`` object, we can pass in the desired conditions
+as a dictionary, as well as specify the number of desired rows for that condition.

 .. ipython:: python
     :okwarning:

-    conditions = {
+    from sdv.sampling import Condition
+
+    condition = Condition({
         'gender': 'M'
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 It's also possible to condition on multiple columns, such as
@@ -709,14 +713,16 @@ It's also possible to condition on multiple columns, such as
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'gender': 'M',
         'experience_years': 0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)

+    model.sample_conditions(conditions=[condition])

-The ``conditions`` can also be passed as a dataframe. In that case, the model
+
+In the ``sample_remaining_columns`` method, ``conditions`` is
+passed as a dataframe. In that case, the model
 will generate one sample for each row of the dataframe, sorted in the same
 order. Since the model already knows how many samples to generate, passing
 it as a parameter is unnecessary. For example, if we want to generate three
@@ -731,7 +737,7 @@ following:
     conditions = pd.DataFrame({
         'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
     })
-    model.sample(conditions=conditions)
+    model.sample_remaining_columns(conditions)


 ``CopulaGAN`` also supports conditioning on continuous values, as long as the values
@@ -741,10 +747,11 @@ dataset are within 0 and 1, ``CopulaGAN`` will not be able to set this value to
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'degree_perc': 70.0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 .. note::
37 changes: 22 additions & 15 deletions docs/user_guides/single_table/ctgan.rst
@@ -499,19 +499,23 @@ Conditional Sampling

 As the name implies, conditional sampling allows us to sample from a conditional
 distribution using the ``CTGAN`` model, which means we can generate only values that
-satisfy certain conditions. These conditional values can be passed to the ``conditions``
-parameter in the ``sample`` method either as a dataframe or a dictionary.
+satisfy certain conditions. These conditional values can be passed to the ``sample_conditions``
+method as a list of ``sdv.sampling.Condition`` objects or to the ``sample_remaining_columns``
+method as a dataframe.

-In case a dictionary is passed, the model will generate as many rows as requested,
-all of which will satisfy the specified conditions, such as ``gender = M``.
+When specifying a ``sdv.sampling.Condition`` object, we can pass in the desired conditions
+as a dictionary, as well as specify the number of desired rows for that condition.

 .. ipython:: python
     :okwarning:

-    conditions = {
+    from sdv.sampling import Condition
+
+    condition = Condition({
         'gender': 'M'
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 It's also possible to condition on multiple columns, such as
@@ -520,14 +524,16 @@ It's also possible to condition on multiple columns, such as
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'gender': 'M',
         'experience_years': 0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)

+    model.sample_conditions(conditions=[condition])

-The ``conditions`` can also be passed as a dataframe. In that case, the model
+
+In the ``sample_remaining_columns`` method, ``conditions`` is
+passed as a dataframe. In that case, the model
 will generate one sample for each row of the dataframe, sorted in the same
 order. Since the model already knows how many samples to generate, passing
 it as a parameter is unnecessary. For example, if we want to generate three
@@ -542,7 +548,7 @@ following:
     conditions = pd.DataFrame({
         'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
     })
-    model.sample(conditions=conditions)
+    model.sample_remaining_columns(conditions)


 ``CTGAN`` also supports conditioning on continuous values, as long as the values
@@ -552,10 +558,11 @@ dataset are within 0 and 1, ``CTGAN`` will not be able to set this value to 1000
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'degree_perc': 70.0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 .. note::
37 changes: 22 additions & 15 deletions docs/user_guides/single_table/gaussian_copula.rst
@@ -648,19 +648,23 @@ Conditional Sampling

 As the name implies, conditional sampling allows us to sample from a conditional
 distribution using the ``GaussianCopula`` model, which means we can generate only values that
-satisfy certain conditions. These conditional values can be passed to the ``conditions``
-parameter in the ``sample`` method either as a dataframe or a dictionary.
+satisfy certain conditions. These conditional values can be passed to the ``sample_conditions``
+method as a list of ``sdv.sampling.Condition`` objects or to the ``sample_remaining_columns``
+method as a dataframe.

-In case a dictionary is passed, the model will generate as many rows as requested,
-all of which will satisfy the specified conditions, such as ``gender = M``.
+When specifying a ``sdv.sampling.Condition`` object, we can pass in the desired conditions
+as a dictionary, as well as specify the number of desired rows for that condition.

 .. ipython:: python
     :okwarning:

-    conditions = {
+    from sdv.sampling import Condition
+
+    condition = Condition({
         'gender': 'M'
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 It's also possible to condition on multiple columns, such as
@@ -669,14 +673,16 @@ It's also possible to condition on multiple columns, such as
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'gender': 'M',
         'experience_years': 0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)

+    model.sample_conditions(conditions=[condition])

-The ``conditions`` can also be passed as a dataframe. In that case, the model
+
+In the ``sample_remaining_columns`` method, ``conditions`` is
+passed as a dataframe. In that case, the model
 will generate one sample for each row of the dataframe, sorted in the same
 order. Since the model already knows how many samples to generate, passing
 it as a parameter is unnecessary. For example, if we want to generate three
@@ -691,7 +697,7 @@ following:
     conditions = pd.DataFrame({
         'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
     })
-    model.sample(conditions=conditions)
+    model.sample_remaining_columns(conditions)


 ``GaussianCopula`` also supports conditioning on continuous values, as long as the values
@@ -701,10 +707,11 @@ dataset are within 0 and 1, ``GaussianCopula`` will not be able to set this valu
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'degree_perc': 70.0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 .. note::
36 changes: 21 additions & 15 deletions docs/user_guides/single_table/tvae.rst
@@ -484,19 +484,22 @@ Conditional Sampling

 As the name implies, conditional sampling allows us to sample from a conditional
 distribution using the ``TVAE`` model, which means we can generate only values that
-satisfy certain conditions. These conditional values can be passed to the ``conditions``
-parameter in the ``sample`` method either as a dataframe or a dictionary.
+satisfy certain conditions. These conditional values can be passed to the ``sample_conditions``
+method as a list of ``sdv.sampling.Condition`` objects or to the ``sample_remaining_columns``
+method as a dataframe.

-In case a dictionary is passed, the model will generate as many rows as requested,
-all of which will satisfy the specified conditions, such as ``gender = M``.
+When specifying a ``sdv.sampling.Condition`` object, we can pass in the desired conditions as a dictionary, as well as specify the number of desired rows for that condition.

 .. ipython:: python
     :okwarning:

-    conditions = {
+    from sdv.sampling import Condition
+
+    condition = Condition({
         'gender': 'M'
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 It's also possible to condition on multiple columns, such as
@@ -505,14 +508,16 @@ It's also possible to condition on multiple columns, such as
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'gender': 'M',
         'experience_years': 0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)

+    model.sample_conditions(conditions=[condition])

-The ``conditions`` can also be passed as a dataframe. In that case, the model
+
+In the ``sample_remaining_columns`` method, ``conditions`` is
+passed as a dataframe. In that case, the model
 will generate one sample for each row of the dataframe, sorted in the same
 order. Since the model already knows how many samples to generate, passing
 it as a parameter is unnecessary. For example, if we want to generate three
@@ -527,7 +532,7 @@ following:
     conditions = pd.DataFrame({
         'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
     })
-    model.sample(conditions=conditions)
+    model.sample_remaining_columns(conditions)


 ``TVAE`` also supports conditioning on continuous values, as long as the values
@@ -537,10 +542,11 @@ dataset are within 0 and 1, ``TVAE`` will not be able to set this value to 1000.
 .. ipython:: python
     :okwarning:

-    conditions = {
+    condition = Condition({
         'degree_perc': 70.0
-    }
-    model.sample(5, conditions=conditions)
+    }, num_rows=5)
+
+    model.sample_conditions(conditions=[condition])


 .. note::
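
The user guide changes above describe two entry points: ``sample_conditions`` takes a list of conditions, each with its own ``num_rows``, while ``sample_remaining_columns`` takes a dataframe and returns one sampled row per input row. One way to picture how the two inputs relate is the sketch below. It does not use SDV: plain ``(column_values, num_rows)`` tuples stand in for ``sdv.sampling.Condition`` objects, a list of dicts stands in for the dataframe, and ``expand_conditions`` is a hypothetical helper that is not part of the library.

```python
def expand_conditions(conditions):
    """Expand (column_values, num_rows) pairs into one dict per requested row.

    Hypothetical helper for illustration only; it is not part of SDV.
    """
    rows = []
    for column_values, num_rows in conditions:
        # Copy the dict so that every output row is an independent object.
        rows.extend(dict(column_values) for _ in range(num_rows))
    return rows


# Three rows with gender M followed by three rows with gender F, mirroring the
# dataframe example in the user guides.
known_rows = expand_conditions([
    ({'gender': 'M'}, 3),
    ({'gender': 'F'}, 3),
])
print(known_rows)
```

Passing ``known_rows`` to ``pandas.DataFrame`` would give a table with one row per requested sample, which is the shape the dataframe example in the guides builds by hand.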
40 changes: 35 additions & 5 deletions sdv/tabular/base.py
@@ -409,7 +409,7 @@ def sample(self, num_rows, randomize_samples=True):
             num_rows (int):
                 Number of rows to sample. This parameter is required.
             randomize_samples (bool):
-                Whether or not to use a a fixed seed when sampling. Defaults
+                Whether or not to use a fixed seed when sampling. Defaults
                 to True.

         Returns:
@@ -443,13 +443,11 @@ def _sample_with_conditions(self, conditions, max_tries, batch_size_per_try):
             ValueError:
                 If any of the following happens:
                     * any of the conditions' columns are not valid.
-                    * `graceful_reject_sampling` is `False` and not enough valid rows could be
-                      sampled within `max_tries` trials.
                     * no rows could be generated.
         """
         for column in conditions.columns:
             if column not in self._metadata.get_fields():
-                raise ValueError(f'Error: Unexpected column name `{column}`. '
+                raise ValueError(f'Unexpected column name `{column}`. '
                                  f'Use a column name that was present in the original data.')

         try:
@@ -524,7 +522,7 @@ def sample_conditions(self, conditions, max_tries=100, batch_size_per_try=None,
                 The batch size to use per attempt at sampling. Defaults to 10 times
                 the number of rows.
             randomize_samples (bool):
-                Whether or not to use a a fixed seed when sampling. Defaults
+                Whether or not to use a fixed seed when sampling. Defaults
                 to True.

         Returns:
@@ -549,6 +547,38 @@ def sample_conditions(self, conditions, max_tries=100, batch_size_per_try=None,

         return sampled

+    def sample_remaining_columns(self, known_columns, max_tries=100, batch_size_per_try=None,
+                                 randomize_samples=True):
+        """Sample rows from this table.
+
+        Args:
+            known_columns (pandas.DataFrame):
+                A pandas.DataFrame with the columns that are already known. The output
+                is a DataFrame such that each row in the output is sampled
+                conditionally on the corresponding row in the input.
+            max_tries (int):
+                Number of times to try sampling discarded rows. Defaults to 100.
+            batch_size_per_try (int):
+                The batch size to use per attempt at sampling. Defaults to 10 times
+                the number of rows.
+            randomize_samples (bool):
+                Whether or not to use a fixed seed when sampling. Defaults
+                to True.
+
+        Returns:
+            pandas.DataFrame:
+                Sampled data.
+
+        Raises:
+            ConstraintsNotMetError:
+                If the conditions are not valid for the given constraints.
+            ValueError:
+                If any of the following happens:
+                    * any of the conditions' columns are not valid.
+                    * no rows could be generated.
+        """
+        return self._sample_with_conditions(known_columns, max_tries, batch_size_per_try)
+
     def _get_parameters(self):
         raise NonParametricError()

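
The docstrings in this diff mention ``max_tries``, ``batch_size_per_try``, an unexpected-column check and a "no rows could be generated" error, which suggests a reject-sampling loop behind ``_sample_with_conditions``. The toy sketch below illustrates that general idea only. It is not SDV's implementation: ``COLUMNS``, ``toy_sampler`` and ``sample_remaining`` are hypothetical names, rows are plain dicts rather than dataframe rows, and the per-row retry strategy is an assumption made to keep the example short.

```python
import random

# Hypothetical schema standing in for the fields known to the model's metadata.
COLUMNS = ('gender', 'experience_years')


def toy_sampler(rng):
    """Stand-in for an unconditional model sample (assumption, not SDV code)."""
    return {
        'gender': rng.choice(['M', 'F']),
        'experience_years': rng.randint(0, 3),
    }


def sample_remaining(known_rows, max_tries=100, batch_size_per_try=None, seed=0):
    """Return one sampled row per known row, using simple rejection sampling."""
    # Reject unknown column names before doing any sampling.
    for known in known_rows:
        for column in known:
            if column not in COLUMNS:
                raise ValueError(f'Unexpected column name `{column}`.')

    rng = random.Random(seed)
    # Mirror the documented default: 10 times the number of requested rows.
    batch_size = batch_size_per_try or 10 * len(known_rows)

    sampled = []
    for known in known_rows:
        for _ in range(max_tries):
            batch = [toy_sampler(rng) for _ in range(batch_size)]
            # Keep only the rows that agree with every known value.
            matches = [
                row for row in batch
                if all(row[column] == value for column, value in known.items())
            ]
            if matches:
                sampled.append(matches[0])
                break
        else:
            # Every attempt was rejected for this known row.
            raise ValueError('No rows could be generated.')
    return sampled


rows = sample_remaining([{'gender': 'M'}, {'gender': 'F'}, {'gender': 'F'}])
print(rows)
```

The output keeps the order of the input rows, matching the "one sample for each row of the dataframe, sorted in the same order" behaviour described in the user guides.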