User Guide code fixes #989

Merged (2 commits) on Sep 1, 2022
9 changes: 3 additions & 6 deletions docs/api_reference/metrics/relational.rst
@@ -35,12 +35,9 @@ Multi Table Statistical Metrics
CSTest
CSTest.get_subclasses
CSTest.compute
-KSTest
-KSTest.get_subclasses
-KSTest.compute
-KSTestExtended
-KSTestExtended.get_subclasses
-KSTestExtended.compute
+KSComplement
+KSComplement.get_subclasses
+KSComplement.compute

Multi Table Detection Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9 changes: 3 additions & 6 deletions docs/api_reference/metrics/tabular.rst
@@ -37,12 +37,9 @@ Single Table Statistical Metrics
CSTest
CSTest.get_subclasses
CSTest.compute
-KSTest
-KSTest.get_subclasses
-KSTest.compute
-KSTestExtended
-KSTestExtended.get_subclasses
-KSTestExtended.compute
+KSComplement
+KSComplement.get_subclasses
+KSComplement.compute
ContinuousKLDivergence
ContinuousKLDivergence.get_subclasses
ContinuousKLDivergence.compute
8 changes: 4 additions & 4 deletions docs/user_guides/evaluation/evaluation_framework.rst
@@ -98,21 +98,21 @@ are included within the SDV Evaluation framework. However, the list of
metrics that are applied can be controlled by passing a list with the
names of the metrics that you want to apply.

-For example, if you were interested on obtaining only the ``CSTest`` and
-``KSTest`` metrics you can call the ``evaluate`` function as follows:
+For example, if you were interested in obtaining only the ``CSTest``
+metric, you can call the ``evaluate`` function as follows:

.. ipython:: python
:okwarning:

-evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'])
+evaluate(synthetic_data, real_data, metrics=['CSTest'])


Or, if we want to see the scores separately:

.. ipython:: python
:okwarning:

-evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False)
+evaluate(synthetic_data, real_data, metrics=['CSTest'], aggregate=False)
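As a rough mental model of the ``aggregate`` flag (a sketch, not the SDV implementation): with ``aggregate=False`` you get one normalized score per metric, and with the default ``aggregate=True`` those normalized scores are averaged into a single number.

```python
# Hedged sketch of how per-metric scores relate to the single aggregate
# score returned by evaluate(). The scores below are illustrative values,
# not output from SDV.

def aggregate_scores(normalized_scores):
    """Average the normalized per-metric scores into one number."""
    return sum(normalized_scores.values()) / len(normalized_scores)

scores = {'CSTest': 0.80, 'KSComplement': 0.64}
print(aggregate_scores(scores))  # 0.72
```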


For more details about all the metrics that exist for the different data modalities
9 changes: 4 additions & 5 deletions docs/user_guides/evaluation/multi_table_metrics.rst
@@ -153,21 +153,20 @@ report back the average score obtained.
The list of such metrics is:

* ``CSTest``: Multi Single Table metric based on the Single Table CSTest metric.
-* ``KSTest``: Multi Single Table metric based on the Single Table KSTest metric.
-* ``KSTestExtended``: Multi Single Table metric based on the Single Table KSTestExtended metric.
+* ``KSComplement``: Multi Single Table metric based on the Single Table KSComplement metric.
* ``LogisticDetection``: Multi Single Table metric based on the Single Table LogisticDetection metric.
* ``SVCDetection``: Multi Single Table metric based on the Single Table SVCDetection metric.
* ``BNLikelihood``: Multi Single Table metric based on the Single Table BNLikelihood metric.
* ``BNLogLikelihood``: Multi Single Table metric based on the Single Table BNLogLikelihood metric.

-Let's try to use the ``KSTestExtended`` metric:
+Let's try to use the ``KSComplement`` metric:

.. ipython::
:verbatim:

-In [6]: from sdv.metrics.relational import KSTestExtended
+In [6]: from sdv.metrics.relational import KSComplement

-In [7]: KSTestExtended.compute(real_data, synthetic_data)
+In [7]: KSComplement.compute(real_data, synthetic_data)
Out[7]: 0.8194444444444443
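As the list above describes, each Multi Single Table metric applies its Single Table counterpart to every table and reports the average. A minimal sketch of that aggregation (the function and table names are illustrative, not the SDV internals):

```python
# Sketch: apply a single-table metric to each real/synthetic table pair
# and average the results, as the Multi Single Table metrics do.

def multi_single_table_score(real_tables, synthetic_tables, metric):
    scores = [
        metric(real_tables[name], synthetic_tables[name])
        for name in real_tables
    ]
    return sum(scores) / len(scores)

# Toy stand-in "metric": fraction of positions where the values match.
def toy_metric(real, synthetic):
    matches = sum(r == s for r, s in zip(real, synthetic))
    return matches / len(real)

real = {'users': [1, 2, 3], 'sessions': [4, 5]}
synthetic = {'users': [1, 2, 0], 'sessions': [4, 5]}
print(multi_single_table_score(real, synthetic, toy_metric))  # (2/3 + 1.0) / 2
```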

Parent Child Detection Metrics
12 changes: 6 additions & 6 deletions docs/user_guides/evaluation/single_table_metrics.rst
@@ -136,7 +136,7 @@ outcome from the test.

Such metrics are:

-* ``sdv.metrics.tabular.KSTest``: This metric uses the two-sample Kolmogorov–Smirnov test
+* ``sdv.metrics.tabular.KSComplement``: This metric uses the two-sample Kolmogorov–Smirnov test
to compare the distributions of continuous columns using the empirical CDF.
The output for each column is 1 minus the KS Test D statistic, which indicates the maximum
distance between the expected CDF and the observed CDF values.
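The score described above can be illustrated with a small self-contained sketch: 1 minus the two-sample Kolmogorov-Smirnov D statistic, computed with plain empirical CDFs (illustrative only; the actual SDV/SDMetrics implementation differs, e.g. in how it handles missing values):

```python
# Sketch of the KS complement score: 1 minus the maximum distance between
# the two empirical CDFs. Function names are illustrative, not SDV API.

def empirical_cdf(sample, x):
    """Fraction of values in `sample` that are less than or equal to x."""
    return sum(v <= x for v in sample) / len(sample)

def ks_complement(real_column, synthetic_column):
    # The D statistic is the maximum distance between the two empirical
    # CDFs; evaluating at every observed value is enough to find it.
    points = sorted(set(real_column) | set(synthetic_column))
    d_statistic = max(
        abs(empirical_cdf(real_column, x) - empirical_cdf(synthetic_column, x))
        for x in points
    )
    return 1 - d_statistic

print(ks_complement([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0 for identical samples
print(ks_complement([1, 2, 3], [10, 20, 30]))           # 0.0 for disjoint ranges
```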
@@ -150,16 +150,16 @@ Let us execute these two metrics on the loaded data:
.. ipython::
:verbatim:

-In [6]: from sdv.metrics.tabular import CSTest, KSTest
+In [6]: from sdv.metrics.tabular import CSTest, KSComplement

In [7]: CSTest.compute(real_data, synthetic_data)
Out[7]: 0.8078084931103922

-In [8]: KSTest.compute(real_data, synthetic_data)
+In [8]: KSComplement.compute(real_data, synthetic_data)
Out[8]: 0.6372093023255814

In each case, the statistical test will be executed on all the compatible columns (so, categorical
-or boolean columns for ``CSTest`` and numerical columns for ``KSTest``), and report the average
+or boolean columns for ``CSTest`` and numerical columns for ``KSComplement``), and report the average
score obtained.

.. note:: If your table does not contain any column of the compatible type, the output of
@@ -173,11 +173,11 @@ metric classes or their names:

In [9]: from sdv.evaluation import evaluate

-In [10]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False)
+In [10]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSComplement'], aggregate=False)
Out[10]:
metric name raw_score normalized_score min_value max_value goal
0 CSTest Chi-Squared 0.807808 0.807808 0.0 1.0 MAXIMIZE
-1 KSTest Inverted Kolmogorov-Smirnov D statistic 0.637209 0.637209 0.0 1.0 MAXIMIZE
+1 KSComplement Inverted Kolmogorov-Smirnov D statistic 0.637209 0.637209 0.0 1.0 MAXIMIZE


Likelihood Metrics
44 changes: 2 additions & 42 deletions docs/user_guides/single_table/copulagan.rst
@@ -346,44 +346,6 @@ Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our ``CopulaGAN`` Model in order to customize it to our needs.

-Setting Bounds and Specifying Rounding for Numerical Columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default, the model will learn the upper and lower bounds of the
-input data, and use that for sampling. This means that all sampled data
-will be between the maximum and minimum values found in the original
-dataset for each numeric column. This option can be overwritten using the
-``min_value`` and ``max_value`` model arguments. These values can either
-be set to a numeric value, set to ``'auto'`` which is the default setting,
-or set to ``None`` which will mean the column is boundless.
-
-The model will also learn the number of decimal places to round to by default.
-This option can be overwritten using the ``rounding`` parameter. The value can
-be an int specifying how many decimal places to round to, ``'auto'`` which is
-the default setting, or ``None`` which means the data will not be rounded.
-
-Since we may want to sample values outside of the ranges in the original data,
-let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model.
-To keep the number of decimals consistent across columns, we can set ``rounding``
-to be 2.
-
-.. ipython:: python
-:okwarning:
-
-model = CopulaGAN(
-primary_key='student_id',
-min_value=None,
-max_value=None,
-rounding=2
-)
-model.fit(data)
-
-unbounded_data = model.sample(10)
-unbounded_data
-
-As you may notice, the sampled data may have values outside the range of
-the original data.
-
Exploring the Probability Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -423,8 +385,7 @@ in our table. We can explore the distributions which the

model = CopulaGAN(
primary_key='student_id',
-min_value=None,
-max_value=None
+enforce_min_max_values=False
)
model.fit(data)
distributions = model.get_distributions()
@@ -520,8 +481,7 @@ Let's see what happens if we make the ``CopulaGAN`` use the
field_distributions={
'experience_years': 'gamma'
},
-min_value=None,
-max_value=None
+enforce_min_max_values=False
)
model.fit(data)

38 changes: 0 additions & 38 deletions docs/user_guides/single_table/ctgan.rst
@@ -345,44 +345,6 @@ Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our ``CTGAN`` Model in order to customize it to our needs.

-Setting Bounds and Specifying Rounding for Numerical Columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default, the model will learn the upper and lower bounds of the
-input data, and use that for sampling. This means that all sampled data
-will be between the maximum and minimum values found in the original
-dataset for each numeric column. This option can be overwritten using the
-``min_value`` and ``max_value`` model arguments. These values can either
-be set to a numeric value, set to ``'auto'`` which is the default setting,
-or set to ``None`` which will mean the column is boundless.
-
-The model will also learn the number of decimal places to round to by default.
-This option can be overwritten using the ``rounding`` parameter. The value can
-be an int specifying how many decimal places to round to, ``'auto'`` which is
-the default setting, or ``None`` which means the data will not be rounded.
-
-Since we may want to sample values outside of the ranges in the original data,
-let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model.
-To keep the number of decimals consistent across columns, we can set ``rounding``
-to be 2.
-
-.. ipython:: python
-:okwarning:
-
-model = CTGAN(
-primary_key='student_id',
-min_value=None,
-max_value=None,
-rounding=2
-)
-model.fit(data)
-
-unbounded_data = model.sample(10)
-unbounded_data
-
-As you may notice, the sampled data may have values outside the range of
-the original data.
-
How to modify the CTGAN Hyperparameters?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2 changes: 1 addition & 1 deletion docs/user_guides/single_table/custom_constraints.rst
@@ -174,7 +174,7 @@ would for predefined constraints.
bonus_divis_500
]

-model = GaussianCopula(constraints=constraints, min_value=None, max_value=None)
+model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

model.fit(employees)

44 changes: 2 additions & 42 deletions docs/user_guides/single_table/gaussian_copula.rst
@@ -350,44 +350,6 @@ Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our ``GaussianCopula`` Model in order to customize it to our needs.

-Setting Bounds and Specifying Rounding for Numerical Columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default, the model will learn the upper and lower bounds of the
-input data, and use that for sampling. This means that all sampled data
-will be between the maximum and minimum values found in the original
-dataset for each numeric column. This option can be overwritten using the
-``min_value`` and ``max_value`` model arguments. These values can either
-be set to a numeric value, set to ``'auto'`` which is the default setting,
-or set to ``None`` which will mean the column is boundless.
-
-The model will also learn the number of decimal places to round to by default.
-This option can be overwritten using the ``rounding`` parameter. The value can
-be an int specifying how many decimal places to round to, ``'auto'`` which is
-the default setting, or ``None`` which means the data will not be rounded.
-
-Since we may want to sample values outside of the ranges in the original data,
-let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model.
-To keep the number of decimals consistent across columns, we can set ``rounding``
-to be 2.
-
-.. ipython:: python
-:okwarning:
-
-model = GaussianCopula(
-primary_key='student_id',
-min_value=None,
-max_value=None,
-rounding=2
-)
-model.fit(data)
-
-unbounded_data = model.sample(10)
-unbounded_data
-
-As you may notice, the sampled data may have values outside the range of
-the original data.
-
Exploring the Probability Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -427,8 +389,7 @@ in our table. We can explore the distributions which the

model = GaussianCopula(
primary_key='student_id',
-min_value=None,
-max_value=None
+enforce_min_max_values=False
)
model.fit(data)
distributions = model.get_distributions()
@@ -526,8 +487,7 @@ Let's see what happens if we make the ``GaussianCopula`` use the
field_distributions={
'experience_years': 'gamma'
},
-min_value=None,
-max_value=None
+enforce_min_max_values=False
)
model.fit(data)

14 changes: 7 additions & 7 deletions docs/user_guides/single_table/handling_constraints.rst
@@ -129,8 +129,8 @@ datetime column name and value. It also expects an inequality relation that must
)

.. note::
-All SDV tabular models have min_value and max_value parameters that you set to enforce bounds
-on all columns. This constraint is redundant if you set these model parameters.
+All SDV tabular models have an enforce_min_max_values parameter that you set to enforce bounds
+on all columns. This constraint is redundant if you set this model parameter.

Positive and Negative
~~~~~~~~~~~~~~~~~~~~~
@@ -150,8 +150,8 @@ Enforce this by creating a Positive constraint. This object accepts a numerical
age_positive = Positive(column_name='age')

.. note::
-All SDV tabular models have min_value and max_value parameters that you set to enforce bounds
-on all columns. This constraint is redundant if you set these model parameters.
+All SDV tabular models have an enforce_min_max_values parameter that you set to enforce bounds
+on all columns. This constraint is redundant if you set this model parameter.

OneHotEncoding
~~~~~~~~~~~~~~
@@ -250,8 +250,8 @@ ranges are strict (exclusive) or not (inclusive).
)

.. note::
-All SDV tabular models have min_value and max_value parameters that you set to enforce bounds
-on all columns. This constraint is redundant if you set these model parameters.
+All SDV tabular models have an enforce_min_max_values parameter that you set to enforce bounds
+on all columns. This constraint is redundant if you set this model parameter.

Applying the Constraints
------------------------
@@ -272,7 +272,7 @@ to pass in the objects a list.
age_btwn_18_100
]

-model = GaussianCopula(constraints=constraints, min_value=None, max_value=None)
+model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

Then you can fit the model using the real data. During this process, the SDV ensures that the
model learns the constraints.
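The notes in this file say that ``enforce_min_max_values`` makes models keep sampled values inside the bounds observed during fitting. A rough sketch of that behavior (not SDV's actual implementation; function names are illustrative):

```python
# Sketch of min/max bound enforcement: learn the observed bounds during
# fitting, then clamp sampled values back into that range.

def learn_bounds(column):
    """Record the min and max seen in the fitting data."""
    return min(column), max(column)

def enforce_bounds(column, bounds):
    """Clamp each sampled value into the learned [lo, hi] range."""
    lo, hi = bounds
    return [min(max(v, lo), hi) for v in column]

bounds = learn_bounds([18, 25, 40, 67])
print(enforce_bounds([15, 30, 90], bounds))  # [18, 30, 67]
```

Setting ``enforce_min_max_values=False`` skips this clamping step, which is why the constraints above can then govern the valid ranges on their own.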
38 changes: 0 additions & 38 deletions docs/user_guides/single_table/tvae.rst
@@ -345,44 +345,6 @@ Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our ``TVAE`` Model in order to customize it to our needs.

-Setting Bounds and Specifying Rounding for Numerical Columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default, the model will learn the upper and lower bounds of the
-input data, and use that for sampling. This means that all sampled data
-will be between the maximum and minimum values found in the original
-dataset for each numeric column. This option can be overwritten using the
-``min_value`` and ``max_value`` model arguments. These values can either
-be set to a numeric value, set to ``'auto'`` which is the default setting,
-or set to ``None`` which will mean the column is boundless.
-
-The model will also learn the number of decimal places to round to by default.
-This option can be overwritten using the ``rounding`` parameter. The value can
-be an int specifying how many decimal places to round to, ``'auto'`` which is
-the default setting, or ``None`` which means the data will not be rounded.
-
-Since we may want to sample values outside of the ranges in the original data,
-let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model.
-To keep the number of decimals consistent across columns, we can set ``rounding``
-to be 2.
-
-.. ipython:: python
-:okwarning:
-
-model = TVAE(
-primary_key='student_id',
-min_value=None,
-max_value=None,
-rounding=2
-)
-model.fit(data)
-
-unbounded_data = model.sample(10)
-unbounded_data
-
-As you may notice, the sampled data may have values outside the range of
-the original data.
-
How to modify the TVAE Hyperparameters?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
