diff --git a/docs/api_reference/metrics/relational.rst b/docs/api_reference/metrics/relational.rst index c42ec89cc..f5cae5734 100644 --- a/docs/api_reference/metrics/relational.rst +++ b/docs/api_reference/metrics/relational.rst @@ -35,12 +35,9 @@ Multi Table Statistical Metrics CSTest CSTest.get_subclasses CSTest.compute - KSTest - KSTest.get_subclasses - KSTest.compute - KSTestExtended - KSTestExtended.get_subclasses - KSTestExtended.compute + KSComplement + KSComplement.get_subclasses + KSComplement.compute Multi Table Detection Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/api_reference/metrics/tabular.rst b/docs/api_reference/metrics/tabular.rst index 6badf6225..dda155e82 100644 --- a/docs/api_reference/metrics/tabular.rst +++ b/docs/api_reference/metrics/tabular.rst @@ -37,12 +37,9 @@ Single Table Statistical Metrics CSTest CSTest.get_subclasses CSTest.compute - KSTest - KSTest.get_subclasses - KSTest.compute - KSTestExtended - KSTestExtended.get_subclasses - KSTestExtended.compute + KSComplement + KSComplement.get_subclasses + KSComplement.compute ContinuousKLDivergence ContinuousKLDivergence.get_subclasses ContinuousKLDivergence.compute diff --git a/docs/user_guides/evaluation/evaluation_framework.rst b/docs/user_guides/evaluation/evaluation_framework.rst index 4757ca717..ae7f0662d 100644 --- a/docs/user_guides/evaluation/evaluation_framework.rst +++ b/docs/user_guides/evaluation/evaluation_framework.rst @@ -98,13 +98,13 @@ are included within the SDV Evaluation framework. However, the list of metrics that are applied can be controlled by passing a list with the names of the metrics that you want to apply. -For example, if you were interested on obtaining only the ``CSTest`` and -``KSTest`` metrics you can call the ``evaluate`` function as follows: +For example, if you were interested in obtaining only the ``CSTest`` +metric you can call the ``evaluate`` function as follows: .. 
ipython:: python :okwarning: - evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest']) + evaluate(synthetic_data, real_data, metrics=['CSTest']) Or, if we want to see the scores separately: @@ -112,7 +112,7 @@ Or, if we want to see the scores separately: .. ipython:: python :okwarning: - evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False) + evaluate(synthetic_data, real_data, metrics=['CSTest'], aggregate=False) For more details about all the metrics that exist for the different data modalities diff --git a/docs/user_guides/evaluation/multi_table_metrics.rst b/docs/user_guides/evaluation/multi_table_metrics.rst index 42f8af53b..f151e94d8 100644 --- a/docs/user_guides/evaluation/multi_table_metrics.rst +++ b/docs/user_guides/evaluation/multi_table_metrics.rst @@ -153,21 +153,20 @@ report back the average score obtained. The list of such metrics is: * ``CSTest``: Multi Single Table metric based on the Single Table CSTest metric. -* ``KSTest``: Multi Single Table metric based on the Single Table KSTest metric. -* ``KSTestExtended``: Multi Single Table metric based on the Single Table KSTestExtended metric. +* ``KSComplement``: Multi Single Table metric based on the Single Table KSComplement metric. * ``LogisticDetection``: Multi Single Table metric based on the Single Table LogisticDetection metric. * ``SVCDetection``: Multi Single Table metric based on the Single Table SVCDetection metric. * ``BNLikelihood``: Multi Single Table metric based on the Single Table BNLikelihood metric. * ``BNLogLikelihood``: Multi Single Table metric based on the Single Table BNLogLikelihood metric. -Let's try to use the ``KSTestExtended`` metric: +Let's try to use the ``KSComplement`` metric: .. 
ipython:: :verbatim: - In [6]: from sdv.metrics.relational import KSTestExtended + In [6]: from sdv.metrics.relational import KSComplement - In [7]: KSTestExtended.compute(real_data, synthetic_data) + In [7]: KSComplement.compute(real_data, synthetic_data) Out[7]: 0.8194444444444443 Parent Child Detection Metrics diff --git a/docs/user_guides/evaluation/single_table_metrics.rst b/docs/user_guides/evaluation/single_table_metrics.rst index eae3e4aa8..440a7ef5f 100644 --- a/docs/user_guides/evaluation/single_table_metrics.rst +++ b/docs/user_guides/evaluation/single_table_metrics.rst @@ -136,7 +136,7 @@ outcome from the test. Such metrics are: -* ``sdv.metrics.tabular.KSTest``: This metric uses the two-sample Kolmogorov–Smirnov test +* ``sdv.metrics.tabular.KSComplement``: This metric uses the two-sample Kolmogorov–Smirnov test to compare the distributions of continuous columns using the empirical CDF. The output for each column is 1 minus the KS Test D statistic, which indicates the maximum distance between the expected CDF and the observed CDF values. @@ -150,16 +150,16 @@ Let us execute these two metrics on the loaded data: .. ipython:: :verbatim: - In [6]: from sdv.metrics.tabular import CSTest, KSTest + In [6]: from sdv.metrics.tabular import CSTest, KSComplement In [7]: CSTest.compute(real_data, synthetic_data) Out[7]: 0.8078084931103922 - In [8]: KSTest.compute(real_data, synthetic_data) + In [8]: KSComplement.compute(real_data, synthetic_data) Out[8]: 0.6372093023255814 In each case, the statistical test will be executed on all the compatible column (so, categorical -or boolean columns for ``CSTest`` and numerical columns for ``KSTest``), and report the average +or boolean columns for ``CSTest`` and numerical columns for ``KSComplement``), and report the average score obtained. .. 
note:: If your table does not contain any column of the compatible type, the output of @@ -173,11 +173,11 @@ metric classes or their names: In [9]: from sdv.evaluation import evaluate - In [10]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False) + In [10]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSComplement'], aggregate=False) Out[10]: metric name raw_score normalized_score min_value max_value goal 0 CSTest Chi-Squared 0.807808 0.807808 0.0 1.0 MAXIMIZE - 1 KSTest Inverted Kolmogorov-Smirnov D statistic 0.637209 0.637209 0.0 1.0 MAXIMIZE + 1 KSComplement Inverted Kolmogorov-Smirnov D statistic 0.637209 0.637209 0.0 1.0 MAXIMIZE Likelihood Metrics diff --git a/docs/user_guides/single_table/copulagan.rst b/docs/user_guides/single_table/copulagan.rst index 5d788f4bf..64d0f5ee5 100644 --- a/docs/user_guides/single_table/copulagan.rst +++ b/docs/user_guides/single_table/copulagan.rst @@ -346,44 +346,6 @@ Now that we have discovered the basics, let's go over a few more advanced usage examples and see the different arguments that we can pass to our ``CopulaGAN`` Model in order to customize it to our needs. -Setting Bounds and Specifying Rounding for Numerical Columns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, the model will learn the upper and lower bounds of the -input data, and use that for sampling. This means that all sampled data -will be between the maximum and minimum values found in the original -dataset for each numeric column. This option can be overwritten using the -``min_value`` and ``max_value`` model arguments. These values can either -be set to a numeric value, set to ``'auto'`` which is the default setting, -or set to ``None`` which will mean the column is boundless. - -The model will also learn the number of decimal places to round to by default. -This option can be overwritten using the ``rounding`` parameter. 
The value can -be an int specifying how many decimal places to round to, ``'auto'`` which is -the default setting, or ``None`` which means the data will not be rounded. - -Since we may want to sample values outside of the ranges in the original data, -let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model. -To keep the number of decimals consistent across columns, we can set ``rounding`` -to be 2. - -.. ipython:: python - :okwarning: - - model = CopulaGAN( - primary_key='student_id', - min_value=None, - max_value=None, - rounding=2 - ) - model.fit(data) - - unbounded_data = model.sample(10) - unbounded_data - -As you may notice, the sampled data may have values outside the range of -the original data. - Exploring the Probability Distributions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -423,8 +385,7 @@ in our table. We can explore the distributions which the model = CopulaGAN( primary_key='student_id', - min_value=None, - max_value=None + enforce_min_max_values=False ) model.fit(data) distributions = model.get_distributions() @@ -520,8 +481,7 @@ Let's see what happens if we make the ``CopulaGAN`` use the field_distributions={ 'experience_years': 'gamma' }, - min_value=None, - max_value=None + enforce_min_max_values=False ) model.fit(data) diff --git a/docs/user_guides/single_table/ctgan.rst b/docs/user_guides/single_table/ctgan.rst index 863dbbaf1..1c5a1d97f 100644 --- a/docs/user_guides/single_table/ctgan.rst +++ b/docs/user_guides/single_table/ctgan.rst @@ -345,44 +345,6 @@ Now that we have discovered the basics, let's go over a few more advanced usage examples and see the different arguments that we can pass to our ``CTGAN`` Model in order to customize it to our needs. -Setting Bounds and Specifying Rounding for Numerical Columns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, the model will learn the upper and lower bounds of the -input data, and use that for sampling. 
This means that all sampled data -will be between the maximum and minimum values found in the original -dataset for each numeric column. This option can be overwritten using the -``min_value`` and ``max_value`` model arguments. These values can either -be set to a numeric value, set to ``'auto'`` which is the default setting, -or set to ``None`` which will mean the column is boundless. - -The model will also learn the number of decimal places to round to by default. -This option can be overwritten using the ``rounding`` parameter. The value can -be an int specifying how many decimal places to round to, ``'auto'`` which is -the default setting, or ``None`` which means the data will not be rounded. - -Since we may want to sample values outside of the ranges in the original data, -let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model. -To keep the number of decimals consistent across columns, we can set ``rounding`` -to be 2. - -.. ipython:: python - :okwarning: - - model = CTGAN( - primary_key='student_id', - min_value=None, - max_value=None, - rounding=2 - ) - model.fit(data) - - unbounded_data = model.sample(10) - unbounded_data - -As you may notice, the sampled data may have values outside the range of -the original data. - How to modify the CTGAN Hyperparameters? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/user_guides/single_table/custom_constraints.rst b/docs/user_guides/single_table/custom_constraints.rst index 33accf124..66f482724 100644 --- a/docs/user_guides/single_table/custom_constraints.rst +++ b/docs/user_guides/single_table/custom_constraints.rst @@ -174,7 +174,7 @@ would for predefined constraints. 
bonus_divis_500 ] - model = GaussianCopula(constraints=constraints, min_value=None, max_value=None) + model = GaussianCopula(constraints=constraints, enforce_min_max_values=False) model.fit(employees) diff --git a/docs/user_guides/single_table/gaussian_copula.rst b/docs/user_guides/single_table/gaussian_copula.rst index a251485dc..af8526c40 100644 --- a/docs/user_guides/single_table/gaussian_copula.rst +++ b/docs/user_guides/single_table/gaussian_copula.rst @@ -350,44 +350,6 @@ Now that we have discovered the basics, let's go over a few more advanced usage examples and see the different arguments that we can pass to our ``GaussianCopula`` Model in order to customize it to our needs. -Setting Bounds and Specifying Rounding for Numerical Columns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, the model will learn the upper and lower bounds of the -input data, and use that for sampling. This means that all sampled data -will be between the maximum and minimum values found in the original -dataset for each numeric column. This option can be overwritten using the -``min_value`` and ``max_value`` model arguments. These values can either -be set to a numeric value, set to ``'auto'`` which is the default setting, -or set to ``None`` which will mean the column is boundless. - -The model will also learn the number of decimal places to round to by default. -This option can be overwritten using the ``rounding`` parameter. The value can -be an int specifying how many decimal places to round to, ``'auto'`` which is -the default setting, or ``None`` which means the data will not be rounded. - -Since we may want to sample values outside of the ranges in the original data, -let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model. -To keep the number of decimals consistent across columns, we can set ``rounding`` -to be 2. - -.. 
ipython:: python - :okwarning: - - model = GaussianCopula( - primary_key='student_id', - min_value=None, - max_value=None, - rounding=2 - ) - model.fit(data) - - unbounded_data = model.sample(10) - unbounded_data - -As you may notice, the sampled data may have values outside the range of -the original data. - Exploring the Probability Distributions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -427,8 +389,7 @@ in our table. We can explore the distributions which the model = GaussianCopula( primary_key='student_id', - min_value=None, - max_value=None + enforce_min_max_values=False ) model.fit(data) distributions = model.get_distributions() @@ -526,8 +487,7 @@ Let's see what happens if we make the ``GaussianCopula`` use the field_distributions={ 'experience_years': 'gamma' }, - min_value=None, - max_value=None + enforce_min_max_values=False ) model.fit(data) diff --git a/docs/user_guides/single_table/handling_constraints.rst b/docs/user_guides/single_table/handling_constraints.rst index c00e46dc3..fafcae0a3 100644 --- a/docs/user_guides/single_table/handling_constraints.rst +++ b/docs/user_guides/single_table/handling_constraints.rst @@ -129,8 +129,8 @@ datetime column name and value. It also expects an inequality relation that must ) .. note:: - All SDV tabular models have min_value and max_value parameters that you set to enforce bounds - on all columns. This constraint is redundant if you set these model parameters. + All SDV tabular models have an enforce_min_max_values parameter that you set to enforce bounds + on all columns. This constraint is redundant if you set this model parameter. Positive and Negative ~~~~~~~~~~~~~~~~~~~~~ @@ -150,8 +150,8 @@ Enforce this by creating a Positive constraint. This object accepts a numerical age_positive = Positive(column_name='age') .. note:: - All SDV tabular models have min_value and max_value parameters that you set to enforce bounds - on all columns. This constraint is redundant if you set these model parameters. 
+ All SDV tabular models have an enforce_min_max_values parameter that you set to enforce bounds + on all columns. This constraint is redundant if you set this model parameter. OneHotEncoding ~~~~~~~~~~~~~~ @@ -250,8 +250,8 @@ ranges are strict (exclusive) or not (inclusive). ) .. note:: - All SDV tabular models have min_value and max_value parameters that you set to enforce bounds - on all columns. This constraint is redundant if you set these model parameters. + All SDV tabular models have an enforce_min_max_values parameter that you set to enforce bounds + on all columns. This constraint is redundant if you set this model parameter. Applying the Constraints ------------------------ @@ -272,7 +272,7 @@ to pass in the objects a list. age_btwn_18_100 ] - model = GaussianCopula(constraints=constraints, min_value=None, max_value=None) + model = GaussianCopula(constraints=constraints, enforce_min_max_values=False) Then you can fit the model using the real data. During this process, the SDV ensures that the model learns the constraints. diff --git a/docs/user_guides/single_table/tvae.rst b/docs/user_guides/single_table/tvae.rst index 3c93cb71f..bbe0edc45 100644 --- a/docs/user_guides/single_table/tvae.rst +++ b/docs/user_guides/single_table/tvae.rst @@ -345,44 +345,6 @@ Now that we have discovered the basics, let's go over a few more advanced usage examples and see the different arguments that we can pass to our ``CTGAN`` Model in order to customize it to our needs. -Setting Bounds and Specifying Rounding for Numerical Columns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, the model will learn the upper and lower bounds of the -input data, and use that for sampling. This means that all sampled data -will be between the maximum and minimum values found in the original -dataset for each numeric column. This option can be overwritten using the -``min_value`` and ``max_value`` model arguments. 
These values can either -be set to a numeric value, set to ``'auto'`` which is the default setting, -or set to ``None`` which will mean the column is boundless. - -The model will also learn the number of decimal places to round to by default. -This option can be overwritten using the ``rounding`` parameter. The value can -be an int specifying how many decimal places to round to, ``'auto'`` which is -the default setting, or ``None`` which means the data will not be rounded. - -Since we may want to sample values outside of the ranges in the original data, -let's pass the ``min_value`` and ``max_value`` arguments as `None` to the model. -To keep the number of decimals consistent across columns, we can set ``rounding`` -to be 2. - -.. ipython:: python - :okwarning: - - model = TVAE( - primary_key='student_id', - min_value=None, - max_value=None, - rounding=2 - ) - model.fit(data) - - unbounded_data = model.sample(10) - unbounded_data - -As you may notice, the sampled data may have values outside the range of -the original data. - How to modify the TVAE Hyperparameters? 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/sdv/lite/tabular.py b/sdv/lite/tabular.py index 74dac132b..897e9a127 100644 --- a/sdv/lite/tabular.py +++ b/sdv/lite/tabular.py @@ -82,22 +82,22 @@ def __init__(self, name=None, metadata=None, constraints=None): dtype_transformers = { 'i': rdt.transformers.FloatFormatter( - missing_value_replacement='mean' if self._null_column else None, + missing_value_replacement='mean', model_missing_values=transformer_model_missing_values, enforce_min_max_values=True, ), 'f': rdt.transformers.FloatFormatter( - missing_value_replacement='mean' if self._null_column else None, + missing_value_replacement='mean', model_missing_values=transformer_model_missing_values, enforce_min_max_values=True, ), 'O': rdt.transformers.FrequencyEncoder(add_noise=True), 'b': rdt.transformers.BinaryEncoder( - missing_value_replacement=-1 if self._null_column else None, + missing_value_replacement=-1 if self._null_column else 'mode', model_missing_values=transformer_model_missing_values, ), 'M': rdt.transformers.UnixTimestampEncoder( - missing_value_replacement='mean' if self._null_column else None, + missing_value_replacement='mean' if self._null_column else 'mode', model_missing_values=transformer_model_missing_values, ), } diff --git a/tutorials/evaluation/Evaluating_Synthetic_Data.ipynb b/tutorials/evaluation/Evaluating_Synthetic_Data.ipynb index ed3efd37a..5493125a0 100644 --- a/tutorials/evaluation/Evaluating_Synthetic_Data.ipynb +++ b/tutorials/evaluation/Evaluating_Synthetic_Data.ipynb @@ -687,8 +687,8 @@ "metrics that are applied can be controlled by passing a list with the\n", "names of the metrics that you want to apply.\n", "\n", - "For example, if you were interested on obtaining only the `CSTest` and\n", - "`KSComplement` metrics you can call the `evaluate` function as follows:" + "For example, if you were interested in obtaining only the `CSTest`\n", + "metric you can call the `evaluate` function as follows:" ] }, { @@ -715,7 
@@ } ], "source": [ - "evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSComplement'])" + "evaluate(synthetic_data, real_data, metrics=['CSTest'])" ] }, { @@ -808,7 +808,7 @@ } ], "source": [ - "evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSComplement'], aggregate=False)" + "evaluate(synthetic_data, real_data, metrics=['CSTest'], aggregate=False)" ] }, { @@ -822,7 +822,7 @@ " returns the average of the `p-values` obtained across all the\n", " columns. If the tables that you are evaluating do not contain any\n", " categorical columns the result will be `nan`.\n", - "- `kstest`: This metric compares the distributions of all the\n", + "- `kscomplement`: This metric compares the distributions of all the\n", " numerical columns of the table with a two-sample Kolmogorov-Smirnov\n", " test using the empirical CDF and returns the average of the\n", " KS statistic values obtained across all the columns. If the tables\n",
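Reviewer note on the rename: the documentation hunks above describe ``KSComplement`` as 1 minus the two-sample Kolmogorov-Smirnov D statistic, computed per numerical column. A minimal pure-Python sketch of that statistic, for intuition only (the real SDMetrics implementation additionally handles NaNs, dtypes, and column selection):

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: the maximum absolute gap between
    the two empirical CDFs, evaluated at every observed value."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)  # fraction of a at or below x
        f_b = bisect_right(b, x) / len(b)  # fraction of b at or below x
        d = max(d, abs(f_a - f_b))
    return d

def ks_complement(real_column, synthetic_column):
    """Score in [0, 1]: 1.0 for identical empirical distributions,
    0.0 for completely disjoint supports."""
    return 1.0 - ks_statistic(real_column, synthetic_column)
```

Identical samples score 1.0 and disjoint samples score 0.0, which matches the MAXIMIZE goal and the [0.0, 1.0] bounds shown in the ``evaluate`` output above.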
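Likewise, several hunks above swap ``min_value=None, max_value=None`` for ``enforce_min_max_values=False``. A sketch of the behavior that flag controls, using a hypothetical helper class (not the actual SDV/RDT transformer): bounds are learned at fit time and generated values are clipped back into them.

```python
class MinMaxEnforcer:
    """Illustrative stand-in for enforce_min_max_values=True:
    learn a column's observed bounds, then clip generated values."""

    def fit(self, values):
        # Record the observed bounds of the real column.
        self._min = min(values)
        self._max = max(values)
        return self

    def enforce(self, sampled):
        # Clip every sampled value into the observed [min, max] range.
        return [min(max(v, self._min), self._max) for v in sampled]
```

With ``enforce_min_max_values=False``, as in the constraint examples above, this clipping step is skipped, so sampled values may fall outside the range observed in the real data.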