Commit: Add tabular preset docs (#777)
katxiao authored Apr 29, 2022
1 parent 9808f8a commit 81b2df6
Showing 3 changed files with 255 additions and 4 deletions.
10 changes: 10 additions & 0 deletions docs/user_guides/single_table/models.rst
@@ -2,10 +2,20 @@

Models
======
In this section, you'll find information about using synthetic data models for single table data.

.. note::

    Is this your first time using the SDV?

    We recommend starting with the new :ref:`tabular_preset` model. This model comes
    pre-configured so you can spend less time choosing parameters or tuning a model, and more
    time using your synthetic data.


.. toctree::
    :maxdepth: 2

    tabular_preset
    gaussian_copula
    ctgan
    copulagan
241 changes: 241 additions & 0 deletions docs/user_guides/single_table/tabular_preset.rst
@@ -0,0 +1,241 @@
.. _tabular_preset:

Tabular Preset
==============

The ``TabularPreset`` is a tabular model that comes with pre-configured settings. This
is meant for users who want to get started with using synthetic data and spend
less time worrying about which model to choose or how to tune its parameters.

.. note::

    We are currently Beta testing our speed-optimized machine learning preset. Help us by
    testing the model and `filing issues <https://github.com/sdv-dev/SDV/issues/new/choose>`__
    for any bugs or feature requests you may have.

What is the FAST_ML preset?
---------------------------

The ``FAST_ML`` preset is our first preset. It uses machine learning (ML) to model your data
while optimizing for the modeling time. This is a great choice if it's your first time using
the SDV for a large custom dataset or if you're exploring the benefits of using ML to create
synthetic data.

What will you get with this preset?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- This preset optimizes for the modeling time while still applying machine learning to model
  and generate synthetic data.
- Your synthetic data will capture correlations that exist between the columns of the original
  data.
- Your synthetic data will adhere to the basic statistical properties of the original columns:
  min/max values, averages and standard deviations.

While other SDV models may create higher quality synthetic data, they will take longer.
Using the ``FAST_ML`` preset allows you to get started with ML to create synthetic data
right now.

Quick Usage
-----------

Preparation
~~~~~~~~~~~

To use this preset, you must have:

1. Your data, loaded as a pandas DataFrame, and
2. (Optional but strongly recommended) A metadata file that describes the columns of your dataset

For this guide, we'll load the demo data and metadata from the SDV. This data contains
information about students, including their grades, major and work experience.

.. ipython:: python
    :okwarning:

    from sdv.demo import load_tabular_demo

    metadata, data = load_tabular_demo('student_placements', metadata=True)
    data.head()

If you want to use your custom dataset, you can load it using pandas.
For example, if your data is available as a CSV file, you can use the ``read_csv`` method.
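
As a minimal sketch (the file name below is hypothetical), loading a CSV file into a pandas
DataFrame might look like this:

.. code-block:: python

    import pandas as pd

    # Load your own dataset; replace the path with the location of your CSV file
    data = pd.read_csv('my_dataset.csv')

    # Preview the first few rows
    data.head()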

You can write your metadata as a dictionary. Follow the
`Metadata guide <https://sdv.dev/SDV/developer_guides/sdv/metadata.html#table>`__ to create
a dictionary for a single table. For example, the metadata for our table looks
something like this:

.. code-block:: python

    {
        'fields': {
            'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
            'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
            'salary': {'type': 'numerical', 'subtype': 'integer'},
            'duration': {'type': 'categorical'},
            'student_id': {'type': 'id', 'subtype': 'integer'},
            'high_perc': {'type': 'numerical', 'subtype': 'float'},
            'high_spec': {'type': 'categorical'},
            'mba_spec': {'type': 'categorical'},
            'second_perc': {'type': 'numerical', 'subtype': 'float'},
            'gender': {'type': 'categorical'},
            'degree_perc': {'type': 'numerical', 'subtype': 'float'},
            'placed': {'type': 'boolean'},
            'experience_years': {'type': 'numerical', 'subtype': 'integer'},
            'employability_perc': {'type': 'numerical', 'subtype': 'float'},
            'mba_perc': {'type': 'numerical', 'subtype': 'float'},
            'work_experience': {'type': 'boolean'},
            'degree_type': {'type': 'categorical'}
        },
        'constraints': [],
        'primary_key': 'student_id'
    }

Modeling
~~~~~~~~

Pass in your metadata to create the ``TabularPreset`` FAST_ML model.

.. ipython:: python
    :okwarning:

    from sdv.lite import TabularPreset

    # Use the FAST_ML preset to optimize for modeling time
    model = TabularPreset(name='FAST_ML', metadata=metadata)

Then, simply pass in your data to train the model.

.. ipython:: python
    :okwarning:

    model.fit(data)

The modeling step is optimized for speed. The exact time it takes depends on several factors,
including the number of rows, columns and distinct categories in categorical columns. As a
rough benchmark, our analysis shows that:

- Datasets with around 100K rows and 50-100 columns will take a few minutes to model
- Larger datasets with around 1M rows and hundreds of columns may take closer to an hour
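
If you want a rough sense of the modeling time on your own dataset, a minimal sketch (reusing
the ``model`` and ``data`` objects created above) is:

.. code-block:: python

    import time

    # Time a single call to fit() as a simple benchmark
    start = time.time()
    model.fit(data)
    print(f'Modeling took {time.time() - start:.1f} seconds')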

After you are finished modeling, you can save the fitted model and load it in again for future use.

.. ipython:: python
    :okwarning:

    # save the model in a new file
    model.save('fast_ml_model.pkl')

    # later, you can load it in again
    model = TabularPreset.load('fast_ml_model.pkl')

Sampling
~~~~~~~~

Once you have your model, you can begin to create synthetic data. Use the ``sample`` method and
pass in the number of rows you want to synthesize.

.. ipython:: python
    :okwarning:

    synthetic_data = model.sample(num_rows=100)
    synthetic_data.head()

To create large amounts of synthetic data, provide a ``batch_size``. This breaks up the sampling
into multiple batches and shows a progress bar. Use the ``output_file_path`` parameter to write
results to a file.

.. ipython:: python
    :okwarning:

    model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')

Conditional Sampling
~~~~~~~~~~~~~~~~~~~~

The model generates new synthetic data: synthetic rows that do not correspond to rows in the
original data. But sometimes you may want to fix some of the values.

For example, you might only be interested in synthesizing science and commerce students with
work experience. Using **conditional sampling**, you can specify the exact, fixed values that
you need. The SDV model will then synthesize the rest of the data.

First, use the ``Condition`` object to specify the exact values you want. Provide a dictionary
mapping column names to their fixed values, along with the number of rows to synthesize.

.. ipython:: python
    :okwarning:

    from sdv.sampling.tabular import Condition

    # 100 science students with work experience
    science_students = Condition(
        column_values={'high_spec': 'Science', 'work_experience': True}, num_rows=100)

    # 200 commerce students with work experience
    commerce_students = Condition(
        column_values={'high_spec': 'Commerce', 'work_experience': True}, num_rows=200)

You can now use the ``sample_conditions`` method and pass in a list of conditions.

.. ipython:: python
    :okwarning:

    all_conditions = [science_students, commerce_students]
    model.sample_conditions(conditions=all_conditions)

Advanced Usage
--------------

Adding Constraints
~~~~~~~~~~~~~~~~~~

A constraint is a logical business rule that must be met by every row in your dataset.

In most cases, the preset is able to learn a general trend and create synthetic data where *most*
of the rows follow the rule. Use a constraint if you want to enforce that *all* of the rows
must follow the rule.

In our dataset, we have a constraint: if ``experience_years`` is 0, then ``work_experience``
must be ``False``; otherwise, ``work_experience`` must be ``True``. We can describe this using a
`ColumnFormula constraint <https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#columnformula-constraint>`__.

.. ipython:: python
    :okwarning:

    from sdv.constraints import ColumnFormula

    # define the formula for computing work experience
    def calculate_work_experience(data):
        return data['experience_years'] > 0

    # use the formula when defining the constraint
    work_constraint = ColumnFormula(
        column='work_experience',
        formula=calculate_work_experience,
    )

You can pass constraints into the preset when creating your model.

.. ipython:: python
    :okwarning:

    constrained_model = TabularPreset(
        name='FAST_ML',
        metadata=metadata,
        constraints=[work_constraint],
    )
    constrained_model.fit(data)

When you sample from the model, the synthetic data will follow the constraints.

.. ipython:: python
    :okwarning:

    constrained_synthetic_data = constrained_model.sample(num_rows=1_000)
    constrained_synthetic_data.head(10)

To read more about defining constraints, see the
`Handling Constraints User Guide <https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html>`__.

Resources
---------

The SDV (Synthetic Data Vault) is an open source project built & maintained by DataCebo. It is
free to use under the MIT License.

For other resources, see our `GitHub <https://github.com/sdv-dev/SDV>`__,
`Docs <https://sdv.dev/SDV/>`__, and `Blog <https://sdv.dev/blog/>`__.
8 changes: 4 additions & 4 deletions sdv/lite/tabular.py
@@ -81,25 +81,25 @@ def __init__(self, name=None, metadata=None, constraints=None):
         dtype_transformers = {
             'i': rdt.transformers.NumericalTransformer(
                 dtype=np.int64,
-                nan='mean' if transformer_null_column else None,
+                nan='mean' if self._null_column else None,
                 null_column=transformer_null_column,
                 min_value='auto',
                 max_value='auto',
             ),
             'f': rdt.transformers.NumericalTransformer(
                 dtype=np.float64,
-                nan='mean' if transformer_null_column else None,
+                nan='mean' if self._null_column else None,
                 null_column=transformer_null_column,
                 min_value='auto',
                 max_value='auto',
             ),
             'O': rdt.transformers.CategoricalTransformer(fuzzy=True),
             'b': rdt.transformers.BooleanTransformer(
-                nan=-1 if transformer_null_column else None,
+                nan=-1 if self._null_column else None,
                 null_column=transformer_null_column,
             ),
             'M': rdt.transformers.DatetimeTransformer(
-                nan='mean' if transformer_null_column else None,
+                nan='mean' if self._null_column else None,
                 null_column=transformer_null_column,
             ),
         }
