Create a speed optimized Preset #716

Closed
npatki opened this issue Feb 25, 2022 · 0 comments · Fixed by #720 or #737
Labels: data:single-table (Related to tabular datasets), feature request (Request for a new feature)

npatki commented Feb 25, 2022

Problem Description

After adding the framework in #715, we should add a configuration that allows the user to optimize a tabular model for speed.

Expected behavior

A speed optimized tabular model will:

  • Use the GaussianCopula
  • Not round numerical data
  • Enforce min/max values on numerical data
  • Use the CategoricalTransformer (to be renamed: FrequencyEncoder) on any categorical columns
  • Not model null values; instead, randomly generate null values based on observed proportions
  • Default all univariate distributions to gaussian instead of performing a search
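Taken together, the speed preset amounts to a fixed bundle of settings. As a rough sketch (the key names below are illustrative, not SDV's actual parameters), the configuration could be captured as a plain mapping:

```python
# Hypothetical sketch of the speed-preset settings listed above.
# Key names are illustrative only; they are not SDV's actual API.
SPEED_PRESET = {
    "model": "GaussianCopula",
    "rounding": None,                    # do not round numerical data
    "enforce_min_max": True,             # keep numerical data within observed min/max
    "categorical_transformer": "FrequencyEncoder",
    "model_nulls": False,                # sample nulls from observed proportions instead
    "default_distribution": "gaussian",  # skip the per-column distribution search
}

def describe_preset(preset):
    """Return a short human-readable summary of a preset mapping."""
    return ", ".join(f"{key}={value!r}" for key, value in preset.items())
```

Bundling the settings this way keeps the user-facing surface to a single preset name, which is the point of the `optimize_for` parameter below.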

API

The TabularPreset will accept 2 parameters:

  • (required) optimize_for, the name of the preset. We'll add "SPEED"
  • (optional) metadata, a dictionary or TableMetadata object, similar to tabular models today

It will print out hard-coded info about the configuration and benchmarks.

>>> from sdv.lite import TabularPreset

# the PresetModel will not allow too many extra settings, making it simple to use
>>> model = TabularPreset(optimize_for='SPEED', metadata=my_metadata)
Info: This config optimizes the modeling speed above all else.

Your exact runtime is dependent on the data. Benchmarks:
100K rows and 100 columns may take around 1 minute.
1M rows and 250 columns may take around 30 minutes.

# these work like any other tabular model
>>> model.fit(my_data)
>>> synthetic_data = model.sample(num_rows=100)

Edge Cases

If the user does not pass in metadata, we should still run the model, but we should provide a warning:

>>> from sdv.lite import TabularPreset
>>> model = TabularPreset(optimize_for='SPEED')
Warning: No metadata provided. Metadata will be automatically detected from your data. This process may
not be accurate. We recommend writing metadata to ensure correct data handling.

Info: This config optimizes the modeling speed above all else.

Your exact runtime is dependent on the data. Benchmarks:
100K rows and 100 columns may take around 1 minute.
1M rows and 250 columns may take around 30 minutes.

If the user does not pass in the required optimization parameter, throw an error.

>>> from sdv.lite import TabularPreset
>>> model = TabularPreset()
Error: You must provide the name of a preset using the 'optimize_for' parameter. Use 
'TabularPreset.list_available_presets()' to browse through the options.
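Both edge cases above boil down to simple argument validation at construction time. A minimal sketch of that logic (a hypothetical helper, not the actual SDV implementation):

```python
import warnings

# Hypothetical registry of preset names; the issue proposes adding "SPEED".
AVAILABLE_PRESETS = {"SPEED"}

def validate_preset_args(optimize_for, metadata=None):
    """Mirror the edge-case behavior described above:
    error on a missing or unknown preset, warn on missing metadata."""
    if optimize_for is None:
        raise ValueError(
            "You must provide the name of a preset using the 'optimize_for' "
            "parameter. Use 'TabularPreset.list_available_presets()' to "
            "browse through the options."
        )
    if optimize_for not in AVAILABLE_PRESETS:
        raise ValueError(f"Unknown preset: {optimize_for!r}")
    if metadata is None:
        warnings.warn(
            "No metadata provided. Metadata will be automatically detected "
            "from your data. This process may not be accurate. We recommend "
            "writing metadata to ensure correct data handling."
        )
```

Raising for a missing preset while only warning for missing metadata matches the spirit of the examples: the preset name is required, while metadata can fall back to auto-detection.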