Create a speed optimized Preset #716

Closed
npatki opened this issue Feb 25, 2022 · 0 comments · Fixed by #720 or #737
Labels: data:single-table (Related to tabular datasets), feature request (Request for a new feature)

npatki commented Feb 25, 2022

Problem Description

After adding the framework in #715, we should add a configuration that allows the user to optimize a tabular model for speed.

Expected behavior

A speed optimized tabular model will:

  • Use the GaussianCopula
  • Not round numerical data
  • Enforce min/max values on numerical data
  • Use the CategoricalTransformer (to be renamed: FrequencyEncoder) on any categorical columns
  • Not model null values; instead, randomly generate null values based on observed proportions
  • Default all univariate distributions to gaussian instead of performing a search
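Taken together, the speed preset amounts to a fixed bundle of settings. As a rough sketch (the key names below are illustrative, not SDV's actual parameters), the configuration could be captured as a plain mapping:

```python
# Hypothetical sketch of the speed-preset settings listed above.
# Key names are illustrative only; they are not SDV's actual API.
SPEED_PRESET = {
    "model": "GaussianCopula",
    "rounding": None,                    # do not round numerical data
    "enforce_min_max": True,             # keep numerical data within observed min/max
    "categorical_transformer": "FrequencyEncoder",
    "model_nulls": False,                # sample nulls from observed proportions instead
    "default_distribution": "gaussian",  # skip the per-column distribution search
}

def describe_preset(preset):
    """Return a short human-readable summary of a preset mapping."""
    return ", ".join(f"{key}={value!r}" for key, value in preset.items())
```

Bundling the settings this way keeps the user-facing surface to a single preset name, which is the point of the `optimize_for` parameter below.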

API

The TabularPreset will accept 2 parameters:

  • (required) optimize_for, the name of the preset. We'll add "SPEED"
  • (optional) metadata, a dictionary or TableMetadata object, similar to tabular models today

It will print out hard-coded info about the configuration and benchmarks.

>>> from sdv.lite import TabularPreset

# the PresetModel will not allow too many extra settings, making it simple to use
>>> model = TabularPreset(optimize_for='SPEED', metadata=my_metadata)
Info: This config optimizes the modeling speed above all else.

Your exact runtime is dependent on the data. Benchmarks:
100K rows and 100 columns may take around 1 minute.
1M rows and 250 columns may take around 30 minutes.

# these work like any other tabular model
>>> model.fit(my_data)
>>> synthetic_data = model.sample(num_rows=100)

Edge Cases

If the user does not pass in metadata, we should still run the model, but we should provide a warning:

>>> from sdv.lite import TabularPreset
>>> model = TabularPreset(optimize_for='SPEED')
Warning: No metadata provided. Metadata will be automatically detected from your data. This process may
not be accurate. We recommend writing metadata to ensure correct data handling.

Info: This config optimizes the modeling speed above all else.

Your exact runtime is dependent on the data. Benchmarks:
100K rows and 100 columns may take around 1 minute.
1M rows and 250 columns may take around 30 minutes.

If the user does not pass in the required optimization parameter, throw an error.

>>> from sdv.lite import TabularPreset
>>> model = TabularPreset()
Error: You must provide the name of a preset using the 'optimize_for' parameter. Use 
'TabularPreset.list_available_presets()' to browse through the options.
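Both edge cases above boil down to simple argument validation at construction time. A minimal sketch of that logic (a hypothetical helper, not the actual SDV implementation):

```python
import warnings

# Hypothetical registry of preset names; the issue proposes adding "SPEED".
AVAILABLE_PRESETS = {"SPEED"}

def validate_preset_args(optimize_for, metadata=None):
    """Mirror the edge-case behavior described above:
    error on a missing or unknown preset, warn on missing metadata."""
    if optimize_for is None:
        raise ValueError(
            "You must provide the name of a preset using the 'optimize_for' "
            "parameter. Use 'TabularPreset.list_available_presets()' to "
            "browse through the options."
        )
    if optimize_for not in AVAILABLE_PRESETS:
        raise ValueError(f"Unknown preset: {optimize_for!r}")
    if metadata is None:
        warnings.warn(
            "No metadata provided. Metadata will be automatically detected "
            "from your data. This process may not be accurate. We recommend "
            "writing metadata to ensure correct data handling."
        )
```

Raising for a missing preset while only warning for missing metadata matches the spirit of the examples: the preset name is required, while metadata can fall back to auto-detection.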