Autowig #151

Open
wants to merge 14 commits into
base: master
from
131 changes: 56 additions & 75 deletions README.md
@@ -1,32 +1,28 @@
DataWig - Imputation for Tables
================================

[![PyPI version](https://badge.fury.io/py/datawig.svg)](https://badge.fury.io/py/datawig.svg)
[![GitHub license](https://img.shields.io/github/license/awslabs/datawig.svg)](https://github.com/awslabs/datawig/blob/master/LICENSE)
[![GitHub issues](https://img.shields.io/github/issues/awslabs/datawig.svg)](https://github.com/awslabs/datawig/issues)
[![Build Status](https://travis-ci.org/awslabs/datawig.svg?branch=master)](https://travis-ci.org/awslabs/datawig)

DataWig learns Machine Learning models to impute missing values in tables.

See our user-guide and extended documentation [here](https://datawig.readthedocs.io/en/latest).
The latest version of DataWig is built around the [tabular prediction API of AutoGluon](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html).

This change will lead to better imputation models and faster training, but not all of the original DataWig API has been migrated yet.

## Installation

### CPU
```bash
pip3 install datawig
Clone the repository from git and set up virtualenv in the root dir of the package:

```
python3 -m venv venv
```

### GPU
If you want to run DataWig on a GPU you need to make sure your version of Apache MXNet Incubating contains the GPU bindings.
Depending on your version of CUDA, you can do this by running the following:
Install the package from local sources:

```bash
wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt
```
where `${CUDA_VERSION}` can be `75` (7.5), `80` (8.0), `90` (9.0), or `91` (9.1).
./venv/bin/pip install -e .
```

## Running DataWig
The DataWig API expects your data as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). Here is an example of how the dataframe might look:
@@ -37,124 +33,109 @@ The DataWig API expects your data as a [pandas DataFrame](https://pandas.pydata.
| SDCards | Best SDCard ever ... | 8GB | Blue |
| Dress | This **yellow** dress | M | **?** |

### Quickstart Example
DataWig lets you impute missing values in two ways:
* A `.complete` functionality inspired by [`fancyimpute`](https://github.com/iskandr/fancyimpute)
* A `sklearn`-like API with `.fit` and `.predict` methods

## Quickstart Example

For most use cases, the `SimpleImputer` class is the best starting point. For convenience there is the function [SimpleImputer.complete](https://datawig.readthedocs.io/en/latest/source/API.html#datawig.simple_imputer.SimpleImputer.complete) that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:
Here are some examples of the DataWig API, also available as a [notebook](datawig-examples.ipynb).

### Using `AutoGluonImputer.complete`

```python
import datawig, numpy

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()
df = datawig.utils.generate_df_numeric()
# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
df_with_missing_imputed = datawig.AutoGluonImputer.complete(df_with_missing)

```
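
The masking step in the example above can be checked without DataWig. The following is a minimal sketch, assuming only `numpy` and `pandas`; the column names `x` and `y`, the seed, and the sample size are illustrative and are not taken from `generate_df_numeric`. It shows that comparing uniform random draws against `0.9` marks roughly 10% of the cells as missing:

```python
import numpy as np
import pandas as pd

# Illustrative data; the shape and column names are assumptions for this sketch
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["x", "y"])

# Uniform draws in [0, 1) exceed 0.9 about one time in ten,
# so the boolean mask is True for roughly 10% of the cells
mask = rng.random(df.shape) > 0.9

# DataFrame.mask replaces every cell where the mask is True with NaN
df_with_missing = df.mask(mask)

# Share of cells that are now missing
missing_fraction = df_with_missing.isna().to_numpy().mean()
print(f"fraction of missing cells: {missing_fraction:.3f}")
```

Because the mask is random, the share of missing cells varies slightly around 10% between runs unless a seed is fixed.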

You can also impute values in specific columns only (called `output_column` below) using values in other columns (called `input_columns` below). DataWig currently supports imputation of categorical columns and numeric columns.
### Using `AutoGluonImputer.fit` and `.predict`

This usage is very similar to the underlying [tabular prediction API of AutoGluon](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html), but we added some convenience functionality such as precision filtering for categorical imputations.
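
To illustrate what precision filtering for categorical imputations could look like, here is a minimal sketch in plain Python. It is an assumption about the general idea, not DataWig's actual implementation: the function names `per_class_precision` and `filter_by_precision`, the threshold value, and the sample labels are all hypothetical. The sketch measures the precision of each predicted class on validation data and keeps a new prediction only if its class reached the threshold:

```python
from collections import defaultdict


def per_class_precision(y_true, y_pred):
    """Return the precision of each predicted class on validation data."""
    correct = defaultdict(int)
    predicted = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            correct[pred] += 1
    return {label: correct[label] / predicted[label] for label in predicted}


def filter_by_precision(predictions, precision, threshold):
    """Keep a prediction if its class is precise enough, else leave it missing (None)."""
    return [p if precision.get(p, 0.0) >= threshold else None for p in predictions]


# Hypothetical validation labels and the model's predictions for them
val_true = ["red", "red", "red", "blue", "blue", "green"]
val_pred = ["red", "red", "red", "blue", "red", "blue"]
precision = per_class_precision(val_true, val_pred)

# Hypothetical predictions for rows with missing values
new_predictions = ["red", "blue", "red"]
filtered = filter_by_precision(new_predictions, precision, threshold=0.7)
print(precision)
print(filtered)
```

In this sketch, a value is left missing rather than imputed when the model is not reliable enough for the predicted class, which trades coverage for correctness.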

### Imputation of categorical columns
You can also impute values in specific columns only (called `output_column` below) using values in other columns (called `input_columns` below). DataWig currently supports imputation of categorical columns and numeric columns. Type inference is based on [``pandas.api.types.is_numeric_dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.is_numeric_dtype.html).
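
As a rough illustration of this kind of type inference, the sketch below applies `pandas.api.types.is_numeric_dtype` to each column of a small DataFrame. The column names and the labels `"numeric"` and `"categorical"` are assumptions made for the example; how DataWig uses the result internally is not shown here:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative table; column names and values are made up for this sketch
df = pd.DataFrame({
    "title": ["SDCards", "Dress"],
    "size_gb": [8, 16],
    "price": [9.99, None],  # a missing value keeps the column's float dtype
})

# Columns with a numeric dtype are labeled numeric, all others categorical
column_kinds = {
    name: "numeric" if is_numeric_dtype(df[name]) else "categorical"
    for name in df.columns
}
print(column_kinds)
```

A column of numbers stored as strings would count as categorical under this check, so converting such columns first (for example with `pandas.to_numeric`) may be necessary.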

#### Imputation of categorical columns

Let's first generate some random strings hidden in longer random strings:

```python
import datawig

df = datawig.utils.generate_df_string( num_samples=200,
data_column_name='sentences',
df = datawig.utils.generate_df_string( num_samples=200,
data_column_name='sentences',
label_column_name='label')
df.head(n=2)
```

The generated data will look like this:

|sentences |label|
|---------|-------|
| wILsn T366D r1Psz KAnDn 8RfUf GuuRU |8RfUf|
| 8RfUf jBq5U BqVnh pnXfL GuuRU XYnSP |8RfUf|

Now let's split the rows into training and test data and train an imputation model:

```python
df_train, df_test = datawig.utils.random_split(df)

# Initialize an imputer model
imputer = datawig.SimpleImputer(
imputer = datawig.AutoGluonImputer(
input_columns=['sentences'], # column(s) containing information about the column we want to impute
output_column='label', # the column we'd like to impute values for
output_path = 'imputer_model' # stores model data and metrics
output_column='label' # the column we'd like to impute values for
)

#Fit an imputer model on the train data
imputer.fit(train_df=df_train)
imputer.fit(train_df=df_train, time_limit=100)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```

### Imputation of numerical columns
#### Imputation of numerical columns

Imputation of numerical values works the same way as for categorical values.

Let's first generate some numeric values with a quadratic dependency:

```python
import datawig

df = datawig.utils.generate_df_numeric( num_samples=200,
data_column_name='x',
label_column_name='y')
df = datawig.utils.generate_df_numeric( num_samples=200,
data_column_name='x',
label_column_name='y')

df_train, df_test = datawig.utils.random_split(df)

# Initialize an imputer model
imputer = datawig.SimpleImputer(
imputer = datawig.AutoGluonImputer(
input_columns=['x'], # column(s) containing information about the column we want to impute
output_column='y', # the column we'd like to impute values for
output_path = 'imputer_model' # stores model data and metrics
)

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)
imputer.fit(train_df=df_train, time_limit=100)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

```

For more control over the types of models and preprocessing steps, the `Imputer` class allows specifying all relevant model features and parameters directly.

For details on usage, refer to the provided [examples](./examples).

### Acknowledgments
Thanks to [David Greenberg](https://github.com/dgreenberg) for the package name.

### Building documentation

```bash
git clone git@github.com:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html
```


### Executing Tests

Clone the repository from git and set up virtualenv in the root dir of the package:

```
python3 -m venv venv
```

Install the package from local sources:

```
./venv/bin/pip install -e .
```

Run tests:

```
./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest
```


### Updating PyPI distribution

Before updating, increment the version in `setup.py`.

```
git clone git@github.com:awslabs/datawig.git
cd datawig
# build local distribution for current version
python setup.py sdist
# upload to PyPI
twine upload --skip-existing dist/*
```
