Add boston housing example #27

Open
wants to merge 11 commits into
base: master
from
5 changes: 5 additions & 0 deletions boston_housing/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
project/raw_dataset.txt
project/processed_dataset.csv
mlcube/workspace/data
mlcube/run
mlcube/tasks
226 changes: 226 additions & 0 deletions boston_housing/README.md
@@ -0,0 +1,226 @@
# Packing an existing project into MLCube

Great job going through explaining things in detail!


In this tutorial we're going to use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). We'll take an existing implementation, create the needed files to pack it into MLCube and execute all tasks.

## Original project code

At first we have only 4 files: one for package dependencies and 3 scripts, one for each task (download data, preprocess data and train).

```bash
├── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    └── requirements.txt
```

The most important things to remember about these scripts are their input parameters:

* 01_download_dataset.py

**--data_dir** : Dataset download path, inside this folder path a new file called raw_dataset.txt will be created.

* 02_preprocess_dataset.py

**--data_dir** : Folder path containing raw dataset file, when finished a new file called processed_dataset.csv will be created.

* 03_train.py

**--dataset_file_path** : Processed dataset file path. Note: this is the full path to the csv file.
**--n_estimators** : Number of boosting stages to perform. In this case we're using a gradient boosting regressor.
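
Because the output of each script feeds the next one, it can help to see how the arguments line up. The following is a minimal sketch, assuming the scripts are run from inside the project folder; `build_pipeline_commands` is a hypothetical helper name and is not part of the project.

```python
import os


def build_pipeline_commands(data_dir, n_estimators=100):
    """Return the three commands, in execution order, as argument lists."""
    # 02_preprocess_dataset.py writes processed_dataset.csv into data_dir,
    # and 03_train.py expects the full path to that file.
    processed_file = os.path.join(data_dir, "processed_dataset.csv")
    return [
        ["python", "01_download_dataset.py", "--data_dir", data_dir],
        ["python", "02_preprocess_dataset.py", "--data_dir", data_dir],
        ["python", "03_train.py",
         "--dataset_file_path", processed_file,
         "--n_estimators", str(n_estimators)],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them
    for command in build_pipeline_commands("data", n_estimators=500):
        print(" ".join(command))
```

Note that the first two scripts share the same folder, while the train script needs the path to the csv file itself.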

## MLCube structure

MLCube needs a couple of files. First, create a folder called **mlcube** in the same path as the project folder, then create the following structure (for this tutorial the files are already in place):

```bash
├── mlcube
│   ├── mlcube.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    └── requirements.txt
```

In the following steps we'll describe each file.

## Define tasks execution scripts

In general, we'll have one script per task, and there are different ways to describe their execution from a main handler file. In this tutorial we'll use a function from the Python subprocess module:

* subprocess.Popen()

When we don't have input parameters for a Python script (or maybe just one) we can describe the execution of that script from Python code as follows:

```Python
import subprocess
# Set the full command as variable
command = "python my_task.py --single_parameter input"
# Split the command, this will give us the list:
# ['python', 'my_task.py', '--single_parameter', 'input']
splitted_command = command.split()
# Execute the command as a new process
process = subprocess.Popen(splitted_command, cwd=".")
# Wait for the process to finish
process.wait()
```
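
Note that `command.split()` splits on every space, so it breaks when an argument (for example a path) contains one. A hedged alternative is to build the argument list directly and check the exit code. In this sketch `run_task` is a hypothetical helper, and `sys.executable` is used only so that the example runs anywhere:

```python
import subprocess
import sys


def run_task(args, cwd="."):
    """Run a command given as a list of arguments and return its exit code."""
    process = subprocess.Popen(args, cwd=cwd)
    # wait() blocks until the process ends and returns its exit code
    return process.wait()


if __name__ == "__main__":
    # The path with a space stays a single argument because nothing is split
    exit_code = run_task(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", "my dir/input"]
    )
    if exit_code != 0:
        raise SystemExit(f"Task failed with exit code {exit_code}")
```

Checking the exit code matters because `process.wait()` alone does not raise an error when the task fails.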

### MLCube File: mlcube/workspace/parameters.yaml

When a script has multiple input parameters, it is hard to store the full command in a single variable. In this case we can create a shell script that describes all the arguments and even adds some extra functionality. This is useful because we can define the input parameters as environment variables.

We can use the **mlcube/workspace/parameters.yaml** file to describe all the input parameters we'll use (this file is already provided, please take a look and study its content). The idea is to describe all the parameters in this file and then use this single file as an input for the task. We can then read the content of the parameters file in Python and set all the parameters as environment variables. Finally, with the environment variables set, we can execute a shell script with our implementation.

The way we execute all these steps in Python is described below.

```Python
import os
import subprocess
import yaml
# Path to the parameters file received by the task (example value)
parameters_file = "parameters.yaml"
# Read the file and store the parameters in a variable
with open(parameters_file, 'r') as stream:
    parameters = yaml.safe_load(stream)
# Get the system's environment
env = os.environ.copy()
# We can add a single new environment variable as follows
env.update({
    'NEW_ENV_VARIABLE': "my_new_env_variable",
})
# Add all the parameters we got from the parameters file
env.update(parameters)
# Execute the shell script with the updated environment
process = subprocess.Popen("./run_and_time.sh", cwd=".", env=env)
# Wait for the process to finish
process.wait()
```
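
One detail worth keeping in mind: the values passed through `env` must be strings, which is presumably why `N_ESTIMATORS` is quoted in parameters.yaml. If a parameters file contained unquoted numbers, `yaml.safe_load` would return integers and `subprocess.Popen` would reject them. Below is a small sketch of a defensive conversion; `build_env` is a hypothetical helper name, not something the project provides.

```python
import os


def build_env(parameters, base_env=None):
    """Merge parameters into a copy of the environment, as strings."""
    env = dict(os.environ if base_env is None else base_env)
    for key, value in parameters.items():
        # Environment variable names and values must both be strings
        env[str(key)] = str(value)
    return env


if __name__ == "__main__":
    env = build_env({"N_ESTIMATORS": 500})
    print(env["N_ESTIMATORS"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` in place of the manual `env.update(parameters)` call.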

### Shell script

In this tutorial we already have a shell script containing the steps to run the train task, the file is: **project/run_and_time.sh**, please take a look and study its content.
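
The content of that script is not reproduced here. Conceptually, a script of this kind reads its inputs from environment variables and forwards them to the train script. The sketch below expresses that idea in Python rather than shell. It is an illustration only: `train_command_from_env` is a hypothetical helper and `DATASET_FILE_PATH` is an assumed variable name, while `N_ESTIMATORS` is the key defined in parameters.yaml.

```python
import os


def train_command_from_env(env=None):
    """Build the 03_train.py command from environment variables."""
    env = os.environ if env is None else env
    # DATASET_FILE_PATH is an assumed variable name; N_ESTIMATORS comes
    # from parameters.yaml. A missing variable raises KeyError early.
    return [
        "python", "03_train.py",
        "--dataset_file_path", env["DATASET_FILE_PATH"],
        "--n_estimators", env["N_ESTIMATORS"],
    ]
```

Failing early on a missing variable makes a misconfigured task easier to diagnose than a later argparse error inside the train script.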

### MLCube Command

We are targeting pull-type installation, so MLCube images should be available on Docker Hub. If the image is not available, try this:

```bash
mlcube run ... -Pdocker.build_strategy=auto
```

Parameters defined in mlcube.yaml can be overridden on the command line using the form param=value, for example:

```bash
mlcube run --task=download_data data_dir=absolute_path_to_custom_dir
```

Also, users can override the workspace directory by using:

```bash
mlcube run --task=download_data --workspace=absolute_path_to_custom_dir
```

Note: Overriding the workspace path can fail for some tasks, because the input parameter parameters_file must then be specified explicitly. To solve this, use:

```bash
mlcube run --task=train --workspace=absolute_path_to_custom_dir parameters_file=$(pwd)/workspace/parameters.yaml
```

### MLCube Python entrypoint file

At this point we know how to execute the task scripts from Python code. Now we can create a file that defines how to run each task.

This file will be located in **project/mlcube.py**, this is the main file that will serve as the entrypoint to run all tasks.

This file is already provided, please take a look and study its content.
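
The provided file is not reproduced here, but the general shape of such an entrypoint can be sketched. The example below is an illustration under stated assumptions, not the actual **project/mlcube.py**: it assumes the entrypoint is called with the task name followed by `--name=value` arguments, and the helper names `build_parser` and `command_for` are hypothetical.

```python
import argparse


def build_parser():
    """Create a parser with one sub-command per MLCube task."""
    parser = argparse.ArgumentParser(description="Boston housing tasks")
    tasks = parser.add_subparsers(dest="task", required=True)

    download = tasks.add_parser("download_data")
    download.add_argument("--data_dir", required=True)

    preprocess = tasks.add_parser("preprocess_data")
    preprocess.add_argument("--data_dir", required=True)

    train = tasks.add_parser("train")
    train.add_argument("--dataset_file_path", required=True)
    train.add_argument("--parameters_file", required=True)
    return parser


def command_for(args):
    """Map parsed arguments to the command that implements the task."""
    if args.task == "download_data":
        return ["python", "01_download_dataset.py", "--data_dir", args.data_dir]
    if args.task == "preprocess_data":
        return ["python", "02_preprocess_dataset.py", "--data_dir", args.data_dir]
    # The train task goes through the shell script, with the parameters
    # exported as environment variables, so only the script is returned.
    return ["./run_and_time.sh"]


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(command_for(parsed))
```

Keeping one sub-command per task means the task names in the entrypoint mirror the task names declared in mlcube.yaml.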

## Dockerize the project

We'll create a Dockerfile with the needed steps to run the project, at the end we'll need to define the execution of the **mlcube.py** file as the entrypoint. This file will be located in **project/Dockerfile**.

This file is already provided, please take a look and study its content.

When creating the docker image, we'll need to run the docker build command inside the project folder, the command that we'll use is:

`docker build . -t mlcommons/boston_housing:0.0.1 -f Dockerfile`

Keep in mind the tag that we just described.

At this point our solution folder structure should look like this:

```bash
├── mlcube
│   ├── mlcube.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    ├── Dockerfile
    ├── mlcube.py
    ├── requirements.txt
    └── run_and_time.sh
```

### Define MLCube files

Inside the mlcube folder we'll need to define the following files.

### mlcube/platforms/docker.yaml

This file describes the platform that we'll use to run MLCube, in this case Docker. The container definition has the following subfields:

* command: Main command to run, in this case docker.
* run_args: All the arguments used to run the Docker container, e.g. --rm, --gpus, etc.
* image: Image to use; this must be the same image tag used in the docker build command.

This file is already provided, please take a look and study its content.

### MLCube task definition file

The file located in **mlcube/mlcube.yaml** contains the definition of all the tasks and their parameters.

This file is already provided, please take a look and study its content.

With this file we have finished packing the project into MLCube! Now we can set up the project and run all the tasks.

## Project setup

```bash
# Create Python environment and install MLCube Docker runner
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker

# Fetch the boston housing example from GitHub
git clone https://github.com/mlcommons/mlcube_examples && cd ./mlcube_examples
git fetch origin pull/27/head:feature/boston_housing && git checkout feature/boston_housing
cd ./boston_housing/mlcube
```

### Dataset

The [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) will be downloaded and processed. Sizes of the dataset in each step:

| Dataset Step | MLCube Task | Format | Size |
|--------------------------------|-------------------|------------|---------|
| Download (Compressed dataset)  | download_data     | txt file   | ~52 KB  |
| Preprocess (Processed dataset) | preprocess_data | csv file | ~40 KB |
| Total | (After all tasks) | All | ~92 KB |

### Tasks execution

```bash
# Download Boston housing dataset. Default path = /workspace/data
# To override it, use data_dir=DATA_DIR
mlcube run --task download_data

# Preprocess Boston housing dataset, this will convert raw .txt data to .csv format
# It will use the DATA_DIR path defined in the previous step
mlcube run --task preprocess_data

# Run training.
# Parameters to override: dataset_file_path=DATASET_FILE_PATH parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train
```
32 changes: 32 additions & 0 deletions boston_housing/mlcube/mlcube.yaml
@@ -0,0 +1,32 @@
name: MLCommons Boston Housing
description: MLCommons Boston Housing example
authors:
  - {name: "MLCommons Best Practices Working Group"}

platform:
  accelerator_count: 0

docker:
  # Image name.
  image: mlcommons/boston_housing:0.0.1
  # Docker build context relative to $MLCUBE_ROOT. Default is `build`.
  build_context: "../project"
  # Docker file name within docker build context, default is `Dockerfile`.
  build_file: "Dockerfile"

tasks:
  download_data:
    # Download boston housing dataset
    parameters:
      # Directory where dataset will be saved.
      outputs: {data_dir: data/}
  preprocess_data:
    # Preprocess dataset
    parameters:
      # Same directory location where dataset was downloaded
      inputs: {data_dir: data/}
  train:
    # Train gradient boosting regressor model
    parameters:
      # Processed dataset file
      inputs: {dataset_file_path: data/processed_dataset.csv, parameters_file: parameters.yaml}
1 change: 1 addition & 0 deletions boston_housing/mlcube/workspace/parameters.yaml
@@ -0,0 +1 @@
N_ESTIMATORS: "500"
34 changes: 34 additions & 0 deletions boston_housing/project/01_download_dataset.py
@@ -0,0 +1,34 @@
"""Download the raw Boston Housing Dataset"""

Love the simple straight forward files you built here!

import os
import argparse
import requests

DATASET_URL = "http://lib.stat.cmu.edu/datasets/boston"


def download_dataset(data_dir):
    """Download dataset and store it in a given path.
    Args:
        data_dir (str): Dataset download path."""

    request = requests.get(DATASET_URL)
    file_name = "raw_dataset.txt"
    file_path = os.path.join(data_dir, file_name)
    with open(file_path,'wb') as f:
        f.write(request.content)
    print(f"\nRaw dataset saved at: {file_path}")


def main():

    parser = argparse.ArgumentParser(description='Download dataset')
    parser.add_argument('--data_dir', required=True,
                        help='Dataset download path')
    args = parser.parse_args()

    data_dir = args.data_dir
    download_dataset(data_dir)


if __name__ == '__main__':
    main()
39 changes: 39 additions & 0 deletions boston_housing/project/02_preprocess_dataset.py
@@ -0,0 +1,39 @@
"""Preprocess the dataset and save in CSV format"""
import os
import argparse
import pandas as pd

def process_data(data_dir):
    """Process raw dataset and save it in CSV format.
    Args:
        data_dir (str): Folder path containing dataset."""

    col_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE"]
    raw_file = os.path.join(data_dir, "raw_dataset.txt")
    print(f"\nProcessing raw file: {raw_file}")

    df = pd.read_csv(raw_file, skiprows=22, header=None, delim_whitespace=True)
    df_even=df[df.index%2==0].reset_index(drop=True)
    df_odd=df[df.index%2==1].iloc[: , :3].reset_index(drop=True)
    df_odd.columns = [11,12,13]
    dataset = df_even.join(df_odd)
    dataset.columns = col_names

    output_file = os.path.join(data_dir, "processed_dataset.csv")
    dataset.to_csv(output_file, index=False)
    print(f"Processed dataset saved at: {output_file}")


def main():

    parser = argparse.ArgumentParser(description='Preprocess dataset')
    parser.add_argument('--data_dir', required=True,
                        help='Folder containing dataset file')
    args = parser.parse_args()

    data_dir = args.data_dir
    process_data(data_dir)


if __name__ == '__main__':
    main()
46 changes: 46 additions & 0 deletions boston_housing/project/03_train.py
@@ -0,0 +1,46 @@
"""Train gradient boosting regressor on Boston housing dataset"""
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor


def train(dataset_file_path, n_estimators):
    df = pd.read_csv(dataset_file_path)

    data = df.drop(['PRICE'], axis=1)
    target = df[['PRICE']]
    X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size = 0.25)

    clf = GradientBoostingRegressor(n_estimators=n_estimators, verbose = 1)
    clf.fit(X_train, Y_train.values.ravel())

    train_predicted = clf.predict(X_train)
    train_expected = Y_train
    train_rmse = mean_squared_error(train_predicted, train_expected, squared=False)

    test_predicted = clf.predict(X_test)
    test_expected = Y_test
    test_rmse = mean_squared_error(test_predicted, test_expected, squared=False)

    print(f"\nTRAIN RMSE:\t{train_rmse}")
    print(f"TEST RMSE:\t{test_rmse}")

def main():

    parser = argparse.ArgumentParser(description='Train model')
    parser.add_argument('--dataset_file_path', required=True,
                        help='Processed dataset file path')
    parser.add_argument('--n_estimators', type=int, default=100,
                        help='number of boosting stages to perform')
    args = parser.parse_args()

    dataset_file_path = args.dataset_file_path
    n_estimators = args.n_estimators
    train(dataset_file_path, n_estimators)


if __name__ == '__main__':
    main()