
Add boston housing example #27

Open · wants to merge 11 commits into master
6 changes: 6 additions & 0 deletions boston_housing/.gitignore
@@ -0,0 +1,6 @@
project/raw_dataset.txt
project/processed_dataset.csv
mlcube/workspace/data
mlcube/run
mlcube/tasks
mlcube/mlcube.yaml
216 changes: 216 additions & 0 deletions boston_housing/README.md
@@ -0,0 +1,216 @@
# Packing an existing project into MLCube
> Review comment: Great job going through explaining things in detail!


In this tutorial we're going to use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). We'll take an existing implementation, create the needed files to pack it into MLCube and execute all tasks.


## Original project code

At first we have only four files: a requirements file for package dependencies and one script per task (download data, preprocess data, and train).

```
├── project
│   ├── 01_download_dataset.py
│   ├── 02_preprocess_dataset.py
│   ├── 03_train.py
│   └── requirements.txt
```

The most important things to remember about these scripts are their input parameters:

* 01_download_dataset.py

**--data_dir** : Dataset download path; a new file called raw_dataset.txt will be created inside this folder.

* 02_preprocess_dataset.py

**--data_dir** : Folder path containing the raw dataset file; when finished, a new file called processed_dataset.csv will be created in the same folder.

* 03_train.py

**--dataset_file_path** : Processed dataset file path. Note: this is the full path to the csv file.

**--n_estimators** : Number of boosting stages to perform. In this case we're using a gradient boosting regressor.

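For context, 03_train.py trains a scikit-learn gradient boosting regressor on the processed CSV. The actual script ships with the project folder and is not reproduced here; the sketch below is only an illustration of what that step looks like, assuming the column names produced by the preprocessing script (including the PRICE target).

```Python
# Illustrative sketch only -- see the provided project/03_train.py for the real implementation.
import argparse

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train(dataset_file_path, n_estimators):
    # Load the processed dataset and split features/target.
    df = pd.read_csv(dataset_file_path)
    X, y = df.drop(columns=["PRICE"]), df["PRICE"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Gradient boosting regressor with the requested number of boosting stages.
    model = GradientBoostingRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train gradient boosting regressor")
    parser.add_argument("--dataset_file_path", required=True)
    parser.add_argument("--n_estimators", type=int, default=500)
    args = parser.parse_args()
    train(args.dataset_file_path, args.n_estimators)
```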

## MLCube structure

We'll need some extra files for MLCube. First, create a folder called **mlcube** next to the project folder, with the following structure (for this tutorial the files are already in place, but some of them are empty so you can define their content):

```
├── mlcube
│   ├── .mlcube.yaml
│   ├── platforms
│   │   └── docker.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    └── requirements.txt
```

In the following steps we'll describe each file.

## Define tasks execution scripts

In general, we'll have a script for each task. There are different ways to describe their execution from a main handler file; in this tutorial we'll use a function from the Python subprocess module:

* subprocess.Popen()

When a Python script has no input parameters (or maybe just one), we can describe its execution from Python code as follows:

```Python
import subprocess
# Set the full command as variable
command = "python my_task.py --single_parameter input"
# Split the command, this will give us the list:
# ['python', 'my_task.py', '--single_parameter', 'input']
splitted_command = command.split()
# Execute the command as a new process
process = subprocess.Popen(splitted_command, cwd=".")
# Wait for the process to finish
process.wait()
```

### MLCube File: mlcube/workspace/parameters.yaml

When a script has many input parameters, it becomes unwieldy to store the full command in a single variable. In that case we can create a shell script that describes all the arguments and even adds some extra functionality; this is useful because we can define the input parameters as environment variables.

We can use the **mlcube/workspace/parameters.yaml** file to describe all the input parameters we'll use (this file is already provided, please take a look and study its content). The idea is to describe all the parameters in this file and then use this single file as the input for the task. We can then read the content of the parameters file in Python and set all the parameters as environment variables. Finally, with the environment variables set, we can execute a shell script containing our implementation.

The way we execute all these steps in Python is described below.

```Python
import os
import subprocess
import yaml

# parameters_file holds the path to the parameters file (mlcube/workspace/parameters.yaml)
# Read the file and store the parameters in a variable
with open(parameters_file, 'r') as stream:
    parameters = yaml.safe_load(stream)
# Get the system's environment
env = os.environ.copy()
# We can add a single new environment variable as follows
env.update({
    'NEW_ENV_VARIABLE': "my_new_env_variable",
})
# Add all the parameters we got from the parameters file
env.update(parameters)
# Execute the shell script with the updated environment
process = subprocess.Popen("./run_and_time.sh", cwd=".", env=env)
# Wait for the process to finish
process.wait()
```

### Shell script

In this tutorial we already have a shell script containing the steps to run the train task: **project/run_and_time.sh**. Please take a look and study its content.

### MLCube handler Python file

At this point we know how to execute the task scripts from Python code, so now we can create a file that defines how to run each task.

This file is located at **project/mlcube.py** and is the main entrypoint used to run all tasks.

This file is already provided, please take a look and study its content.
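
As a rough orientation (this is an illustrative sketch, not the exact contents of the provided file), such a handler typically parses the task name plus its parameters and dispatches to the corresponding script using the subprocess patterns shown above:

```Python
# Illustrative sketch only -- see the provided project/mlcube.py for the real implementation.
import argparse
import os
import subprocess

import yaml


def run_script(command):
    """Run a task script as a subprocess and wait for it to finish."""
    subprocess.Popen(command.split(), cwd=".").wait()


def run_train(dataset_file_path, parameters_file):
    # Load the parameters file and expose its values as environment variables
    # before calling the shell script that performs training.
    with open(parameters_file, "r") as stream:
        parameters = yaml.safe_load(stream)
    env = os.environ.copy()
    env.update(parameters)
    env.update({"DATASET_FILE_PATH": dataset_file_path})  # variable name is illustrative
    subprocess.Popen("./run_and_time.sh", cwd=".", env=env).wait()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="MLCube task handler (sketch)")
    parser.add_argument("mlcube_task", help="download_data, preprocess_data or train")
    parser.add_argument("--data_dir", default=None)
    parser.add_argument("--dataset_file_path", default=None)
    parser.add_argument("--parameters_file", default=None)
    args = parser.parse_args()

    if args.mlcube_task == "download_data":
        run_script(f"python 01_download_dataset.py --data_dir {args.data_dir}")
    elif args.mlcube_task == "preprocess_data":
        run_script(f"python 02_preprocess_dataset.py --data_dir {args.data_dir}")
    elif args.mlcube_task == "train":
        run_train(args.dataset_file_path, args.parameters_file)
    else:
        raise ValueError(f"Unknown task: {args.mlcube_task}")
```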

## Dockerize the project

We'll create a Dockerfile with the steps needed to run the project; at the end it defines the execution of the **mlcube.py** file as the entrypoint. This file is located at **project/Dockerfile**.

This file is already provided, please take a look and study its content.

When creating the Docker image, we'll need to run the docker build command inside the project folder. The command we'll use is:

`docker build . -t mlcommons/boston_housing:0.0.1 -f Dockerfile`

Keep in mind the image tag we just used; it will be referenced again in the platform configuration file.

At this point our solution folder structure should look like this:

```
├── mlcube
│   ├── .mlcube.yaml
│   ├── platforms
│   │   └── docker.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    ├── Dockerfile
    ├── mlcube.py
    ├── requirements.txt
    └── run_and_time.sh
```


### Define MLCube files

Inside the mlcube folder we'll need to define the following files.

### mlcube/platforms/docker.yaml

This file contains the description of the platform we'll use to run MLCube, in this case Docker. The container definition has the following subfields:

* command: Main command to run, in this case docker.
* run_args: All the arguments used to run the Docker container, e.g. --rm, --gpus, etc.
* image: Image to use; this must be the same image tag used in the docker build command.

This file is already provided, please take a look and study its content.

### MLCube task definition file

The file located in **mlcube/.mlcube.yaml** contains the definition of all the tasks and their parameters.

This file is already provided, please take a look and study its content.

With this file we have finished packing the project into MLCube! Now we can set up the project and run all the tasks.


### Project setup
```bash
# Create Python environment
virtualenv -p python3 ./env && source ./env/bin/activate

# Install MLCube and MLCube docker runner from GitHub repository (normally, users will just run `pip install mlcube mlcube_docker`)
git clone https://github.com/mlcommons/mlcube && cd ./mlcube
cd ./mlcube && python setup.py bdist_wheel && pip install --force-reinstall ./dist/mlcube-* && cd ..
cd ./runners/mlcube_docker && python setup.py bdist_wheel && pip install --force-reinstall --no-deps ./dist/mlcube_docker-* && cd ../../..
python3 -m pip install tornado

# Fetch the boston housing example from GitHub
git clone https://github.com/mlcommons/mlcube_examples && cd ./mlcube_examples
git fetch origin pull/27/head:feature/boston_housing && git checkout feature/boston_housing
cd ./boston_housing/project

# Build MLCube docker image.
docker build . -t mlcommons/boston_housing:0.0.1 -f Dockerfile

# Show tasks implemented in this MLCube.
cd ../mlcube && mlcube describe
```

### Dataset

The [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) will be downloaded and processed. Sizes of the dataset in each step:

| Dataset Step | MLCube Task | Format | Size |
|--------------------------------|-------------------|------------|---------|
| Download (Raw dataset)         | download_data     | txt file   | ~52 KB  |
| Preprocess (Processed dataset) | preprocess_data | csv file | ~40 KB |
| Total | (After all tasks) | All | ~92 KB |

### Tasks execution
```
# Download Boston housing dataset. Default path = /workspace/data
# To override it, use --data_dir=DATA_DIR
mlcube run --task download_data --platform docker

# Preprocess Boston housing dataset, this will convert raw .txt data to .csv format
# It will use the DATA_DIR path defined in the previous step
mlcube run --task preprocess_data --platform docker

# Run training.
# Parameters to override: --dataset_file_path=DATASET_FILE_PATH --parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train --platform docker
```
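
Optionally, once the tasks have finished you can sanity-check the processed output with a few lines of Python. This is just a quick illustration; the path assumes the default data_dir used above ($WORKSPACE/data, i.e. mlcube/workspace/data):

```Python
import pandas as pd

# Path assumes the default workspace location; adjust if you overrode --data_dir.
df = pd.read_csv("mlcube/workspace/data/processed_dataset.csv")
print(df.shape)             # Boston Housing has 506 rows and 14 columns
print(df.columns.tolist())  # CRIM ... LSTAT, PRICE
```
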
29 changes: 29 additions & 0 deletions boston_housing/mlcube/.mlcube.yaml
@@ -0,0 +1,29 @@
name: MLCommons Boston Housing
> Review comment: I don't follow why this is called ".mlcube.yaml"; what's the motivation for the "." prefix?

> Reply from @davidjurado (Contributor, Author), Jul 19, 2021: The ".mlcube.yaml" file corresponds to the packed version of the MLCube definition; it contains the information needed to unpack the full original MLCube structure of folders and files using the command: mlcube describe. This will also generate a file called "mlcube.yaml" (without the initial dot), which is the original MLCube structure definition file.
>
> I'm currently working on the implementation of config 2.0 for this tutorial; when this is done we won't need to extract the original MLCube folder structure, and instead there will be only a single file called "mlcube.yaml" with all the needed information.

author: MLCommons Best Practices Working Group

tasks:
  # Download boston housing dataset
  download_data:
    parameters:
      # Directory where dataset will be saved.
      - {name: data_dir, type: directory, io: output}
    tasks:
      download_data: {data_dir: $WORKSPACE/data}
  # Preprocess dataset
  preprocess_data:
    parameters:
      # Same directory location where dataset was downloaded
      - {name: data_dir, type: directory, io: output}
    tasks:
      preprocess_data: {data_dir: $WORKSPACE/data}
  # Train gradient boosting regressor model
  train:
    parameters:
      # Processed dataset file
      - {name: dataset_file_path, type: file, io: input}
      # Yaml file with training parameters.
      - {name: parameters_file, type: file, io: input}
    tasks:
      train:
        dataset_file_path: $WORKSPACE/data/processed_dataset.csv
        parameters_file: $WORKSPACE/parameters.yaml
15 changes: 15 additions & 0 deletions boston_housing/mlcube/platforms/docker.yaml
@@ -0,0 +1,15 @@
schema_type: mlcube_platform
schema_version: 0.1.0

platform:
  name: "docker"
  version: ">=18.01"

container:
  command: docker
  run_args: >-
    --rm --net=host --uts=host --ipc=host
    --ulimit stack=67108864 --ulimit memlock=-1
    --privileged=true --security-opt seccomp=unconfined
    -v /dev/shm:/dev/shm
  image: mlcommons/boston_housing:0.0.1
1 change: 1 addition & 0 deletions boston_housing/mlcube/workspace/parameters.yaml
@@ -0,0 +1 @@
N_ESTIMATORS: "500"
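
Note that the value is quoted so it is loaded as a string; environment variables must be strings, so keeping YAML values quoted means they can be passed to the subprocess environment without conversion. A quick illustration of the loading behavior:

```Python
import yaml

# Loading the workspace parameters file yields plain string values,
# which is what os.environ / subprocess expect for environment variables.
parameters = yaml.safe_load('N_ESTIMATORS: "500"')
print(parameters)                        # {'N_ESTIMATORS': '500'}
print(type(parameters["N_ESTIMATORS"]))  # <class 'str'>
```
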
34 changes: 34 additions & 0 deletions boston_housing/project/01_download_dataset.py
@@ -0,0 +1,34 @@
"""Download the raw Boston Housing Dataset"""
> Review comment: Love the simple, straightforward files you built here!

import os
import argparse
import requests

DATASET_URL = "http://lib.stat.cmu.edu/datasets/boston"


def download_dataset(data_dir):
    """Download dataset and store it in a given path.
    Args:
        data_dir (str): Dataset download path."""

    request = requests.get(DATASET_URL)
    file_name = "raw_dataset.txt"
    file_path = os.path.join(data_dir, file_name)
    with open(file_path, 'wb') as f:
        f.write(request.content)
    print(f"\nRaw dataset saved at: {file_path}")


def main():

    parser = argparse.ArgumentParser(description='Download dataset')
    parser.add_argument('--data_dir', required=True,
                        help='Dataset download path')
    args = parser.parse_args()

    data_dir = args.data_dir
    download_dataset(data_dir)


if __name__ == '__main__':
    main()
39 changes: 39 additions & 0 deletions boston_housing/project/02_preprocess_dataset.py
@@ -0,0 +1,39 @@
"""Preprocess the dataset and save in CSV format"""
import os
import argparse
import pandas as pd

def process_data(data_dir):
    """Process raw dataset and save it in CSV format.
    Args:
        data_dir (str): Folder path containing dataset."""

    col_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE"]
    raw_file = os.path.join(data_dir, "raw_dataset.txt")
    print(f"\nProcessing raw file: {raw_file}")

    # Each record in the raw file spans two lines; recombine them into 14 columns.
    df = pd.read_csv(raw_file, skiprows=22, header=None, delim_whitespace=True)
    df_even = df[df.index % 2 == 0].reset_index(drop=True)
    df_odd = df[df.index % 2 == 1].iloc[:, :3].reset_index(drop=True)
    df_odd.columns = [11, 12, 13]
    dataset = df_even.join(df_odd)
    dataset.columns = col_names

    output_file = os.path.join(data_dir, "processed_dataset.csv")
    dataset.to_csv(output_file, index=False)
    print(f"Processed dataset saved at: {output_file}")


def main():

    parser = argparse.ArgumentParser(description='Preprocess dataset')
    parser.add_argument('--data_dir', required=True,
                        help='Folder containing dataset file')
    args = parser.parse_args()

    data_dir = args.data_dir
    process_data(data_dir)


if __name__ == '__main__':
    main()