Add boston housing example #27
Open · davidjurado wants to merge 11 commits into mlcommons:master from davidjurado:feature/boston_housing

Commits (11):
- 96dd72a Add boston housing example (davidjurado)
- b11325d Update README.md (davidjurado)
- d744201 Update to config 2.0 (davidjurado)
- 9251d7b Add support for overriding parameters at command line (davidjurado)
- b83e614 Update config file to v2.0 (davidjurado)
- 4210f50 Fix dockerfile (davidjurado)
- 9a6a739 Update Readme (davidjurado)
- cecbc54 Fix Readme (davidjurado)
- 933bfea Bug fix: MLCube example never recognized it was running for the 1st t… (sergey-serebryakov)
- 59bc4a0 Merge branch 'feature/boston_housing' of https://github.com/davidjura… (davidjurado)
- d0f0baa Update MLCube installation command in Readme (davidjurado)

**File: .gitignore** (new file, +5 lines)

```
project/raw_dataset.txt
project/processed_dataset.csv
mlcube/workspace/data
mlcube/run
mlcube/tasks
```
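
These entries keep generated artifacts out of version control: the downloaded raw dataset, the processed CSV, and the directories MLCube creates while running tasks.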

**File: README.md** (new file, +226 lines)

# Packing an existing project into MLCube

In this tutorial we're going to use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). We'll take an existing implementation, create the files needed to pack it into MLCube, and execute all tasks.

## Original project code

At first we have only 4 files: one for package dependencies and 3 scripts, one for each task: download data, preprocess data and train.

```bash
├── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    └── requirements.txt
```

The most important thing that we need to remember about these scripts is their input parameters:

* 01_download_dataset.py

  **--data_dir** : Dataset download path; inside this folder a new file called raw_dataset.txt will be created.

* 02_preprocess_dataset.py

  **--data_dir** : Folder path containing the raw dataset file; when finished, a new file called processed_dataset.csv will be created.

* 03_train.py

  **--dataset_file_path** : Processed dataset file path. Note: this is the full path to the csv file.

  **--n_estimators** : Number of boosting stages to perform. In this case we're using a gradient boosting regressor.
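
For reference, running these scripts by hand might look like this (the data paths are illustrative, and the n_estimators value mirrors the one set later in parameters.yaml):

```bash
# Hypothetical manual run of the three task scripts; paths are illustrative
python 01_download_dataset.py --data_dir ./data
python 02_preprocess_dataset.py --data_dir ./data
python 03_train.py --dataset_file_path ./data/processed_dataset.csv --n_estimators 500
```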

## MLCube structure

We'll need a couple of files for MLCube. First we'll create a folder called **mlcube** at the same level as the project folder, with the following structure (for this tutorial the files are already in place):

```bash
├── mlcube
│   ├── mlcube.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    └── requirements.txt
```

In the following steps we'll describe each file.

## Define task execution scripts

In general we'll have a script for each task, and there are different ways to describe their execution from a main handler file. In this tutorial we'll use a function from the Python subprocess module:

* subprocess.Popen()

When a Python script takes no input parameters (or maybe just one) we can describe its execution from Python code as follows:

```Python
import subprocess

# Store the full command in a single variable
command = "python my_task.py --single_parameter input"
# Split the command; this gives us the list:
# ['python', 'my_task.py', '--single_parameter', 'input']
split_command = command.split()
# Execute the command as a new process
process = subprocess.Popen(split_command, cwd=".")
# Wait for the process to finish
process.wait()
```

### MLCube File: mlcube/workspace/parameters.yaml

When a script has multiple input parameters, it's hard to store the full command for executing it in a single variable. In this case we can create a shell script describing all the arguments and even add some extra functionality. This will be useful since we can define the input parameters as environment variables.

We can use the **mlcube/workspace/parameters.yaml** file to describe all the input parameters we'll use (this file is already provided, please take a look and study its content). The idea is to describe all the parameters in this file and then use this single file as an input for the task. We can then read the content of the parameters file in Python and set all the parameters as environment variables. Finally, with the environment variables set, we can execute a shell script with our implementation.
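
In this example the parameters file holds a single training parameter; its full content (shown further down in this PR) is:

```yaml
N_ESTIMATORS: "500"
```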

The way we execute all these steps in Python is described below.

```Python
import os
import subprocess

import yaml

# Path to the parameters file described above
parameters_file = "mlcube/workspace/parameters.yaml"
# Read the file and store the parameters in a variable
with open(parameters_file, 'r') as stream:
    parameters = yaml.safe_load(stream)
# Get a copy of the system's environment
env = os.environ.copy()
# We can add a single new environment variable as follows
env.update({
    'NEW_ENV_VARIABLE': "my_new_env_variable",
})
# Add all the parameters we got from the parameters file
env.update(parameters)
# Execute the shell script with the updated environment
process = subprocess.Popen("./run_and_time.sh", cwd=".", env=env)
# Wait for the process to finish
process.wait()
```

### Shell script

In this tutorial we already have a shell script containing the steps to run the train task: **project/run_and_time.sh**. Please take a look and study its content.
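
As a rough sketch only (the real script ships with this PR), such a wrapper typically reads the environment variables set from parameters.yaml and times the run; the DATASET_FILE_PATH variable name below is purely illustrative:

```bash
#!/bin/bash
# Hypothetical sketch of a wrapper like project/run_and_time.sh; the real file
# ships with this PR. Assumes N_ESTIMATORS comes from parameters.yaml and that
# the caller exports DATASET_FILE_PATH (an illustrative name).
set -e
start=$(date +%s)
python 03_train.py \
    --dataset_file_path "$DATASET_FILE_PATH" \
    --n_estimators "$N_ESTIMATORS"
end=$(date +%s)
echo "Training time: $((end - start)) seconds"
```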

### MLCube Command

We are targeting pull-type installation, so MLCube images should be available on Docker Hub. If not, try this:

```bash
mlcube run ... -Pdocker.build_strategy=auto
```

Parameters defined in mlcube.yaml can be overridden using `param=input`, for example:

```bash
mlcube run --task=download_data data_dir=absolute_path_to_custom_dir
```

Users can also override the workspace directory by using:

```bash
mlcube run --task=download_data --workspace=absolute_path_to_custom_dir
```

Note: sometimes overriding the workspace path can fail for a task because the input parameter parameters_file must be specified. To solve this, use:

```bash
mlcube run --task=train --workspace=absolute_path_to_custom_dir parameters_file=$(pwd)/workspace/parameters.yaml
```

### MLCube Python entrypoint file

At this point we know how to execute the task scripts from Python code; now we can create a file that defines how to run each task.

This file will be located at **project/mlcube.py**; it is the main file that serves as the entrypoint to run all tasks.

This file is already provided, please take a look and study its content.
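
As a condensed sketch of the shape such an entrypoint can take (the project/mlcube.py shipped with this PR is the authoritative version; the exact structure below is an assumption):

```Python
"""Hypothetical sketch of an entrypoint like project/mlcube.py."""
import argparse
import os
import subprocess

import yaml


def run_and_wait(command):
    # Execute a task script as a child process and block until it finishes
    process = subprocess.Popen(command.split(), cwd=".")
    process.wait()


def main():
    parser = argparse.ArgumentParser(description="MLCube entrypoint")
    parser.add_argument("mlcube_task", help="download_data, preprocess_data or train")
    parser.add_argument("--data_dir", default=None)
    parser.add_argument("--dataset_file_path", default=None)
    parser.add_argument("--parameters_file", default=None)
    args = parser.parse_args()

    if args.mlcube_task == "download_data":
        run_and_wait(f"python 01_download_dataset.py --data_dir {args.data_dir}")
    elif args.mlcube_task == "preprocess_data":
        run_and_wait(f"python 02_preprocess_dataset.py --data_dir {args.data_dir}")
    elif args.mlcube_task == "train":
        # Read the parameters file and run the shell wrapper with the
        # parameters exported as environment variables, as shown earlier
        with open(args.parameters_file, "r") as stream:
            parameters = yaml.safe_load(stream)
        env = os.environ.copy()
        env.update(parameters)
        # Illustrative variable name consumed by the wrapper script
        env["DATASET_FILE_PATH"] = args.dataset_file_path
        process = subprocess.Popen("./run_and_time.sh", cwd=".", env=env)
        process.wait()
    else:
        raise ValueError(f"Unknown task: {args.mlcube_task}")


if __name__ == "__main__":
    main()
```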

## Dockerize the project

We'll create a Dockerfile with the steps needed to run the project; at the end it defines the execution of the **mlcube.py** file as the entrypoint. This file will be located at **project/Dockerfile**.

This file is already provided, please take a look and study its content.
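
As a rough sketch only (the real project/Dockerfile ships with this PR; the base image and layout below are assumptions):

```Dockerfile
# Hypothetical sketch of a Dockerfile like project/Dockerfile
FROM python:3.8-slim

WORKDIR /workspace

# Install the Python dependencies first so Docker can cache this layer
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Copy the task scripts, the shell wrapper and the entrypoint
COPY . .

# MLCube talks to the container through the mlcube.py entrypoint
ENTRYPOINT ["python", "/workspace/mlcube.py"]
```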

When creating the docker image we'll need to run the docker build command inside the project folder. The command that we'll use is:

`docker build . -t mlcommons/boston_housing:0.0.1 -f Dockerfile`

Keep in mind the tag that we just described.

At this point our solution folder structure should look like this:

```bash
├── mlcube
│   ├── mlcube.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    ├── Dockerfile
    ├── mlcube.py
    ├── requirements.txt
    └── run_and_time.sh
```

### Define MLCube files

Inside the mlcube folder we'll need to define the following files.

### mlcube/platforms/docker.yaml

This file contains the description of the platform that we'll use to run MLCube, in this case Docker. The container definition has the following subfields:

* command: Main command to run, in this case docker.
* run_args: All the arguments for running the docker container, e.g. --rm, --gpus, etc.
* image: Image to use; this must match the image tag from the docker build command.

This file is already provided, please take a look and study its content.
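
A minimal sketch with only those three subfields, assuming the layout described above (the real file ships with this PR and is the authoritative version):

```yaml
# Hypothetical sketch of mlcube/platforms/docker.yaml; field names follow the
# description above, the real file ships with this PR
container:
  command: docker
  run_args: --rm                          # illustrative runtime arguments
  image: mlcommons/boston_housing:0.0.1   # must match the tag from `docker build`
```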

### MLCube task definition file

The file located at **mlcube/mlcube.yaml** contains the definition of all the tasks and their parameters.

This file is already provided, please take a look and study its content.
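
For reference, the tasks section of the mlcube/mlcube.yaml included in this PR looks like this:

```yaml
tasks:
  download_data:
    # Download boston housing dataset
    parameters:
      outputs: {data_dir: data/}
  preprocess_data:
    # Preprocess dataset
    parameters:
      inputs: {data_dir: data/}
  train:
    # Train gradient boosting regressor model
    parameters:
      inputs: {dataset_file_path: data/processed_dataset.csv, parameters_file: parameters.yaml}
```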

With this file we have finished packing the project into MLCube! Now we can set up the project and run all the tasks.

## Project setup

```bash
# Create Python environment and install MLCube Docker runner
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker

# Fetch the boston housing example from GitHub
git clone https://github.com/mlcommons/mlcube_examples && cd ./mlcube_examples
git fetch origin pull/27/head:feature/boston_housing && git checkout feature/boston_housing
cd ./boston_housing/mlcube
```

### Dataset

The [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) will be downloaded and processed. Dataset sizes at each step:

| Dataset Step                   | MLCube Task       | Format   | Size   |
|--------------------------------|-------------------|----------|--------|
| Download (raw dataset)         | download_data     | txt file | ~52 KB |
| Preprocess (processed dataset) | preprocess_data   | csv file | ~40 KB |
| Total                          | (After all tasks) | All      | ~92 KB |

### Tasks execution

```bash
# Download Boston housing dataset. Default path = /workspace/data
# To override it, use data_dir=DATA_DIR
mlcube run --task download_data

# Preprocess Boston housing dataset; this converts the raw .txt data to .csv format
# It will use the DATA_DIR path defined in the previous step
mlcube run --task preprocess_data

# Run training
# Parameters to override: dataset_file_path=DATASET_FILE_PATH parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train
```

**File: mlcube/mlcube.yaml** (new file, +32 lines)

```yaml
name: MLCommons Boston Housing
description: MLCommons Boston Housing example
authors:
  - {name: "MLCommons Best Practices Working Group"}

platform:
  accelerator_count: 0

docker:
  # Image name.
  image: mlcommons/boston_housing:0.0.1
  # Docker build context relative to $MLCUBE_ROOT. Default is `build`.
  build_context: "../project"
  # Docker file name within docker build context, default is `Dockerfile`.
  build_file: "Dockerfile"

tasks:
  download_data:
    # Download boston housing dataset
    parameters:
      # Directory where dataset will be saved.
      outputs: {data_dir: data/}
  preprocess_data:
    # Preprocess dataset
    parameters:
      # Same directory location where dataset was downloaded
      inputs: {data_dir: data/}
  train:
    # Train gradient boosting regressor model
    parameters:
      # Processed dataset file
      inputs: {dataset_file_path: data/processed_dataset.csv, parameters_file: parameters.yaml}
```
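
Relative parameter paths such as data/ and parameters.yaml resolve inside the MLCube workspace directory, which is how the train task finds the processed_dataset.csv produced by the earlier tasks.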

**File: mlcube/workspace/parameters.yaml** (new file, +1 line)

```yaml
N_ESTIMATORS: "500"
```

**File: project/01_download_dataset.py** (new file, +34 lines)
"""Download the raw Boston Housing Dataset""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Love the simple straight forward files you built here! |
||
import os | ||
import argparse | ||
import requests | ||
|
||
DATASET_URL = "http://lib.stat.cmu.edu/datasets/boston" | ||
|
||
|
||
def download_dataset(data_dir): | ||
"""Download dataset and store it in a given path. | ||
Args: | ||
data_dir (str): Dataset download path.""" | ||
|
||
request = requests.get(DATASET_URL) | ||
file_name = "raw_dataset.txt" | ||
file_path = os.path.join(data_dir, file_name) | ||
with open(file_path,'wb') as f: | ||
f.write(request.content) | ||
print(f"\nRaw dataset saved at: {file_path}") | ||
|
||
|
||
def main(): | ||
|
||
parser = argparse.ArgumentParser(description='Download dataset') | ||
parser.add_argument('--data_dir', required=True, | ||
help='Dataset download path') | ||
args = parser.parse_args() | ||
|
||
data_dir = args.data_dir | ||
download_dataset(data_dir) | ||
|
||
|
||
if __name__ == '__main__': | ||
main() |

**File: project/02_preprocess_dataset.py** (new file, +39 lines)
"""Preprocess the dataset and save in CSV format""" | ||
import os | ||
import argparse | ||
import pandas as pd | ||
|
||
def process_data(data_dir): | ||
"""Process raw dataset and save it in CSV format. | ||
Args: | ||
data_dir (str): Folder path containing dataset.""" | ||
|
||
col_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE"] | ||
raw_file = os.path.join(data_dir, "raw_dataset.txt") | ||
print(f"\nProcessing raw file: {raw_file}") | ||
|
||
df = pd.read_csv(raw_file, skiprows=22, header=None, delim_whitespace=True) | ||
df_even=df[df.index%2==0].reset_index(drop=True) | ||
df_odd=df[df.index%2==1].iloc[: , :3].reset_index(drop=True) | ||
df_odd.columns = [11,12,13] | ||
dataset = df_even.join(df_odd) | ||
dataset.columns = col_names | ||
|
||
output_file = os.path.join(data_dir, "processed_dataset.csv") | ||
dataset.to_csv(output_file, index=False) | ||
print(f"Processed dataset saved at: {output_file}") | ||
|
||
|
||
def main(): | ||
|
||
parser = argparse.ArgumentParser(description='Preprocess dataset') | ||
parser.add_argument('--data_dir', required=True, | ||
help='Folder containing dataset file') | ||
args = parser.parse_args() | ||
|
||
data_dir = args.data_dir | ||
process_data(data_dir) | ||
|
||
|
||
if __name__ == '__main__': | ||
main() |

**File: project/03_train.py** (new file, +46 lines)
"""Train gradient boosting regressor on Boston housing dataset""" | ||
import os | ||
import argparse | ||
import pandas as pd | ||
from sklearn.model_selection import train_test_split | ||
from sklearn.metrics import mean_squared_error | ||
from sklearn.ensemble import GradientBoostingRegressor | ||
|
||
|
||
def train(dataset_file_path, n_estimators): | ||
df = pd.read_csv(dataset_file_path) | ||
|
||
data = df.drop(['PRICE'], axis=1) | ||
target = df[['PRICE']] | ||
X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size = 0.25) | ||
|
||
clf = GradientBoostingRegressor(n_estimators=n_estimators, verbose = 1) | ||
clf.fit(X_train, Y_train.values.ravel()) | ||
|
||
train_predicted = clf.predict(X_train) | ||
train_expected = Y_train | ||
train_rmse = mean_squared_error(train_predicted, train_expected, squared=False) | ||
|
||
test_predicted = clf.predict(X_test) | ||
test_expected = Y_test | ||
test_rmse = mean_squared_error(test_predicted, test_expected, squared=False) | ||
|
||
print(f"\nTRAIN RMSE:\t{train_rmse}") | ||
print(f"TEST RMSE:\t{test_rmse}") | ||
|
||
def main(): | ||
|
||
parser = argparse.ArgumentParser(description='Train model') | ||
parser.add_argument('--dataset_file_path', required=True, | ||
help='Processed dataset file path') | ||
parser.add_argument('--n_estimators', type=int, default=100, | ||
help='number of boosting stages to perform') | ||
args = parser.parse_args() | ||
|
||
dataset_file_path = args.dataset_file_path | ||
n_estimators = args.n_estimators | ||
train(dataset_file_path, n_estimators) | ||
|
||
|
||
if __name__ == '__main__': | ||
main() |

> **Review comment:** Great job going through and explaining things in detail!