Update instructions to always submit model definition files as zip #130

Open · wants to merge 3 commits into `master`
69 changes: 41 additions & 28 deletions README.md
@@ -26,30 +26,28 @@ To know more about the architectural details, please read the [design document](

* `S3 CLI`: The [command-line interface](https://aws.amazon.com/cli/) to configure your Object Storage

* An existing Kubernetes cluster (e.g., [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) for local testing).
* An existing Kubernetes cluster (e.g., [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) for local testing or follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md)).
<!-- For Minikube, use the command `make minikube` to start Minikube and set up local network routes. Minikube **v0.25.1** is tested with Travis CI. -->

* Follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md)

* The minimum capacity requirement for FfDL is 4GB Memory and 3 CPUs.

## Usage Scenarios

* If you have a FfDL deployment up and running, you can jump to [FfDL User Guide](docs/user-guide.md) to use FfDL for training your deep learning models.
* **[FfDL User Guide](docs/user-guide.md)**: If you have a FfDL deployment up and running, you can jump to [FfDL User Guide](docs/user-guide.md) to use FfDL for training your deep learning models.

* If you have FfDL confiugured to use GPUs, and want to train using GPUs, follow steps [here](docs/gpu-guide.md)
* **[Training with GPUs](docs/gpu-guide.md)**: If you have FfDL configured to use GPUs and want to train using them, follow the steps [here](docs/gpu-guide.md)

* If you have used FfDL to train your models, and want to use a GPU enabled public cloud hosted service for further training and serving, please follow instructions [here](etc/converter/ffdl-wml.md) to train and serve your models using [Watson Studio Deep Learning](https://www.ibm.com/cloud/deep-learning) service.
* **[FfDL to WML conversion](etc/converter/ffdl-wml.md)**: If you have used FfDL to train your models, and want to use a GPU-enabled, public-cloud-hosted service for further training and serving, please follow the instructions [here](etc/converter/ffdl-wml.md) to train and serve your models using the [Watson Studio Deep Learning](https://www.ibm.com/cloud/deep-learning) service.

* If you are getting started and want to setup your own FfDL deployment, please follow the steps [below](#1-quick-start).
* **[FfDL QuickStart](#1-quick-start)**: If you are getting started and want to set up your own FfDL deployment, please follow the steps [below](#1-quick-start).

* If you want to leverage Jupyter notebooks to launch training on your FfDL cluster, please follow [these instructions](etc/notebooks/art)
* **[Leveraging Jupyter Notebooks](etc/notebooks/art)**: If you want to leverage Jupyter notebooks to launch training on your FfDL cluster, please follow [these instructions](etc/notebooks/art)

* To invoke [Adversarial Robustness Toolbox](https://github.com/IBM/adversarial-robustness-toolbox) to find vulnerabilities in your models, follow the [instructions here](etc/notebooks/art)
* **[Adversarial Robustness Toolbox Integration](etc/notebooks/art)**: To invoke [Adversarial Robustness Toolbox](https://github.com/IBM/adversarial-robustness-toolbox) to find vulnerabilities in your models, follow the [instructions here](etc/notebooks/art)

* To deploy your trained models, follow [the integration guide with Seldon](community/FfDL-Seldon)
* **[Model Deployment with Seldon](community/FfDL-Seldon)**: To deploy your trained models, follow [the integration guide with Seldon](community/FfDL-Seldon)

* If you are looking for related collateral, slides, webinars, blogs and other materials related to FfDL, please [find them here](demos)
* **[Demos and Publications](demos)**: If you are looking for related collateral, slides, webinars, blogs and other materials related to FfDL, please [find them here](demos)

## Steps

@@ -268,8 +266,6 @@ export AWS_ACCESS_KEY_ID=test; export AWS_SECRET_ACCESS_KEY=test; export AWS_DEF
s3cmd="aws --endpoint-url=$s3_url s3"
$s3cmd mb s3://tf_training_data
$s3cmd mb s3://tf_trained_model
$s3cmd mb s3://mnist_lmdb_data
$s3cmd mb s3://dlaas-trained-models
```
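
If you want to confirm the buckets were created before moving on, a quick sanity check (reusing the `$s3cmd` alias defined above) is:
```shell
# List all buckets on the local object storage endpoint; the buckets
# created above should appear in the output.
$s3cmd ls
```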

3. Now, create a temporary directory, download the necessary images for training and labeling our TensorFlow model, and upload those images
@@ -293,41 +289,44 @@ restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
```
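
Before going further, you can sanity-check that the REST API node port is reachable (a minimal probe; the exact response body depends on the API version, so only the fact that an HTTP status comes back matters here):
```shell
# Print just the HTTP status code returned by the FfDL REST API endpoint.
curl -s -o /dev/null -w "%{http_code}\n" "$DLAAS_URL"
```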

Replace the default object storage path with your s3_url. You can skip this step if your already modified the object storage path with your s3_url.
Create a temporary manifest file and replace the default object storage path with your s3_url. You can skip this step if you already modified the object storage path with your s3_url.
```shell
cp etc/examples/tf-model/manifest.yml etc/examples/tf-model/manifest-temp.yml
if [ "$(uname)" = "Darwin" ]; then
  sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest-temp.yml
else
  sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest-temp.yml
fi
```
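
To verify that the substitution took effect, you can grep the temporary manifest for the rewritten endpoint (assuming `$node_ip` and `$s3_port` are still set from the earlier steps):
```shell
# Each object storage endpoint line should now reference $node_ip:$s3_port
# instead of s3.default.svc.cluster.local.
grep "$node_ip" etc/examples/tf-model/manifest-temp.yml
```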

Now, put all your model definition files into a zip file.
```shell
# Replace tf-model with the model you want to zip
pushd etc/examples/tf-model && zip ../tf-model.zip * && popd
```
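
Since FfDL expects all model definition files at the first level of the archive (see the [user guide](docs/user-guide.md#2-create-new-models-with-ffdl)), it is worth listing the archive contents before submitting:
```shell
# Every entry should be a bare filename; paths like tf-model/model.py
# indicate the zip was created one directory too high.
unzip -l etc/examples/tf-model.zip
```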

Define the `CLI_CMD` variable pointing at the FfDL command-line interface for your platform and run the training job with our default TensorFlow model
```shell
CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
$CLI_CMD train etc/examples/tf-model/manifest-temp.yml etc/examples/tf-model.zip
```

Congratulations, you have submitted your first job on FfDL. You can check your job status either from the FfDL UI or simply run `$CLI_CMD list`

> You can learn how to create your own model definition files and `manifest.yml` in the [user guide](docs/user-guide.md#2-create-new-models-with-ffdl).

5. If you want to run your job via the FfDL UI, simply run the below command to create your model zip file.

```shell
# Replace tf-model with the model you want to zip
pushd etc/examples/tf-model && zip ../tf-model.zip * && popd
```
5. If you want to run your job via the FfDL UI, simply upload `tf-model.zip` and `manifest.yml` (the default TensorFlow model) from the `etc/examples/` directory as shown below.

Then, upload `tf-model.zip` and `manifest.yml` (The default TensorFlow model) in the `etc/examples/` repository as shown below.
Then, click `Submit Training Job` to run your job.

![ui-example](docs/images/ui-example.png)

6. (Optional) Since it's simple and straightforward to submit jobs with different deep learning framework on FfDL, let's try to run a Caffe Job. Download all the necessary training and testing images in [LMDB format](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) for our Caffe model
and upload those images to your mnist_lmdb_data bucket.
and upload those images to your mnist_lmdb_data bucket. Then replace the object storage endpoint and package the model definition files into a zip file.

```shell
$s3cmd mb s3://mnist_lmdb_data
$s3cmd mb s3://dlaas-trained-models
for phase in train test;
do
for file in data.mdb lock.mdb;
@@ -337,12 +336,19 @@ do
$s3cmd cp $tmpfile s3://mnist_lmdb_data/$phase/$file
done
done

cp etc/examples/caffe-model/manifest.yml etc/examples/caffe-model/manifest-temp.yml
if [ "$(uname)" = "Darwin" ]; then
  sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/caffe-model/manifest-temp.yml
else
  sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/caffe-model/manifest-temp.yml
fi

# Replace caffe-model with the model you want to zip
pushd etc/examples/caffe-model && zip ../caffe-model.zip * && popd
```

7. Now train your Caffe Job.
Now train your Caffe Job.

```shell
$CLI_CMD train etc/examples/caffe-model/manifest.yml etc/examples/caffe-model
$CLI_CMD train etc/examples/caffe-model/manifest-temp.yml etc/examples/caffe-model.zip
```

Congratulations, now you know how to deploy jobs with different deep learning frameworks. To learn more about your job execution results,
@@ -406,16 +412,23 @@ fi
```

6. Now you should have all the necessary training data set in your training data bucket. Let's go ahead and set up your restapi endpoint
and default credentials for Deep Learning as a Service. Once you done that, you can start running jobs using the FfDL CLI (executable
binary).
and default credentials for Deep Learning as a Service. Once you have done that, put all your model definition files into a zip file and
you can start running jobs using the FfDL CLI (executable binary).

Put all your model definition files into a zip file.
```shell
# Replace tf-model with the model you want to zip
pushd etc/examples/tf-model && zip ../tf-model.zip * && popd
```

Define the necessary environment variables and execute the training job.
```shell
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$PUBLIC_IP:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;

# Obtain the correct CLI for your machine and run the training job with our default TensorFlow model
CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model.zip
```
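
Once the job is submitted, you can monitor it with the same CLI (the `list` subcommand is the same one referenced earlier in this guide; treat the exact output columns as version-dependent):
```shell
# Show submitted jobs and their current status.
$CLI_CMD list
```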

## 7. Clean Up
15 changes: 11 additions & 4 deletions community/FfDL-H2Oai/README.md
@@ -64,19 +64,26 @@ restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
```

Replace the default object storage path with your s3_url. You can skip this step if your already modified the object storage path with your s3_url.
Create a temporary manifest file and replace the default object storage path with your s3_url. You can skip this step if you already modified the object storage path with your s3_url.
```shell
cp community/FfDL-H2Oai/h2o-model/manifest-h2o.yml community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml
if [ "$(uname)" = "Darwin" ]; then
sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o.yml
sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml
else
sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o.yml
sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml
fi
```

Now, put all your model definition files into a zip file.
```shell
# Replace h2o-model with the model you want to zip
pushd community/FfDL-H2Oai/h2o-model && zip ../h2o-model.zip * && popd
```

Obtain the correct CLI for your machine and run the training job with our default H2O model
```shell
CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train community/FfDL-H2Oai/h2o-model/manifest-h2o.yml community/FfDL-H2Oai/h2o-model
$CLI_CMD train community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml community/FfDL-H2Oai/h2o-model.zip
```

Congratulations, you have submitted your first H2O job on FfDL. You can check your job status either from the FfDL UI or simply run `$CLI_CMD list`
7 changes: 3 additions & 4 deletions demos/fashion-mnist-training/fashion-train/README.md
@@ -64,17 +64,16 @@ Create a .yml file with the necessary information. manifest.yml has further instructions
We will now go back to this directory and deploy our training job to FfDL using the path to the .yml file and the path to the zip containing experiment.py.
```bash
cd <path to this demo repo>/fashion-train
# Replace manifest.yml with the path to your .yml file
# Replace Image with the path to the folder containing your file created in step 6
$CLI_CMD train manifest.yml fashion-training
pushd fashion-training && zip ../fashion-training.zip * && popd # Put all your model definition files into a zip file.
$CLI_CMD train manifest.yml fashion-training.zip # Replace manifest.yml and fashion-training.zip with the path to your .yml and .zip files
```
## Step 3b - Deploying the training job to FfDL using the FfDL UI

Alternatively, the FfDL UI can be used to deploy jobs. First zip your model.
```bash
# Replace fashion-training with the path to your training file folder
# Replace fashion-training.zip with the path where you want the .zip file stored
pushd fashion-training && zip fashion-training * && popd
pushd fashion-training && zip ../fashion-training.zip * && popd
```

Go to the FfDL web UI. Upload the .zip to "Choose model definition zip to upload". Upload the .yml to "Choose manifest to upload". Then click Submit Training Job.
3 changes: 1 addition & 2 deletions docs/user-guide.md
@@ -94,9 +94,8 @@ Here are the [example manifest files](../etc/examples/tf-model/manifest.yml) for
--test_images_file ${DATA_DIR}/t10k-images-idx3-ubyte.gz

### 2.5. Creating Model zip file
**Note** that FfDL CLI can take both zip or unzip files.

You need to zip all the model definition files and create a model zip file for jobs submitting on FfDL UI. At present, FfDL UI only supports zip format for model files, other compression formats like gzip, bzip, tar etc., are not supported. **Note** that all model definition files has to be in the first level of the zip file and there are no nested directories in the zip file.
You need to zip all the model definition files into a model zip file for submitting jobs on FfDL. At present, FfDL only supports the zip format for model files; other compression formats such as gzip, bzip2, and tar are not supported. **Note** that all model definition files have to be at the first level of the zip file, with no nested directories inside the zip file.
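
For example, a minimal way to produce a flat archive from a model directory (`my-model` here is a hypothetical directory name):
```shell
# Correct: zip from inside the model directory so entries sit at the first level.
pushd my-model && zip ../my-model.zip * && popd

# Incorrect: zipping the directory itself nests everything one level deep
# (my-model/model.py), which FfDL will not accept:
#   zip -r my-model.zip my-model
```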

### 2.6. Model Deployment and Training
After creating the manifest file and model definition file, you can either use the FfDL CLI or FfDL UI to deploy your model.