diff --git a/README.md b/README.md
index 0903a914..d4d678cd 100644
--- a/README.md
+++ b/README.md
@@ -26,30 +26,28 @@ To know more about the architectural details, please read the [design document](
 * `S3 CLI`: The [command-line interface](https://aws.amazon.com/cli/) to configure your Object Storage
 
-* An existing Kubernetes cluster (e.g., [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) for local testing).
+* An existing Kubernetes cluster (e.g., [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) for local testing), or follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md).
-* Follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md)
-
 * The minimum capacity requirement for FfDL is 4GB Memory and 3 CPUs.
 
 ## Usage Scenarios
 
-* If you have a FfDL deployment up and running, you can jump to [FfDL User Guide](docs/user-guide.md) to use FfDL for training your deep learning models.
+* **[FfDL User Guide](docs/user-guide.md)**: If you have a FfDL deployment up and running, you can jump to the [FfDL User Guide](docs/user-guide.md) to use FfDL for training your deep learning models.
 
-* If you have FfDL confiugured to use GPUs, and want to train using GPUs, follow steps [here](docs/gpu-guide.md)
+* **[Training with GPUs](docs/gpu-guide.md)**: If you have FfDL configured to use GPUs and want to train using GPUs, follow the steps [here](docs/gpu-guide.md)
 
-* If you have used FfDL to train your models, and want to use a GPU enabled public cloud hosted service for further training and serving, please follow instructions [here](etc/converter/ffdl-wml.md) to train and serve your models using [Watson Studio Deep Learning](https://www.ibm.com/cloud/deep-learning) service.
+* **[Conversion from FfDL to WML service](etc/converter/ffdl-wml.md)**: If you have used FfDL to train your models and want to use a GPU-enabled public cloud hosted service for further training and serving, please follow the instructions [here](etc/converter/ffdl-wml.md) to train and serve your models using the [Watson Studio Deep Learning](https://www.ibm.com/cloud/deep-learning) service.
 
-* If you are getting started and want to setup your own FfDL deployment, please follow the steps [below](#1-quick-start).
+* **[FfDL QuickStart](#1-quick-start)**: If you are getting started and want to set up your own FfDL deployment, please follow the steps [below](#1-quick-start).
 
-* If you want to leverage Jupyter notebooks to launch training on your FfDL cluster, please follow [these instructions](etc/notebooks/art)
+* **[Leveraging Jupyter Notebooks](etc/notebooks/art)**: If you want to leverage Jupyter notebooks to launch training on your FfDL cluster, please follow [these instructions](etc/notebooks/art)
 
-* To invoke [Adversarial Robustness Toolbox](https://github.com/IBM/adversarial-robustness-toolbox) to find vulnerabilities in your models, follow the [instructions here](etc/notebooks/art)
+* **[Adversarial Robustness Toolbox Integration](etc/notebooks/art)**: To invoke the [Adversarial Robustness Toolbox](https://github.com/IBM/adversarial-robustness-toolbox) to find vulnerabilities in your models, follow the [instructions here](etc/notebooks/art)
 
-* To deploy your trained models, follow [the integration guide with Seldon](community/FfDL-Seldon)
+* **[Model Deployment with Seldon](community/FfDL-Seldon)**: To deploy your trained models, follow [the integration guide with Seldon](community/FfDL-Seldon)
 
-* If you are looking for related collateral, slides, webinars, blogs and other materials related to FfDL, please [find them here](demos)
+* **[Demos and Publications](demos)**: If you are looking for related collateral, slides, webinars, blogs, and other materials related to FfDL, please [find them here](demos)
 
 ## Steps
 
@@ -268,8 +266,6 @@ export AWS_ACCESS_KEY_ID=test; export AWS_SECRET_ACCESS_KEY=test; export AWS_DEF
 s3cmd="aws --endpoint-url=$s3_url s3"
 $s3cmd mb s3://tf_training_data
 $s3cmd mb s3://tf_trained_model
-$s3cmd mb s3://mnist_lmdb_data
-$s3cmd mb s3://dlaas-trained-models
 ```
 
 3. Now, create a temporary repository, download the necessary images for training and labeling our TensorFlow model, and upload those images
@@ -293,8 +289,9 @@ restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nod
 export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
 ```
 
-Replace the default object storage path with your s3_url. You can skip this step if your already modified the object storage path with your s3_url.
+Create a temporary manifest file and replace the default object storage path with your s3_url. You can skip this step if you have already modified the object storage path with your s3_url.
 ```shell
+cp etc/examples/tf-model/manifest.yml etc/examples/tf-model/manifest-temp.yml
 if [ "$(uname)" = "Darwin" ]; then
 sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
 else
@@ -302,32 +299,34 @@ else
 fi
 ```
 
+Now, put all your model definition files into a zip file.
+```shell
+# Replace tf-model with the model you want to zip
+pushd etc/examples/tf-model && zip ../tf-model.zip * && popd
+```
+
 Define the FfDL command line interface and run the training job with our default TensorFlow model
 ```shell
 CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
-$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
+$CLI_CMD train etc/examples/tf-model/manifest-temp.yml etc/examples/tf-model.zip
 ```
 
 Congratulations, you had submitted your first job on FfDL. You can check your FfDL status either from the FfDL UI or simply run `$CLI_CMD list`
 
 > You can learn about how to create your own model definition files and `manifest.yaml` at [user guild](docs/user-guide.md#2-create-new-models-with-ffdl).
 
-5. If you want to run your job via the FfDL UI, simply run the below command to create your model zip file.
-
-```shell
-# Replace tf-model with the model you want to zip
-pushd etc/examples/tf-model && zip ../tf-model.zip * && popd
-```
+5. If you want to run your job via the FfDL UI, simply upload `tf-model.zip` and `manifest.yml` (the default TensorFlow model) from the `etc/examples/` directory as shown below.
 
-Then, upload `tf-model.zip` and `manifest.yml` (The default TensorFlow model) in the `etc/examples/` repository as shown below.
 Then, click `Submit Training Job` to run your job.
 
 ![ui-example](docs/images/ui-example.png)
 
 6. (Optional) Since it's simple and straightforward to submit jobs with different deep learning framework on FfDL, let's try to run a Caffe Job. Download all the necessary training and testing images in [LMDB format](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) for our Caffe model
-and upload those images to your mnist_lmdb_data bucket.
+and upload those images to your mnist_lmdb_data bucket. Then replace the object storage endpoint and package the model definition files into a zip file.
 ```shell
+$s3cmd mb s3://mnist_lmdb_data
+$s3cmd mb s3://dlaas-trained-models
 for phase in train test;
 do
 for file in data.mdb lock.mdb;
@@ -337,12 +336,19 @@ do
 $s3cmd cp $tmpfile s3://mnist_lmdb_data/$phase/$file
 done
 done
+
+cp etc/examples/caffe-model/manifest.yml etc/examples/caffe-model/manifest-temp.yml
+if [ "$(uname)" = "Darwin" ]; then
+ sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/caffe-model/manifest-temp.yml
+else
+ sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/caffe-model/manifest-temp.yml
+fi
+
+# Package the Caffe model definition files into a zip file
+pushd etc/examples/caffe-model && zip ../caffe-model.zip * && popd
 ```
 
-7. Now train your Caffe Job.
+Now train your Caffe Job.
 ```shell
-$CLI_CMD train etc/examples/caffe-model/manifest.yml etc/examples/caffe-model
+$CLI_CMD train etc/examples/caffe-model/manifest-temp.yml etc/examples/caffe-model.zip
 ```
 
 Congratulations, now you know how to deploy jobs with different deep learning framework. To learn more about your job execution results,
@@ -406,16 +412,23 @@ fi
 ```
 
 6. Now you should have all the necessary training data set in your training data bucket. Let's go ahead to set up your restapi endpoint
-and default credentials for Deep Learning as a Service. Once you done that, you can start running jobs using the FfDL CLI (executable
-binary).
+and default credentials for Deep Learning as a Service. Once you have done that, put all your model definition files into a zip file and
+you can start running jobs using the FfDL CLI (executable binary).
+
+Put all your model definition files into a zip file.
+```shell
+# Replace tf-model with the model you want to zip
+pushd etc/examples/tf-model && zip ../tf-model.zip * && popd
+```
+Define the necessary environment variables and execute the training job.
 ```shell
 restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
 export DLAAS_URL=http://$PUBLIC_IP:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
 # Obtain the correct CLI for your machine and run the training job with our default TensorFlow model
 CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
-$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
+$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model.zip
 ```
 
 ## 7. Clean Up
diff --git a/community/FfDL-H2Oai/README.md b/community/FfDL-H2Oai/README.md
index 7ebb9dea..d2276642 100644
--- a/community/FfDL-H2Oai/README.md
+++ b/community/FfDL-H2Oai/README.md
@@ -64,19 +64,26 @@ restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nod
 export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
 ```
 
-Replace the default object storage path with your s3_url. You can skip this step if your already modified the object storage path with your s3_url.
+Create a temporary manifest file and replace the default object storage path with your s3_url. You can skip this step if you have already modified the object storage path with your s3_url.
 ```shell
+cp community/FfDL-H2Oai/h2o-model/manifest-h2o.yml community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml
 if [ "$(uname)" = "Darwin" ]; then
- sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o.yml
+ sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml
 else
- sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o.yml
+ sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml
 fi
 ```
 
+Now, put all your model definition files into a zip file.
+```shell
+# Replace h2o-model with the model you want to zip
+pushd community/FfDL-H2Oai/h2o-model && zip ../h2o-model.zip * && popd
+```
+
 Obtain the correct CLI for your machine and run the training job with our default H2O model
 ```shell
 CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
-$CLI_CMD train community/FfDL-H2Oai/h2o-model/manifest-h2o.yml community/FfDL-H2Oai/h2o-model
+$CLI_CMD train community/FfDL-H2Oai/h2o-model/manifest-h2o-temp.yml community/FfDL-H2Oai/h2o-model.zip
 ```
 
 Congratulations, you had submitted your first H2O job on FfDL. You can check your FfDL status either from the FfDL UI or simply run `$CLI_CMD list`
diff --git a/demos/fashion-mnist-training/fashion-train/README.md b/demos/fashion-mnist-training/fashion-train/README.md
index 77e23548..e9e31b2c 100644
--- a/demos/fashion-mnist-training/fashion-train/README.md
+++ b/demos/fashion-mnist-training/fashion-train/README.md
@@ -64,9 +64,8 @@ Create a .yml file with the necessary information. manifest.yml has further inst
 We will now go back to this directory and deploy our training job to FfDL using the path to the .yml and path to the folder containing the experiment.py
 ```bash
 cd /fashion-train
-# Replace manifest.yml with the path to your .yml file
-# Replace Image with the path to the folder containing your file created in step 6
-$CLI_CMD train manifest.yml fashion-training
+pushd fashion-training && zip ../fashion-training.zip * && popd # Put all your model definition files into a zip file.
+$CLI_CMD train manifest.yml fashion-training.zip # Replace manifest.yml and fashion-training.zip with the path to your .yml and .zip files
 ```
 
 ## Step 3b - Deploying the training job to FfDL using the FfDL UI
@@ -74,7 +73,7 @@ Alternatively, the FfDL UI can be used to deploy jobs. First zip your model.
 ```bash
 # Replace fashion-training with the path to your training file folder
 # Replace fashion-training.zip with the path where you want the .zip file stored
-pushd fashion-training && zip fashion-training * && popd
+pushd fashion-training && zip ../fashion-training.zip * && popd
 ```
 
 Go to FfDL web UI. Upload the .zip to "Choose model definition zip to upload". Upload the .yml to "Choose manifest to upload". Then click Submit Training Job.
diff --git a/docs/user-guide.md b/docs/user-guide.md
index b7f8caf9..507320af 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -94,9 +94,8 @@ Here are the [example manifest files](../etc/examples/tf-model/manifest.yml) for
 --test_images_file ${DATA_DIR}/t10k-images-idx3-ubyte.gz
 
 ### 2.5. Creating Model zip file
 
-**Note** that FfDL CLI can take both zip or unzip files.
-You need to zip all the model definition files and create a model zip file for jobs submitting on FfDL UI. At present, FfDL UI only supports zip format for model files, other compression formats like gzip, bzip, tar etc., are not supported. **Note** that all model definition files has to be in the first level of the zip file and there are no nested directories in the zip file.
+You need to zip all the model definition files and create a model zip file for submitting jobs on FfDL. At present, FfDL only supports the zip format for model files; other compression formats such as gzip, bzip, and tar are not supported. **Note** that all model definition files have to be in the first level of the zip file, with no nested directories in the zip file.
 
 ### 2.6. Model Deployment and Training
 After creating the manifest file and model definition file, you can either use the FfDL CLI or FfDL UI to deploy your model.
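+
+For example, here is a minimal sketch of packaging a model and submitting it with the FfDL CLI (the `my-model` directory and `manifest.yml` path below are placeholders for your own model definition folder and manifest, and `$CLI_CMD` is assumed to point to the FfDL CLI binary as set up in the quick-start steps):
+```shell
+# Package the model definition files into a flat zip (no nested directories)
+pushd my-model && zip ../my-model.zip * && popd
+# Submit the training job with your manifest and the model zip
+$CLI_CMD train manifest.yml my-model.zip
+# Check the status of your submitted jobs
+$CLI_CMD list
+```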