diff --git a/samples/xgboost-spark/README.md b/samples/xgboost-spark/README.md
index 54ef2b3a51c..1754f36b224 100644
--- a/samples/xgboost-spark/README.md
+++ b/samples/xgboost-spark/README.md
@@ -1,28 +1,34 @@
 ## Overview
-The pipeline creates XGBoost models on structured data with CSV format. Both classification and regression are supported.
-The pipeline starts by creating an Google DataProc cluster, and then run analysis, transormation, distributed training and
+The `xgboost-training-cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
+
+The pipeline starts by creating a Google DataProc cluster, and then running analysis, transformation, distributed training and
 prediction in the created cluster. Then a single node confusion-matrix aggregator is used (for classification case) to
-provide frontend the confusion matrix data. At the end, a delete cluster operation runs to destroy the cluster it creates
-in the beginning. The delete cluster operation is used as an exit handler, meaning it will run regardless the pipeline fails
+provide the confusion matrix data to the front end. Finally, a delete cluster operation runs to destroy the cluster it creates
+in the beginning. The delete cluster operation is used as an exit handler, meaning it will run regardless of whether the pipeline fails
 or not.
 
 ## Requirements
-Preprocessing uses Google Cloud DataProc. So the [DataProc API](https://cloud.google.com/endpoints/docs/openapi/enable-api) needs to be enabled for the given project.
+
+Preprocessing uses Google Cloud DataProc. Therefore, you must enable the [DataProc API](https://cloud.google.com/endpoints/docs/openapi/enable-api) for the given GCP project.
 
 ## Compile
-Follow [README.md](https://github.com/kubeflow/pipelines/blob/master/samples/README.md) to install the compiler and
-compile your sample python into workflow yaml.
+
+Follow the guide to [building a pipeline](https://github.com/kubeflow/pipelines/wiki/Build-a-Pipeline) to install the Kubeflow Pipelines SDK and compile the sample Python into a workflow specification. The specification takes the form of a YAML file compressed into a `.tar.gz` file.
 
 ## Deploy
-Open the ML pipeline UI. Create a new pipeline, and then upload the compiled YAML file as a new pipeline template.
+
+Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (`.tar.gz` file) as a new pipeline template.
 
 ## Run
-Most arguments come with default values. Only "output" and "project" need to be filled always. "output" is a Google Storage path which holds
-pipeline run results. Note that each pipeline run will create a unique directory under output so it will not override previous results. "project"
-is a GCP project.
-## Components Source
+
+Most arguments come with default values. Only `output` and `project` always need to be specified.
+
+* `output` is a Google Cloud Storage path which holds
+pipeline run results. Note that each pipeline run will create a unique directory under `output` so it will not overwrite previous results.
+* `project` is a GCP project.
+
+## Components source
 
 Create Cluster: [source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/create_cluster)
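For context on the Compile step described in the diff above: with the Kubeflow Pipelines SDK, compiling a `@dsl.pipeline` function into the `.tar.gz` workflow specification generally looks like the sketch below. The toy pipeline, its op, and its parameter defaults are placeholders, not the sample's actual pipeline function (which lives in `xgboost-training-cm.py`).

```python
# Illustrative sketch only: compiling a pipeline with the Kubeflow Pipelines
# SDK (v1-style API). The single echo op is a stand-in; the real sample chains
# DataProc create-cluster, analyze, transform, train, predict,
# confusion-matrix, and delete-cluster ops.
import kfp.dsl as dsl
import kfp.compiler as compiler


@dsl.pipeline(name='toy-pipeline', description='Placeholder pipeline for illustration.')
def toy_pipeline(output='gs://your-bucket/path', project='your-gcp-project'):
    # Echo the two required arguments so the sketch compiles to a valid workflow.
    dsl.ContainerOp(
        name='echo-args',
        image='alpine:3.9',
        command=['sh', '-c'],
        arguments=['echo "output=%s project=%s"' % (output, project)],
    )


if __name__ == '__main__':
    # Produces toy-pipeline.tar.gz, the compressed workflow specification that
    # is uploaded through the pipelines UI as a new pipeline template.
    compiler.Compiler().compile(toy_pipeline, 'toy-pipeline.tar.gz')
```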
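Similarly, for the Run step: a run can also be submitted from the SDK instead of the UI. The sketch below assumes a reachable Kubeflow Pipelines endpoint and a compiled package from the previous step; the host, bucket, project, and package file name are all placeholders.

```python
# Sketch: submit the compiled package and supply the two required arguments.
import kfp

client = kfp.Client(host='http://localhost:8080')  # your pipelines endpoint
experiment = client.create_experiment('xgboost-spark-sample')

client.run_pipeline(
    experiment_id=experiment.id,
    job_name='xgboost-training-cm-run',
    pipeline_package_path='xgboost-training-cm.tar.gz',  # compiled specification
    params={
        # Each run writes to a unique directory under 'output', so earlier
        # results are not overwritten.
        'output': 'gs://your-bucket/xgboost-spark',
        'project': 'your-gcp-project',
    },
)
```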