Updated the xgboost-spark sample README #132

Merged 1 commit on Nov 7, 2018
30 changes: 18 additions & 12 deletions samples/xgboost-spark/README.md
@@ -1,28 +1,34 @@
## Overview
The `xgboost-training-cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.

The pipeline starts by creating a Google Cloud DataProc cluster, and then runs analysis, transformation, distributed training, and
prediction in the created cluster. A single-node confusion-matrix aggregator is then used (in the classification case) to
provide the confusion matrix data to the front end. Finally, a delete cluster operation runs to destroy the cluster it creates
in the beginning. The delete cluster operation is used as an exit handler, meaning it runs regardless of whether the pipeline
succeeds or fails.
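
For orientation, the exit-handler pattern described above looks roughly like the sketch below when written with the Kubeflow Pipelines SDK. This is not the sample's actual code: the op names, images, and commands are placeholders standing in for the DataProc components, and the SDK API may differ between versions.

```python
import kfp.dsl as dsl


@dsl.pipeline(
    name='exit-handler-sketch',
    description='Illustrates the exit-handler pattern used by the XGBoost sample.'
)
def exit_handler_sketch():
    # Stand-in op; the real sample uses a DataProc delete-cluster component here.
    delete_cluster = dsl.ContainerOp(
        name='delete-cluster',
        image='library/bash:4.4.23',
        command=['echo', 'deleting the DataProc cluster'])

    # Ops created inside the ExitHandler block form the main pipeline;
    # delete_cluster runs afterwards whether those ops succeed or fail.
    with dsl.ExitHandler(delete_cluster):
        create_cluster = dsl.ContainerOp(
            name='create-cluster',
            image='library/bash:4.4.23',
            command=['echo', 'creating the DataProc cluster'])

        train = dsl.ContainerOp(
            name='train-and-predict',
            image='library/bash:4.4.23',
            command=['echo', 'analysis, transform, training, prediction'])
        train.after(create_cluster)
```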

## Requirements
Preprocessing uses Google Cloud DataProc. Therefore, you must enable the [DataProc API](https://cloud.google.com/endpoints/docs/openapi/enable-api) for the given GCP project.

## Compile
Follow the guide to [building a pipeline](https://github.com/kubeflow/pipelines/wiki/Build-a-Pipeline) to install the Kubeflow Pipelines SDK and compile the sample Python into a workflow specification. The specification takes the form of a YAML file compressed into a `.tar.gz` file.
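
As a rough illustration (not taken from the sample itself), the SDK's Python compiler can turn a pipeline function into that `.tar.gz` specification. The pipeline below is a placeholder, and the exact compiler module path may vary by SDK version.

```python
import kfp.dsl as dsl
import kfp.compiler as compiler


@dsl.pipeline(name='compile-sketch', description='Placeholder pipeline for the compile step.')
def compile_sketch():
    dsl.ContainerOp(
        name='hello',
        image='library/bash:4.4.23',
        command=['echo', 'hello'])


# Writes compile-sketch.tar.gz, which can then be uploaded in the Kubeflow Pipelines UI.
compiler.Compiler().compile(compile_sketch, 'compile-sketch.tar.gz')
```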

## Deploy
Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (the `.tar.gz` file) as a new pipeline template.

## Run
Most arguments come with default values. Only `output` and `project` always need to be specified (see the example below the list).

* `output` is a Google Cloud Storage path which holds the
pipeline run results. Note that each pipeline run creates a unique directory under `output`, so it will not overwrite previous results.
* `project` is a GCP project ID.
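
For example, the two required arguments might be filled in as follows. The bucket and project ID are placeholders, not values from the sample.

```python
# Placeholder values; replace with your own GCS bucket and GCP project ID.
arguments = {
    'output': 'gs://your-bucket/xgboost-sample-output',
    'project': 'your-gcp-project-id',
}
```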

## Components source

Create Cluster:
[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/create_cluster)