Skip to content

Commit

Permalink
Dataproc GCS sample plus doc touchups (#1151)
Browse files Browse the repository at this point in the history
  • Loading branch information
waprin authored and Jon Wayne Parrott committed Nov 15, 2017
1 parent bdb0b65 commit be9f9d7
Show file tree
Hide file tree
Showing 3 changed files with 74 additions and 27 deletions.
69 changes: 43 additions & 26 deletions dataproc/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,15 +2,23 @@

Sample command-line programs for interacting with the Cloud Dataproc API.


Please see [the tutorial on the using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for more information.

Note that while this sample demonstrates interacting with Dataproc via the API, the functionality
demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program to demonstrate connecting to the
Dataproc API and listing the clusters in a region

`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and output the result.

`pyspark_sort.py_gcs` is the asme as `pyspark_sort.py` but demonstrates
reading from a GCS bucket.

## Prerequisites to run locally:

* [pip](https://pypi.python.org/pypi/pip)
Expand All @@ -19,50 +27,59 @@ Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

## Set Up Your Local Dev Environment

# Set Up Your Local Dev Environment
To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

* pip install -r requirements.txt

Create local credentials by running the following command and following the oauth2 flow:
## Authentication

Please see the [Google cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach to running these samples is a Service Account with a JSON key.

## Environment Variables

gcloud beta auth application-default login
Set the following environment variables:

GOOGLE_CLOUD_PROJECT=your-project-id
REGION=us-central1 # or your region
CLUSTER_NAME=waprin-spark7
ZONE=us-central1-b

## Running the samples

To run list_clusters.py:

python list_clusters.py <YOUR-PROJECT-ID> --region=us-central1
python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster, or use an existing one.
If you'd like to create a cluster ahead of time, either use the
[Cloud Console](console.cloud.google.com) or run:

To run submit_job_to_cluster.py, first create a GCS bucket, from the Cloud Console or with
gsutil:
gcloud dataproc clusters create your-cluster-name

gsutil mb gs://<your-input-bucket-name>

Then, if you want to rely on an existing cluster, run:

python submit_job_to_cluster.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name>

Otherwise, if you want the script to create a new cluster for you:
To run submit_job_to_cluster.py, first create a GCS bucket for Dataproc to stage files, from the Cloud Console or with
gsutil:

python submit_job_to_cluster.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name> --create_new_cluster
gsutil mb gs://<your-staging-bucket-name>

Set the environment variable's name:

This will setup a cluster, upload the PySpark file, submit the job, print the result, then
delete the cluster.
BUCKET=your-staging-bucket
CLUSTER=your-cluster-name

You can optionally specify a `--pyspark_file` argument to change from the default
`pyspark_sort.py` included in this script to a new script.
Then, if you want to rely on an existing cluster, run:

## Running on GCE, GAE, or other environments
python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

On Google App Engine, the credentials should be found automatically.
Otherwise, if you want the script to create a new cluster for you:

On Google Compute Engine, the credentials should be found automatically, but require that
you create the instance with the correct scopes.
python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

gcloud compute instances create --scopes="https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/compute,https://www.googleapis.com/auth/compute.readonly" test-instance
This will setup a cluster, upload the PySpark file, submit the job, print the result, then
delete the cluster.

If you did not create the instance with the right scopes, you can still upload a JSON service
account and set `GOOGLE_APPLICATION_CREDENTIALS`. See [Google Application Default Credentials](https://developers.google.com/identity/protocols/application-default-credentials) for more details.
You can optionally specify a `--pyspark_file` argument to change from the default
`pyspark_sort.py` included in this script to a new script.
2 changes: 1 addition & 1 deletion dataproc/pyspark_sort.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,5 +24,5 @@
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
words = sorted(rdd.collect())
print words
print(words)
# [END pyspark]
30 changes: 30 additions & 0 deletions dataproc/pyspark_sort_gcs.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
#!/usr/bin/env python
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

""" Sample pyspark script to be uploaded to Cloud Storage and run on
Cloud Dataproc.
Note this file is not intended to be run directly, but run inside a PySpark
environment.
This file demonstrates how to read from a GCS bucket. See README.md for more
information.
"""

# [START pyspark]
import pyspark

sc = pyspark.SparkContext()
rdd = sc.textFile('gs://path-to-your-GCS-file')
print(sorted(rdd.collect()))
# [END pyspark]

0 comments on commit be9f9d7

Please sign in to comment.