A configurable Python script to upload and run a simple PySpark job on Google Cloud Platform.
- Creates a Google Cloud Storage bucket
- Uploads a data file
- Uploads a PySpark job file
- Creates a Hadoop Cluster on Dataproc
- Submits the PySpark job to the cluster
- Deletes the Hadoop Cluster
pip install google-cloud-storage google-cloud-dataproc
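To check that both libraries installed correctly, you can try importing them from the terminal (google.cloud.storage and google.cloud.dataproc_v1 are the module paths these packages expose):
python -c "import google.cloud.storage, google.cloud.dataproc_v1"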
https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication
- Follow the guide to create credentials and download them as a JSON file
- Move the JSON file to a location you'll remember, for example
C:/big_data/
- Rename the file to something short and sweet, e.g.
gcp_auth.json
- Set an environment variable with the JSON file location using a terminal
e.g. for Windows
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\big_data\gcp_auth.json"
e.g. for Mac / Unix
export GOOGLE_APPLICATION_CREDENTIALS="/home/username/big_data/gcp_auth.json"
Note that this variable only exists within the terminal where you executed the command, so you'll need to repeat this step each time you open a new terminal. If you are working on your own laptop, you can set the environment variable permanently.
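Alternatively, the variable can be set from Python itself before any Google Cloud client is created; the client libraries read it when they look up default credentials. A minimal sketch, assuming the example Windows path from above:

import os

# Must run before any google-cloud client is constructed.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"C:\big_data\gcp_auth.json"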
Copy settings.py and run_job.py into your project folder, where you have the data and your Python program. Update settings.py for the specific job:
BUCKET_NAME = 'your bucket name'
DATA_FILENAME = 'local file containing data for the job'
CODE_FILENAME = 'local python file containing logic for the job'
PROJECT_ID = 'your project ID on Google Cloud Platform'
REGION = 'region where you want to run the job'
CLUSTER_NAME = 'a name for the Hadoop cluster'
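For example, a filled-in settings.py might look like this (all values below are made up; substitute your own):

# settings.py -- example values only
BUCKET_NAME = 'my-pyspark-demo-bucket'
DATA_FILENAME = 'input_data.txt'
CODE_FILENAME = 'wordcount.py'
PROJECT_ID = 'my-gcp-project-123456'
REGION = 'europe-west1'
CLUSTER_NAME = 'demo-cluster'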
In the terminal (make sure that GOOGLE_APPLICATION_CREDENTIALS has been set), run:
python run_job.py
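For reference, the steps listed at the top map onto the two client libraries roughly as follows. This is a simplified sketch, not the script itself: the machine types, worker count and the lack of error handling are assumptions made here for brevity.

from google.cloud import dataproc_v1, storage
from settings import (BUCKET_NAME, DATA_FILENAME, CODE_FILENAME,
                      PROJECT_ID, REGION, CLUSTER_NAME)

# 1. Create the bucket and upload the data and job files.
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.create_bucket(BUCKET_NAME, location=REGION)
bucket.blob(DATA_FILENAME).upload_from_filename(DATA_FILENAME)
bucket.blob(CODE_FILENAME).upload_from_filename(CODE_FILENAME)

# Dataproc needs a regional endpoint when the region is not 'global'.
endpoint = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

# 2. Create the Hadoop/Spark cluster (sizes here are example values).
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
).result()  # block until the cluster is ready

# 3. Submit the PySpark job, pointing it at the uploaded code file.
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET_NAME}/{CODE_FILENAME}"},
}
job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
).result()  # block until the job finishes

# 4. Delete the cluster again so it stops incurring charges.
cluster_client.delete_cluster(
    request={"project_id": PROJECT_ID, "region": REGION,
             "cluster_name": CLUSTER_NAME}
).result()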