A configurable Python script to upload and run a simple PySpark job on Google Cloud Platform.
- Creates a Google Cloud Storage bucket
- Uploads a data file
- Uploads a PySpark job file
- Creates a Hadoop Cluster on Dataproc
- Submits the PySpark job to the cluster
- Deletes the Hadoop Cluster
pip install google-cloud-storage google-cloud-dataproc
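To check that both libraries installed correctly, you can try importing them from the terminal (google.cloud.storage and google.cloud.dataproc_v1 are the module paths these packages expose):
python -c "import google.cloud.storage, google.cloud.dataproc_v1"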
https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication
- Follow the guide to create credentials and download them as a JSON file
- Move the JSON file to a location you'll remember, for example
C:/big_data/
- Rename the file to something short and sweet, e.g.
gcp_auth.json
- Set an environment variable with the JSON file location using a terminal
e.g. for Windows
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\big_data\gcp_auth.json"
e.g. for Mac / Unix
export GOOGLE_APPLICATION_CREDENTIALS="/home/username/big_data/gcp_auth.json"
Note that this variable only exists within the terminal where you executed the command, so you'll need to repeat this step each time you open a new terminal. If you are working on your own laptop, you can set the environment variable permanently.
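Alternatively, the variable can be set from Python itself before any Google Cloud client is created; the client libraries read it when they look up default credentials. A minimal sketch, assuming the example Windows path from above:

import os

# Must run before any google-cloud client is constructed.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"C:\big_data\gcp_auth.json"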
Copy settings.py and run_job.py into your project folder, where you have the data and your Python program. Update settings.py for the specific job:
BUCKET_NAME = 'your bucket name'
DATA_FILENAME = 'local file containing data for the job'
CODE_FILENAME = 'local python file containing logic for the job'
PROJECT_ID = 'your project ID on Google Cloud Platform'
REGION = 'region where you want to run the job'
CLUSTER_NAME = 'a name for the Hadoop cluster'
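For example, a filled-in settings.py might look like this (all values below are made up; substitute your own):

# settings.py -- example values only
BUCKET_NAME = 'my-pyspark-demo-bucket'
DATA_FILENAME = 'input_data.txt'
CODE_FILENAME = 'wordcount.py'
PROJECT_ID = 'my-gcp-project-123456'
REGION = 'europe-west1'
CLUSTER_NAME = 'demo-cluster'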
In the terminal (make sure that GOOGLE_APPLICATION_CREDENTIALS has been set), run:
python run_job.py
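For reference, the steps listed at the top map onto the two client libraries roughly as follows. This is a simplified sketch, not the script itself: the machine types, worker count and the lack of error handling are assumptions made here for brevity.

from google.cloud import dataproc_v1, storage
from settings import (BUCKET_NAME, DATA_FILENAME, CODE_FILENAME,
                      PROJECT_ID, REGION, CLUSTER_NAME)

# 1. Create the bucket and upload the data and job files.
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.create_bucket(BUCKET_NAME, location=REGION)
bucket.blob(DATA_FILENAME).upload_from_filename(DATA_FILENAME)
bucket.blob(CODE_FILENAME).upload_from_filename(CODE_FILENAME)

# Dataproc needs a regional endpoint when the region is not 'global'.
endpoint = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

# 2. Create the Hadoop/Spark cluster (sizes here are example values).
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
).result()  # block until the cluster is ready

# 3. Submit the PySpark job, pointing it at the uploaded code file.
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET_NAME}/{CODE_FILENAME}"},
}
job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
).result()  # block until the job finishes

# 4. Delete the cluster again so it stops incurring charges.
cluster_client.delete_cluster(
    request={"project_id": PROJECT_ID, "region": REGION,
             "cluster_name": CLUSTER_NAME}
).result()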