This repo contains only the skeleton for running a Spark standalone cluster, extracted from this repo.
You can run the Spark standalone cluster by running:
make run
or with 3 workers using:
make run-scaled
You can submit Python jobs with the command:
make submit app=dir/relative/to/spark_apps/dir
For example, if you have ex6.py in your spark_apps folder:
make submit app=ex6.py
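If you need a starting point, a minimal PySpark job to place in spark_apps could look like the sketch below. The contents are illustrative, not a file shipped with this repo; spark-submit (invoked by make submit) supplies the master URL and cluster configuration, so the app only needs a SparkSession:

```python
from pyspark.sql import SparkSession

# Minimal example job. spark-submit (run via `make submit`) provides the
# master URL, so the app only has to create a SparkSession.
spark = SparkSession.builder.appName("example-job").getOrCreate()

# Build a tiny DataFrame and run a trivial aggregation to confirm the
# cluster actually executes work.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

spark.stop()
```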
There are a number of commands to build the standalone cluster; check the Makefile to see them all. The simplest one is:
make build
The master node's web UI can be accessed at localhost:9090.
The Spark history server is accessible at localhost:18080.
Since we are running the Spark cluster on Docker, the worker-related links in the UI do not work. To fix this, I created a generate-docker-compose script that generates the Docker Compose file (called docker-compose.generated.yml) with the desired number of workers, where each worker has its own assigned and exposed port.
To bring up this cluster, you can just run:
make run-generated
By default, the command will launch a Spark cluster with a master, a history server, and 3 worker nodes.
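For a rough idea of how such a generator can work, here is a sketch. It is written under assumptions: the service names, the base worker UI port of 8081, the 9090:8080 master mapping, and the use of SPARK_WORKER_WEBUI_PORT are illustrative, not the exact contents of the script in this repo:

```python
import sys

# Sketch of a Compose-file generator: one worker service per requested worker,
# each exposing its own web UI port so the worker links in the master UI resolve.
NUM_WORKERS = int(sys.argv[1]) if len(sys.argv) > 1 else 3
BASE_UI_PORT = 8081  # assumed first worker web UI port

blocks = ["""services:
  spark-master:
    build: .
    ports:
      - "9090:8080"
      - "7077:7077"
"""]

for i in range(1, NUM_WORKERS + 1):
    port = BASE_UI_PORT + i - 1
    blocks.append(f"""  spark-worker-{i}:
    build: .
    environment:
      - SPARK_WORKER_WEBUI_PORT={port}
    ports:
      - "{port}:{port}"
""")

with open("docker-compose.generated.yml", "w") as f:
    f.write("".join(blocks))
```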
After some time, I decided to add a JupyterLab service. For reference, see this GitHub repo.
JupyterLab runs on port 8888. There is a small example notebook showing how to get started.
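If you want to drive the cluster from a notebook yourself, a first cell along these lines is a reasonable sketch; the spark-master hostname and port 7077 are the usual defaults for this kind of Compose setup and may differ in this repo's configuration:

```python
from pyspark.sql import SparkSession

# Connect from the JupyterLab container to the standalone master.
# "spark-master" is assumed to be the Compose service name of the master node.
spark = (
    SparkSession.builder
    .appName("notebook-example")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

spark.range(10).show()  # quick sanity check that the cluster is reachable
spark.stop()
```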