PySpark Playground for NY Taxi Tripdata Batch Processing Pipeline
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.11.11 (main, Jan 14 2025 23:36:41)
Spark context Web UI available at http://192.168.15.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1738438879580).
SparkSession available as 'spark'.
```
1. Install JDK 17 or 11, Spark 3.5.x, and Hadoop with SDKMAN!:

   ```shell
   sdk i java 17.0.13-librca
   sdk i spark 3.5.3
   sdk i hadoop 3.3.6
   ```

2. Install dependencies from `pyproject.toml` and activate the created virtualenv:

   ```shell
   uv sync && source .venv/bin/activate
   ```

3. (Optional) Install pre-commit:

   ```shell
   brew install pre-commit
   # From the root folder, where `.pre-commit-config.yaml` is located, run:
   pre-commit install
   ```

4. Spin up the Spark cluster with:

   ```shell
   docker compose -f ../compose.yaml up -d
   ```
- PEP 517: Packaging and dependency management with uv
- Code format/lint with Ruff
- Set up a Jupyter Playground for PySpark
- Enable Spark to read from Google Cloud Storage
- Enable Spark to read from AWS S3
- Submit a PySpark job to Google Dataproc
- Deploy Spark to Kubernetes with Helm on minikube or kind
- Submit a PySpark job to the K8s Spark Cluster
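The cloud-storage items above come down to shipping the right connector jar and credentials to Spark. A hedged sketch of `spark-defaults.conf` entries — the connector versions, jar path, and credential providers here are assumptions, so match them to your actual Spark/Hadoop build:

```properties
# S3 via the s3a connector; hadoop-aws must match the Hadoop version
# (assumption: 3.3.6, as installed in the setup steps above)
spark.jars.packages                           org.apache.hadoop:hadoop-aws:3.3.6
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.DefaultAWSCredentialsProviderChain

# GCS via the gcs-connector (jar location is an assumption)
spark.jars                                    /opt/spark/jars/gcs-connector-hadoop3-latest.jar
spark.hadoop.fs.gs.impl                       com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.google.cloud.auth.service.account.enable  true
```

With those in place, `spark.read.parquet("s3a://<bucket>/...")` and `spark.read.parquet("gs://<bucket>/...")` resolve through the respective connectors.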