Batch processing with PySpark

PySpark Playground for NY Taxi Tripdata Batch Processing Pipeline

Launching `pyspark` from the project virtualenv (set up in Getting Started below) greets you with the familiar shell banner:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.11.11 (main, Jan 14 2025 23:36:41)
Spark context Web UI available at http://192.168.15.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1738438879580).
SparkSession available as 'spark'.

Getting Started

1. Install JDK 17 or 11, Spark 3.5.x, and Hadoop with SDKMan:

sdk i java 17.0.13-librca
sdk i spark 3.5.3
sdk i hadoop 3.3.6
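
A quick, optional sanity check that everything landed on your PATH:

java -version
spark-submit --version
hadoop version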

2. Install the dependencies from pyproject.toml and activate the resulting virtualenv:

uv sync && source .venv/bin/activate
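
If the sync succeeded, PySpark should now be importable from the virtualenv:

python -c "import pyspark; print(pyspark.__version__)"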

3. (Optional) Install pre-commit:

brew install pre-commit

# From root folder where `.pre-commit-config.yaml` is located, run:
pre-commit install
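
# Optionally, exercise the hooks once against the whole tree:
pre-commit run --all-files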

4. Spin up the local Spark cluster with Docker Compose:

docker compose -f ../compose.yaml up -d
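
Which master URL and UI ports the cluster exposes depends on the compose file; a common layout serves the master web UI at http://localhost:8080 and accepts jobs at spark://localhost:7077, but check ../compose.yaml for the actual values.

With the toolchain, virtualenv, and cluster in place, a minimal batch job over the tripdata could look like the sketch below. The input and output paths are hypothetical, and local[*] is used for a quick local run; the column names tpep_pickup_datetime and total_amount follow the TLC yellow tripdata schema.

# batch_job.py -- illustrative sketch, not a file shipped with this repo.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession.builder
    .appName("ny-taxi-daily-stats")
    .master("local[*]")  # swap for the cluster master URL to use the Docker cluster
    .getOrCreate()
)

# Hypothetical input path; download any month of yellow tripdata parquet first.
df = spark.read.parquet("data/yellow_tripdata_2024-01.parquet")

# Daily trip counts and average total fare, keyed on the pickup date.
daily = (
    df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
    .groupBy("pickup_date")
    .agg(
        F.count("*").alias("trips"),
        F.round(F.avg("total_amount"), 2).alias("avg_total"),
    )
    .orderBy("pickup_date")
)

daily.write.mode("overwrite").parquet("output/daily_trip_stats")
spark.stop()

To run it against the Docker cluster instead of locally, submit with the master URL from compose.yaml, e.g.:

spark-submit --master spark://localhost:7077 batch_job.py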

TODOs:

  • PEP-517: Packaging and dependency management with uv
  • Code format/lint with Ruff
  • Set up a Jupyter Playground for PySpark
  • Enable Spark to read from Google Cloud Storage
  • Enable Spark to read from AWS S3
  • Submit a PySpark job to Google Dataproc
  • Deploy Spark to Kubernetes with Helm on minikube or kind
  • Submit a PySpark job to the K8s Spark Cluster