Batch processing with PySpark

PySpark Playground for NY Taxi Tripdata Batch Processing Pipeline

Launching `pyspark` from the project virtualenv (set up in Getting Started below) greets you with the familiar shell banner:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.11.11 (main, Jan 14 2025 23:36:41)
Spark context Web UI available at http://192.168.15.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1738438879580).
SparkSession available as 'spark'.

Getting Started

1. Install JDK 17 or 11, Spark 3.5.x, and Hadoop with SDKMan:

sdk i java 17.0.13-librca
sdk i spark 3.5.3
sdk i hadoop 3.3.6
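
A quick, optional sanity check that everything landed on your PATH:

java -version
spark-submit --version
hadoop version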

2. Install the dependencies from pyproject.toml and activate the resulting virtualenv:

uv sync && source .venv/bin/activate
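
If the sync succeeded, PySpark should now be importable from the virtualenv:

python -c "import pyspark; print(pyspark.__version__)"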

3. (Optional) Install pre-commit:

brew install pre-commit

# From root folder where `.pre-commit-config.yaml` is located, run:
pre-commit install
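
# Optionally, exercise the hooks once against the whole tree:
pre-commit run --all-files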

4. Spin up the local Spark cluster with Docker Compose:

docker compose -f ../compose.yaml up -d
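
Which master URL and UI ports the cluster exposes depends on the compose file; a common layout serves the master web UI at http://localhost:8080 and accepts jobs at spark://localhost:7077, but check ../compose.yaml for the actual values.

With the toolchain, virtualenv, and cluster in place, a minimal batch job over the tripdata could look like the sketch below. The input and output paths are hypothetical, and local[*] is used for a quick local run; the column names tpep_pickup_datetime and total_amount follow the TLC yellow tripdata schema.

# batch_job.py -- illustrative sketch, not a file shipped with this repo.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession.builder
    .appName("ny-taxi-daily-stats")
    .master("local[*]")  # swap for the cluster master URL to use the Docker cluster
    .getOrCreate()
)

# Hypothetical input path; download any month of yellow tripdata parquet first.
df = spark.read.parquet("data/yellow_tripdata_2024-01.parquet")

# Daily trip counts and average total fare, keyed on the pickup date.
daily = (
    df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
    .groupBy("pickup_date")
    .agg(
        F.count("*").alias("trips"),
        F.round(F.avg("total_amount"), 2).alias("avg_total"),
    )
    .orderBy("pickup_date")
)

daily.write.mode("overwrite").parquet("output/daily_trip_stats")
spark.stop()

To run it against the Docker cluster instead of locally, submit with the master URL from compose.yaml, e.g.:

spark-submit --master spark://localhost:7077 batch_job.py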

TODOs:

  • PEP-517: Packaging and dependency management with uv
  • Code format/lint with Ruff
  • Set up a Jupyter Playground for PySpark
  • Enable Spark to read from Google Cloud Storage
  • Enable Spark to read from AWS S3
  • Submit a PySpark job to Google Dataproc
  • Deploy Spark to Kubernetes with Helm on minikube or kind
  • Submit a PySpark job to the K8s Spark Cluster