
Apache PySpark Custom Data Source Template

This repository provides a template for creating a custom data source for Apache PySpark. It is designed to help developers use PySpark’s data source API to extend Spark with custom data ingestion and storage mechanisms.

Motivation

When developing custom PySpark data sources, I encountered several challenges that made the development process frustrating:

  1. Environment Setup Complexity: Setting up a development environment for PySpark data source development was unnecessarily complex, with multiple dependencies and version conflicts.

  2. Test Data Management: Managing test data and maintaining consistent test environments across different machines was challenging.

  3. Debugging Issues: The default setup made it difficult to debug custom data source code effectively, especially when dealing with Spark's distributed nature.

  4. Documentation Gaps: Existing documentation for custom data source development was scattered and often incomplete.

This template repository aims to solve these pain points and provide a streamlined development experience.

Features

  • Pre-configured development environment
  • Ready-to-use test infrastructure
  • Example implementation
  • Automated tests setup
  • Debug-friendly configuration

Getting Started

Follow these steps to set up and use this repository:

Prerequisites

  • Docker
  • Visual Studio Code
  • Python 3.11

Creating a Repository from This Template

To create a new repository based on this template:

  1. Go to the GitHub repository.

  2. Click the Use this template button.

  3. Select Create a new repository.

  4. Choose a repository name and visibility (public or private), then click Create repository from template.

  5. Clone your new repository:

    git clone https://github.com/your-username/your-new-repository.git
    cd your-new-repository

Setup

  1. Open the repository in Visual Studio Code:

    code .
  2. Build and start the development container:

    Open the Command Palette (Ctrl+Shift+P) and select Dev Containers: Reopen in Container (labeled Remote-Containers: Reopen in Container in older versions of the extension).

  3. Initialize the environment:

    The environment will be initialized automatically by running the init-env.sh script defined in the devcontainer.json file.

Project Structure

The project follows this structure:

.
├── src/
│   ├── fake_source/         # Default fake data source implementation
│   │   ├── __init__.py
│   │   ├── source.py        # Implementation of the fake data source
│   │   ├── schema.py        # Schema definitions (if applicable)
│   │   └── utils.py         # Helper functions (if needed)
│   └── tests/               # Unit tests for the custom data source
│       ├── __init__.py
│       ├── test_source.py   # Tests for the data source
│       └── conftest.py      # Test configuration and fixtures
├── .devcontainer/           # Development container setup files
│   ├── Dockerfile
│   └── devcontainer.json
├── scripts/
│   └── init-env.sh          # Initialization script for setting up the environment
├── pyproject.toml           # Project dependencies and build system configuration
├── README.md                # Project documentation
└── LICENSE                  # License file

Usage

By default, this template includes a fake data source that generates mock data. You can use it as-is or replace it with your own implementation.
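
Under the hood, source.py implements PySpark’s Python data source API (introduced in PySpark 4.0 and available on recent Databricks runtimes). The following is a minimal sketch of such an implementation, assuming the standard pyspark.sql.datasource base classes; the template’s actual code may differ in naming and detail:

    # Minimal sketch of a batch fake data source; the template's actual
    # source.py may differ.
    from pyspark.sql.datasource import DataSource, DataSourceReader

    class FakeDataSource(DataSource):
        @classmethod
        def name(cls):
            # Short name used in spark.read.format("fake")
            return "fake"

        def schema(self):
            # Default schema, expressed as a DDL string
            return "name string, age int"

        def reader(self, schema):
            # self.options holds any .option(...) values passed by the reader
            return FakeDataSourceReader(schema, self.options)

    class FakeDataSourceReader(DataSourceReader):
        def __init__(self, schema, options):
            self.schema = schema
            self.options = options

        def read(self, partition):
            # Yield rows as plain tuples matching the schema
            yield ("Alice", 30)
            yield ("Bob", 25)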

  1. Register the custom data source:

    from pyspark.sql import SparkSession
    from fake_source.source import FakeDataSource
    
    spark = SparkSession.builder.getOrCreate()
    spark.dataSource.register(FakeDataSource)
  2. Read data using the custom data source:

    df = spark.read.format("fake").load()
    df.show()
  3. Run tests (a sketch of a minimal test appears after this list):

    pytest
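
A minimal test might look like the following, assuming a session-scoped SparkSession fixture; the template’s actual conftest.py and test_source.py may be organized differently:

    # Sketch of a test; fixture placement (e.g. in conftest.py) is up to you.
    import pytest
    from pyspark.sql import SparkSession
    from fake_source.source import FakeDataSource

    @pytest.fixture(scope="session")
    def spark():
        session = (
            SparkSession.builder
            .master("local[1]")
            .appName("fake-source-tests")
            .getOrCreate()
        )
        session.dataSource.register(FakeDataSource)
        yield session
        session.stop()

    def test_fake_source_returns_rows(spark):
        df = spark.read.format("fake").load()
        assert df.count() > 0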

Customization

To replace the fake data source with your own:

  1. Rename the package folder:

    mv src/fake_source src/your_datasource_name
  2. Update imports in source.py and other files:

    from your_datasource_name.source import CustomDataSource
  3. Update pyproject.toml to reflect the new package name.

  4. Modify the schema and options in source.py to fit your use case, as sketched below.
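
For illustration, a customized source might surface a user-supplied option like this; “mysource” and “numRows” are hypothetical names, not part of the template:

    # Hypothetical sketch of option handling in a customized source.py;
    # "numRows" is an illustrative option name, not defined by the template.
    from pyspark.sql.datasource import DataSource, DataSourceReader

    class CustomDataSource(DataSource):
        @classmethod
        def name(cls):
            return "mysource"

        def schema(self):
            return "id int, value string"

        def reader(self, schema):
            return CustomReader(schema, self.options)

    class CustomReader(DataSourceReader):
        def __init__(self, schema, options):
            self.schema = schema
            # Option values arrive as strings from .option(...) calls
            self.num_rows = int(options.get("numRows", 10))

        def read(self, partition):
            for i in range(self.num_rows):
                yield (i, f"value-{i}")

You would then read it with spark.read.format("mysource").option("numRows", 3).load().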

References

  1. Microsoft Learn - PySpark custom data sources

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For issues and questions, please use the GitHub Issues section.

Need Help Setting Up a Data Intelligence Platform with Databricks?

If you need expert guidance on setting up a modern data intelligence platform using Databricks, we can help. Our consultancy specializes in:

  • Custom data source development for Databricks and Apache Spark
  • Optimizing ETL pipelines for performance and scalability
  • Data governance and security using Unity Catalog
  • Building ML & AI solutions on Databricks

🚀 Contact us for a consultation and take your data platform to the next level.
