This repository provides a template for creating a custom data source for Apache PySpark. It is designed to help developers extend PySpark’s data source API to support custom data ingestion and storage mechanisms.
When developing custom PySpark data sources, I encountered several challenges that made the development process frustrating:
- **Environment Setup Complexity**: Setting up a development environment for PySpark data source development was unnecessarily complex, with multiple dependencies and version conflicts.
- **Test Data Management**: Managing test data and maintaining consistent test environments across different machines was challenging.
- **Debugging Issues**: The default setup made it difficult to debug custom data source code effectively, especially when dealing with Spark's distributed nature.
- **Documentation Gaps**: Existing documentation for custom data source development was scattered and often incomplete.
This template repository aims to solve these pain points and provide a streamlined development experience.
- Pre-configured development environment
- Ready-to-use test infrastructure
- Example implementation
- Automated tests setup
- Debug-friendly configuration
Follow these steps to set up and use this repository. You will need the following prerequisites:
- Docker
- Visual Studio Code
- Python 3.11
To create a new repository based on this template:
- Go to the GitHub repository.
- Click the **Use this template** button.
- Select **Create a new repository**.
- Choose a repository name and visibility (public or private), then click **Create repository from template**.
- Clone your new repository:

  ```bash
  git clone https://github.com/your-username/your-new-repository.git
  cd your-new-repository
  ```
- Open the repository in Visual Studio Code:

  ```bash
  code .
  ```
- Build and start the development container: open the command palette (Ctrl+Shift+P) and select `Remote-Containers: Reopen in Container`.
- Initialize the environment: the environment is initialized automatically by running the `init-env.sh` script defined in the `devcontainer.json` file.
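How the script is wired up depends on the devcontainer configuration; a common pattern is a `postCreateCommand` lifecycle hook, sketched below with illustrative values (the template's actual `devcontainer.json` may be configured differently):

```jsonc
{
  "name": "pyspark-datasource-template",
  "build": {
    "dockerfile": "Dockerfile"
  },
  // Assumed wiring: run the setup script once the container has been created.
  "postCreateCommand": "bash scripts/init-env.sh"
}
```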
The project follows this structure:
```text
.
├── src/
│   ├── fake_source/        # Default fake data source implementation
│   │   ├── __init__.py
│   │   ├── source.py       # Implementation of the fake data source
│   │   ├── schema.py       # Schema definitions (if applicable)
│   │   └── utils.py        # Helper functions (if needed)
│   └── tests/              # Unit tests for the custom data source
│       ├── __init__.py
│       ├── test_source.py  # Tests for the data source
│       └── conftest.py     # Test configuration and fixtures
├── .devcontainer/          # Development container setup files
│   ├── Dockerfile
│   └── devcontainer.json
├── scripts/
│   └── init-env.sh         # Initialization script for setting up the environment
├── pyproject.toml          # Project dependencies and build system configuration
├── README.md               # Project documentation
└── LICENSE                 # License file
```
By default, this template includes a fake data source that generates mock data. You can use it as-is or replace it with your own implementation.
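For orientation, a data source built on the Python Data Source API (available from PySpark 4.0) generally amounts to a `DataSource` subclass plus a `DataSourceReader`. The sketch below is a simplified, assumed version of what `source.py` might contain; the template's actual implementation may differ in schema and row contents:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader


class FakeDataSource(DataSource):
    """Generates a few rows of mock data under the short name 'fake'."""

    @classmethod
    def name(cls):
        # Short name used with spark.read.format("fake")
        return "fake"

    def schema(self):
        # Default schema, returned as a DDL string
        return "name string, value int"

    def reader(self, schema):
        return FakeDataSourceReader(schema, self.options)


class FakeDataSourceReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def read(self, partition):
        # Yield plain tuples matching the schema's column order
        yield ("Alice", 1)
        yield ("Bob", 2)
```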
- Register the custom data source:

  ```python
  from pyspark.sql import SparkSession
  from fake_source.source import FakeDataSource

  spark = SparkSession.builder.getOrCreate()
  spark.dataSource.register(FakeDataSource)
  ```

- Read data using the custom data source:

  ```python
  df = spark.read.format("fake").load()
  df.show()
  ```

- Run tests:

  ```bash
  pytest
  ```
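The test setup follows standard pytest conventions: `conftest.py` exposes a shared `SparkSession` fixture and `test_source.py` exercises the registered source. A minimal sketch, with fixture and test names that are assumptions rather than the template's exact ones:

```python
# conftest.py (sketch)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # local[*] keeps the whole test run on the local machine
    session = (
        SparkSession.builder.master("local[*]").appName("datasource-tests").getOrCreate()
    )
    yield session
    session.stop()


# test_source.py (sketch)
from fake_source.source import FakeDataSource


def test_fake_source_returns_rows(spark):
    spark.dataSource.register(FakeDataSource)
    df = spark.read.format("fake").load()
    assert df.count() > 0
```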
To replace the fake data source with your own:
- Rename the package folder:

  ```bash
  mv src/fake_source src/your_datasource_name
  ```

- Update imports in `source.py` and other files:

  ```python
  from your_datasource_name.source import CustomDataSource
  ```

- Update `pyproject.toml` to reflect the new package name.
- Modify the schema and options in `source.py` to fit your use case; a sketch of how options reach the source follows this list.
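Options set with `.option()` at read time arrive in the data source as a dictionary of strings (`self.options`), which is typically where most of the customization happens. A minimal sketch, assuming hypothetical option names `schema` and `num_rows`:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader


class CustomDataSource(DataSource):
    @classmethod
    def name(cls):
        return "custom"  # assumed short name

    def schema(self):
        # Let callers override the default schema via .option("schema", ...)
        return self.options.get("schema", "name string, value int")

    def reader(self, schema):
        return CustomReader(schema, self.options)


class CustomReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        # Option values are strings, so parse them as needed
        self.num_rows = int(options.get("num_rows", "3"))

    def read(self, partition):
        for i in range(self.num_rows):
            yield (f"row-{i}", i)
```

Callers could then tune it at read time, for example with `spark.read.format("custom").option("num_rows", "10").load()`.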
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions, please use the GitHub Issues section.
If you need expert guidance on setting up a modern data intelligence platform using Databricks, we can help. Our consultancy specializes in:
- Custom data source development for Databricks and Apache Spark
- Optimizing ETL pipelines for performance and scalability
- Data governance and security using Unity Catalog
- Building ML & AI solutions on Databricks
🚀 Contact us for a consultation and take your data platform to the next level.