This project implements a microservices-based architecture for real-time stream processing using Apache Kafka and Apache Spark. The microservice consumes JSON data streams and applies transformations and analyses such as salary aggregation and gender distribution. Key highlights include containerization with Docker, deployment on Kubernetes, the Serverless Framework for Infrastructure as Code (IaC), and a CI/CD pipeline that enforces code quality through automated testing.
The main functionalities of the project are:
- Real-time ingestion and transformation of JSON data using Kafka and Spark.
- Statistical analysis of data streams, such as calculating average salaries and counting records by gender or birth month (see the PySpark sketch below).
- Flask-based RESTful API that interacts with the Spark data pipeline.
- Performance benchmarking with Locust to evaluate scalability and reliability.
The microservices are optimized for portability, scalability, and monitoring, with emphasis on distributed data pipelines for high-throughput applications.
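To make the aggregation step concrete, here is a minimal PySpark sketch of the salary and gender statistics described above. It assumes a flat JSON schema with `gender` and `salary` fields; the actual logic lives in the Spark scripts under `src/` (e.g., `count_gender.py`, `final_analysis.py`).

```python
# Minimal sketch of the salary/gender aggregations (illustrative only;
# the real logic lives in src/count_gender.py and src/final_analysis.py).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salary-analysis").getOrCreate()

# Assumes a flat JSON schema with "gender" and "salary" fields,
# as in data/emp_data.json.
df = spark.read.json("data/emp_data.json")

# Average salary across all records
df.select(F.avg("salary").alias("average_salary")).show()

# Record counts by gender
df.groupBy("gender").count().show()
```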
- Real-time Data Stream Processing: Uses Apache Spark and Kafka to process and analyze large-scale streaming data (a streaming sketch follows this list).
- Microservices Architecture: Developed with Flask for seamless interaction with data pipelines.
- Containerization: Dockerized microservices ensure portability and environment consistency.
- Infrastructure as Code: Utilizes Serverless Framework for cloud resource provisioning and management.
- Load Testing: Comprehensive performance benchmarking with Locust.
- Continuous Integration/Continuous Delivery: Automated CI/CD pipeline ensures deployment reliability.
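The streaming path can be pictured with a short Structured Streaming sketch like the one below. The topic name, bootstrap address, and schema are assumptions for illustration; the project's actual consumer is `src/read_json_from_topic.py`.

```python
# Illustrative Spark Structured Streaming consumer for a JSON Kafka topic.
# Topic name, bootstrap servers, and schema are assumptions; the project's
# actual consumer is src/read_json_from_topic.py. Requires the
# spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-json-reader").getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("gender", StringType()),
    StructField("salary", IntegerType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "json_topic")
    .load()
    # Kafka delivers raw bytes; decode and parse the JSON payload.
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Print parsed records to the console as they arrive.
query = stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```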
- Install Docker:
  - Follow the Docker installation guide.
- Install Kubernetes:
  - Follow the Kubernetes installation guide.
- Install Helm (used to deploy the Bitnami charts below):
  - Kafka: Bitnami Kafka Chart
  - Spark: Bitnami Spark Chart
- Install Locust for Load Testing:
pip install locust
- Start Minikube:
minikube start
- Deploy Kafka and Spark using Helm:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install kafka bitnami/kafka
helm install spark bitnami/spark
- Verify Deployment:
kubectl get all
- Build the Docker Image:
docker build -t microservice .
- Run the Docker Container:
docker run -p 5000:5000 microservice
- Test the Flask API:
curl -X POST http://localhost:5000/process \
  -H "Content-Type: application/json" \
  -d '[{"id": 1, "gender": "M", "salary": 5000}, {"id": 2, "gender": "F", "salary": 6000}]'
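For reference, the `/process` endpoint might be shaped roughly like the sketch below. This is illustrative only; the actual implementation, including its response format, is in `src/app.py`.

```python
# Illustrative /process endpoint; the real implementation is src/app.py
# and its response shape may differ.
from flask import Flask, jsonify, request
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app = Flask(__name__)
spark = SparkSession.builder.appName("microservice").getOrCreate()

@app.route("/process", methods=["POST"])
def process():
    records = request.get_json()         # list of employee dicts
    df = spark.createDataFrame(records)  # one row per record
    avg_salary = df.agg(F.avg("salary")).first()[0]
    gender_counts = {
        row["gender"]: row["count"]
        for row in df.groupBy("gender").count().collect()
    }
    return jsonify({"average_salary": avg_salary, "gender_counts": gender_counts})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```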
Test Details:
- Users: 100 concurrent users.
- Ramp-up Rate: 10 users per second.
- Test Duration: 1 minute.
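These parameters map directly onto a small locustfile. The sketch below mirrors the payload from the curl example and is an assumption about the shape of `src/load_test.py`, not a copy of it.

```python
# Locust load-test sketch matching the parameters above (100 users,
# spawn rate 10/s, 1 minute). The actual script is src/load_test.py.
# Run with:
#   locust -f load_test.py --host http://localhost:5000 -u 100 -r 10 -t 1m
from locust import HttpUser, between, task

class MicroserviceUser(HttpUser):
    wait_time = between(1, 2)  # each simulated user pauses 1-2s between calls

    @task
    def process(self):
        self.client.post("/process", json=[
            {"id": 1, "gender": "M", "salary": 5000},
            {"id": 2, "gender": "F", "salary": 6000},
        ])
```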
Results:
- Total Requests: 634
- Average Latency: 8091ms
- Minimum Latency: 4912ms
- Maximum Latency: 16845ms
- Median Latency: 7600ms
Key Issues Identified:
- High Latency: The average latency remains above 8000ms, with some requests exceeding 16000ms.
- Limited Throughput: The system failed to achieve the target of 10,000 requests per second. The peak throughput was approximately 15.6 requests/second.
Potential Bottlenecks:
- Spark Job Overhead: Processing large JSON objects in Spark introduces additional latency.
- Kafka Configuration: Suboptimal broker settings may limit message throughput.
- Flask API: Single-threaded Flask may not handle high concurrency effectively.
Future Improvements:
- Optimize Spark jobs by caching intermediate results or partitioning data more effectively.
- Tune Kafka broker and topic configurations (e.g., increase partition counts so consumers can process messages in parallel).
- Replace Flask with a more scalable framework like FastAPI or deploy Flask with Gunicorn.
- Implement distributed caching (e.g., Redis) to reduce repeated computations.
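As one possible shape for the caching idea, the hypothetical sketch below keys results by a hash of the request payload. Redis is a proposed improvement here, not part of the current codebase.

```python
# Hypothetical Redis cache for repeated /process payloads (Redis is a
# proposed improvement, not part of the current codebase).
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_process(payload, compute, ttl=300):
    """Return a cached result for an identical payload, computing on a miss."""
    key = "process:" + hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)        # cache hit: skip the Spark job
    result = compute(payload)         # cache miss: run the computation
    cache.set(key, json.dumps(result), ex=ttl)
    return result
```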
This project utilizes the Serverless Framework for managing cloud infrastructure resources. The framework simplifies the provisioning of services like Kafka topics and AWS Lambda functions, ensuring consistent deployments.
| Metric | Value |
|---|---|
| Total Requests | 634 |
| Average Latency | 8091ms |
| Minimum Latency | 4912ms |
| Maximum Latency | 16845ms |
| Median Latency | 7600ms |
| Error Rate | 0% (0 failures) |
Click here to view the demo video.
- GitHub Copilot:
  - Suggested efficient ways to integrate Spark and Kafka.
  - Streamlined the development of RESTful API endpoints.
- TabNine:
  - Provided intelligent autocompletions for Spark transformations and Flask code.
  - Enhanced the quality of code structure and SQL-like operations.
| Feature/Code Component | Rubric Requirement |
|---|---|
| Flask API (src/app.py) | Microservice Implementation |
| Dockerfile | Containerization with Distroless |
| Logging (src/app.py) | Use of Logging |
| Locust Load Testing | Successful Load Test |
| Spark Scripts (e.g., count_gender.py) | Data Engineering |
| Serverless Framework | Infrastructure as Code |
| GitHub Actions Workflow (.github/workflows/ci.yml) | CI/CD Setup |
| README.md | Comprehensive Documentation |
| Architectural Diagram | Clear Architecture Representation |
| .devcontainer Config | GitHub/GitLab Configurations |
project-root/
│
├── .devcontainer/              # Docker configuration for GitHub Codespaces
│   ├── Dockerfile
│   ├── devcontainer.json
│
├── src/                        # Source code for microservices
│   ├── app.py                  # Flask API for data processing
│   ├── load_test.py            # Locust script for load testing
│   ├── count_gender.py         # Spark job: Count records by gender
│   ├── count_month.py          # Spark job: Count records by birth month
│   ├── final_analysis.py       # Spark job: Comprehensive analysis
│   ├── read_emp_data.py        # Spark job: Read employee data
│   ├── read_json_from_topic.py # Spark job: Read data from Kafka
│
├── data/                       # Sample data for testing
│   ├── emp_data.json
│   ├── json_topic.json
│
├── images/                     # Helm chart specifications for Kafka and Spark
│   ├── kafka.yml
│   ├── spark.yml
│
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── Commands.md                 # Commands documentation