A real-time data processing pipeline built with Apache Flink that analyzes factory sensor data. This project handles both streaming and batch processing of industrial machine data, with PostgreSQL for storage and Docker for easy deployment.
This project processes sensor data from factory machines in real-time. I built it to learn Apache Flink and explore how to handle both streaming data (live sensor readings) and batch data (historical analysis).
The system simulates factory sensor data and runs two types of analysis:
- Stream processing: Analyzes data as it comes in, detecting anomalies in real-time
- Batch processing: Processes historical data for trends and insights
Everything runs in Docker containers, so you can spin up the entire infrastructure with one command.
- Flink cluster: 1 JobManager + 3 TaskManagers for distributed processing
- PostgreSQL: Stores the processed results
- Data generator: Creates realistic sensor data for testing
- PyFlink jobs: The actual data processing code
- Processes sensor data in real-time as it streams in
- Handles batch processing for historical data analysis
- Detects machine faults based on temperature, vibration, and sound thresholds
- Scales across multiple Flink task managers
- Generates realistic factory sensor data for testing
- Stores results in PostgreSQL for further analysis
- Everything containerized with Docker
The data generator sends sensor readings through a socket connection to Flink. The streaming job continuously aggregates data by machine type and flags potential issues when temperature > 75°C, vibration > 20mm/s, or sound > 80dB.
For historical analysis, the system generates a large CSV file with sensor data, then processes it in batch mode to calculate averages and detect patterns across different machine types.
Both pipelines save their results to PostgreSQL tables for analysis.
- Apache Flink 1.18.1 - Stream and batch processing
- PyFlink - Python API for Flink
- PostgreSQL 15 - Database for results
- Docker & Docker Compose - Containerization
- Python 3.9+ - Data generation and processing
- Pandas & NumPy - Data manipulation
- Docker Desktop or Docker Engine
- Docker Compose
- At least 8GB RAM (recommended)
- About 10GB free disk space
git clone https://github.com/MbarekTech/Flink-sensor-processing.git
cd Flink-sensor-processing
docker-compose up -d
This starts:
- Flink Job Manager (Web UI at http://localhost:8081)
- 3 Flink Task Managers
- PostgreSQL database (port 5435)
- PyFlink client container
- Data generator services
docker exec -it flink-client-pyflink bash
python /flink-job/streaming_sensor_analyzer.py
# In the same container
python /flink-job/batch_sensor_analyzer.py
- Flink Web UI: http://localhost:8081
- Database: localhost:5435, database: flink_db, user: flink_user, password: flink_password
Flink-sensor-processing/
├── docker-compose.yml # Sets up all the containers
├── Dockerfile.client # PyFlink client environment
├── Dockerfile.generator # Data generator environment
├── requirements.txt # Python dependencies
├── simulate_factory_data.py # Generates sensor data
├── demo.ps1 # PowerShell demo script
├── flink-job/ # Processing jobs
│ ├── streaming_sensor_analyzer.py # Real-time processing
│ └── batch_sensor_analyzer.py # Batch processing
├── data/
│ └── factory_sensor_simulator_2040.csv # Template data
├── unused/ # Old files
└── README.md
You can change these environment variables:
Variable | Default | What it does |
---|---|---|
FLINK_HOST |
jobmanager |
Where Flink is running |
FLINK_PORT |
9000 |
Port for streaming data |
MODE |
stream |
Generator mode (stream/batch) |
The system tracks these metrics from factory machines:
- Machine ID and type
- Operating hours, temperature, vibration levels
- Sound levels, power consumption
- Maintenance history and error counts
This could be useful for:
- Detecting machine problems before they cause downtime
- Monitoring factory equipment in real-time
- Analyzing machine performance over time
- Finding patterns in production data
- Optimizing power usage across machines
The system calculates:
- Average temperature, vibration, and sound by machine type
- Power consumption patterns
- How many machines of each type are running
- Count of machines that exceed safety thresholds
Out of memory errors: Give Docker more RAM (8GB minimum)
Port conflicts: Make sure ports 8081, 5435, and 9000 aren't being used
Jobs won't start: Wait for PostgreSQL to finish starting up, then try again
Can't connect to Flink: Check if the cluster is running at http://localhost:8081
# Check if containers are running
docker-compose ps
# See what's happening
docker-compose logs flink-jobmanager
docker-compose logs postgres
# Connect to the database
docker exec -it flink-postgres psql -U flink_user -d flink_db
Feel free to open issues or submit pull requests if you find bugs or have ideas for improvements.
MIT License - see LICENSE for details.
Built with Apache Flink, PostgreSQL, and Docker. Thanks to those communities for making great tools.