This repository contains the code and documentation for a data ETL & analytics project aimed at optimizing online shopping business strategies. The project focuses on understanding customer behavior and identifying factors contributing to sales revenue during 2019.
The project works with online shopping data in CSV format. The ETL process uses Python for data manipulation and Kibana for data visualization, with the goal of generating actionable insights to improve business strategy.
- Automate the ETL process using Apache Airflow with pipelines scheduled for daily execution at 6:30 AM.
- Load data from a CSV file into PostgreSQL, and export query results from PostgreSQL back to CSV.
- Clean and preprocess data, saving the cleaned data back as a CSV file.
- Import the cleaned data into Elasticsearch for advanced querying and visualization.
- Validate the data using Great Expectations.
- Process and visualize the data using Kibana.
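The daily 6:30 AM schedule above can be sketched as an Airflow DAG. This is a minimal illustration, not the project's actual DAG file: the `dag_id`, task names, and stub callables are assumptions, and the real tasks would wire in the CSV/PostgreSQL/Elasticsearch steps listed above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables; the real project would implement the
# extract/transform/load logic described in this README.
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="online_shopping_etl",      # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="30 6 * * *",    # daily at 6:30 AM
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```

The cron expression `30 6 * * *` is what pins the pipeline to 6:30 AM each day; `catchup=False` prevents Airflow from backfilling runs for past dates.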
1. Extract
Data Collection: Gather data from online shopping activities.
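Loading the collected CSV into a relational table can be sketched as follows. To keep the example self-contained and runnable, `sqlite3` stands in for PostgreSQL (in the real pipeline a PostgreSQL driver such as `psycopg2` would be used), and the column names in the sample CSV are assumptions, not the project's actual schema.

```python
import csv
import io
import sqlite3

# Hypothetical sample of the 2019 online-shopping CSV (columns assumed).
raw_csv = """order_id,product,quantity,unit_price
1001,keyboard,2,25.50
1002,mouse,1,12.00
"""

conn = sqlite3.connect(":memory:")  # sqlite3 standing in for PostgreSQL
conn.execute(
    """CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        product TEXT,
        quantity INTEGER,
        unit_price REAL
    )"""
)

# Parse the CSV and insert each row into the table.
reader = csv.DictReader(io.StringIO(raw_csv))
rows = [
    (int(r["order_id"]), r["product"], int(r["quantity"]), float(r["unit_price"]))
    for r in reader
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```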
2. Transform
- Clean and preprocess the data.
- Validate data quality using Great Expectations.
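The cleaning and validation steps might look like the sketch below. It uses pandas with invented sample data, and the final checks are hand-rolled equivalents of what Great Expectations expectations (e.g. non-null and value-range checks) would enforce in the real pipeline.

```python
import pandas as pd

# Hypothetical raw data with common quality problems:
# a duplicated order, a missing quantity, and a negative price.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "quantity": [1.0, None, None, 5.0],
    "unit_price": [9.99, 4.50, 4.50, -1.0],
})

# Clean: drop duplicate orders, drop rows missing a quantity,
# and keep only non-negative prices.
clean = df.drop_duplicates(subset="order_id")
clean = clean.dropna(subset=["quantity"])
clean = clean[clean["unit_price"] >= 0]

# Expectation-style validation checks (Great Expectations would
# express these as expect_column_values_to_not_be_null, etc.).
assert clean["quantity"].notna().all()
assert (clean["unit_price"] >= 0).all()
assert clean["order_id"].is_unique
```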
3. Load
Import cleaned data into Elasticsearch for indexing and search capabilities.
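The Elasticsearch import step can be illustrated by building a bulk-API payload. The bulk endpoint expects newline-delimited JSON with alternating action and document lines; the index name and document fields below are assumptions. In practice this payload would be sent to the `_bulk` endpoint, or the `elasticsearch` Python client's bulk helper would build it for you.

```python
import json

# Hypothetical cleaned documents to index.
docs = [
    {"order_id": 1, "product": "keyboard", "revenue": 51.0},
    {"order_id": 2, "product": "mouse", "revenue": 12.0},
]

# The bulk API takes one action line followed by one source line per document.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "online_shopping", "_id": doc["order_id"]}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"  # bulk payloads must end with a newline
```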
4. Analyze
Visualize data using Kibana to extract meaningful insights.
5. Conclusion
Draw conclusions and provide recommendations based on the analysis.
- Apache Airflow: To create and schedule data pipelines.
- Python: For data manipulation and processing.
- PostgreSQL: As a relational database to store and manage data.
- Elasticsearch: To enable fast and scalable data retrieval.
- Kibana: For interactive data visualization and analysis.
- Great Expectations: For data validation to ensure data quality and integrity.