This project implements a complex SQL query on a MySQL database involving joins, aggregations, and sorting, as specified by the assignment. The repository includes all necessary files to set up and run the project, as well as explanations for each component and the query functionality.
- `.devcontainer/`: Contains configuration files for a development container.
  - `Dockerfile`: Defines the container environment.
  - `devcontainer.json`: Configures development container settings.
- `.github/workflows/`: Holds the configuration file for CI/CD.
  - `cicd.yml`: Sets up the CI/CD pipeline to test the SQL query and validate project requirements.
- `data/`: Folder containing sample CSV files used for testing and development.
  - `GroceryDB_lgFPro.csv`: Sample data file for the grocery database.
  - `employees.csv`: Sample data file for employee information.
- `mylib/`: Custom Python library for handling data extraction, querying, and transformation (a usage sketch follows this file list).
  - `__init__.py`: Initializes the `mylib` package.
  - `extract.py`: Handles data extraction logic.
  - `query.py`: Contains the complex SQL query.
  - `transform_load.py`: Processes data transformations and loads the results into the database.
- Root files:
  - `.gitignore`: Specifies files and directories ignored by Git.
  - `Dockerfile`: Builds the project environment.
  - `LICENSE`: Project license file.
  - `Makefile`: Defines commands for setting up, testing, and running the project.
  - `README.md`: Project documentation (this file).
  - `main.py`: Main script to run the project.
  - `requirements.txt`: Lists required Python packages.
  - `setup.sh`: Script for setting up the environment.
  - `test_main.py`: Contains tests for the project components.
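A rough sketch of how the `mylib/` modules might fit together end to end. The function names `extract`, `load`, and `query` below are assumptions for illustration only; check the modules themselves for the actual interfaces:

```python
# Hypothetical end-to-end flow; the imported function names are assumptions,
# not the confirmed signatures in mylib/.
from mylib.extract import extract          # locate or download the raw CSV files
from mylib.transform_load import load      # transform the CSVs and load them into MySQL
from mylib.query import query              # run the complex join/aggregate/sort query

if __name__ == "__main__":
    extract()
    load()
    for row in query():
        print(row)
```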
To run this project, you need:
- Docker
- Python 3.8+
- A MySQL database instance (or another SQL/NoSQL backend such as DynamoDB, Databricks, or Neo4j if you need to swap databases).
Install the dependencies by running `pip install -r requirements.txt`.
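If you want to confirm that the database is reachable before loading any data, a minimal connectivity check along these lines can help. This assumes a MySQL driver such as `mysql-connector-python` is among the requirements; the connection parameters are placeholders:

```python
# Minimal MySQL connectivity check; host, user, password, and database
# are placeholders to replace with your own settings.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="your_user",
    password="your_password",
    database="your_database",
)
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print("Connected to MySQL", cursor.fetchone()[0])
cursor.close()
conn.close()
```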
The complex SQL query is located in `mylib/query.py`. It involves:

- Joins: To connect data from multiple tables, such as grocery data and employee information.
- Aggregations: Using functions like `SUM`, `AVG`, and `COUNT` to get summarized insights.
- Sorting: Ordering the results on specified columns to present the data meaningfully.
This query retrieves information on employee sales performance, total sales per product category, and average sales by region, sorted by the highest-performing categories.
The query is designed to provide insights into:
- Top-performing products and categories based on total sales.
- Sales performance by employee to identify the most effective team members.
- Regional sales trends to guide marketing and logistics decisions.
The query output will include:
- Product categories with the highest total sales.
- Individual employee sales records ranked by performance.
- Summary of average sales across different regions.
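As a purely illustrative sketch (the real table and column names live in `mylib/query.py`; the schema below is an assumption), a query combining these joins, aggregations, and sorting might be structured like this:

```python
# Illustrative only: the table and column names are assumptions, not the
# project's actual schema. The query joins sales to employee and grocery
# data, aggregates with SUM/AVG/COUNT, and sorts by total sales.
COMPLEX_QUERY = """
SELECT e.region,
       g.category,
       e.employee_name,
       SUM(s.amount) AS total_sales,
       AVG(s.amount) AS avg_sale,
       COUNT(s.id)   AS num_sales
FROM sales AS s
JOIN employees AS e ON s.employee_id = e.id
JOIN grocery   AS g ON s.product_id  = g.id
GROUP BY e.region, g.category, e.employee_name
ORDER BY total_sales DESC;
"""
```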
For more details on how the query is structured, refer to `mylib/query.py`.
The CI/CD pipeline is configured using GitHub Actions and can be found in `.github/workflows/cicd.yml`. This pipeline automatically:
- Tests the SQL query for functionality and performance.
- Validates data loading and extraction scripts.
- Ensures code formatting and documentation standards are met.
- Set up the database: Populate your MySQL database (or the chosen alternative) with the CSV files provided in the `data` folder.
- Run the query: Execute `python main.py` to run the query and view the results.
- Testing: Run the tests with `pytest test_main.py` (a sketch of such a test follows below).
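For reference, a test in the spirit of `test_main.py` might simply check that the query executes and returns rows. The `query()` helper imported below is an assumption about `mylib`'s interface, not its confirmed API:

```python
# Sketch of a test in the style of test_main.py; the query() helper and its
# return value (a list of result rows) are assumptions about mylib's interface.
from mylib.query import query


def test_query_returns_rows():
    rows = query()
    assert rows is not None
    assert len(rows) > 0  # the complex query should return at least one result row
```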
- SQL query: Located in `mylib/query.py`.
- Explanation: Descriptions provided in this README and in inline comments in `query.py`.
- CI/CD pipeline: Configured in `.github/workflows/cicd.yml`.
- Documentation: This `README.md`.