This project is an end-to-end data analysis workflow using Walmart sales data. It includes data acquisition, cleaning, transformation, and advanced SQL analysis using PostgreSQL and Python. This project is ideal for data analysts aiming to enhance their skills in PostgreSQL querying, data manipulation with Python, and analytical problem-solving.
- Tools: Visual Studio Code, Python, PostgreSQL
- Task: Create a clean and structured workspace with organized folders for scripts, notebooks, and data files.
-
Steps:
-
Download your API token from Kaggle.
-
Save
kaggle.jsonin your system’s.kagglefolder. -
Use the command:
kaggle datasets download -d
-
- Source: Walmart Sales Dataset
- Storage: Save the dataset inside the
data/folder.
Install Python libraries:
pip install pandas numpy sqlalchemy psycopg2
Load the dataset into a Pandas DataFrame for further processing.
Use .info(), .describe(), and .head() to understand the structure, data types, and potential issues in the dataset.
- Remove Duplicates: Eliminate duplicate entries.
- Handle Missing Values: Fill or drop based on context.
- Fix Data Types: Convert to appropriate formats (
datetime,float, etc.). - Format Currency: Clean up currency fields using
.replace()and other string operations.
-
Add a new column:
python code - df["total_amount"] = df["unit_price"] * df["quantity"]
-
This makes revenue analysis and SQL aggregations easier.
- Use
SQLAlchemyandpsycopg2to connect to PostgreSQL. - Create a new table and insert the cleaned data.
- Run verification queries to confirm data integrity.
Use PostgreSQL to write and execute SQL queries that answer business questions such as:
- Revenue trends by branch and product category.
- Top-performing cities and payment methods.
- Customer behavior patterns and peak shopping times.
- Profit margins by location and category.
-
Document the process using Markdown and/or Jupyter Notebooks.
-
Publish to GitHub with the following:
README.md- SQL query scripts
- Notebooks and/or
.pyscripts - Instructions to obtain and set up the data
-
Python: 3.8+
-
Database: PostgreSQL
-
Libraries:
pandas,numpy,sqlalchemy,psycopg2
-
Kaggle API key (for data download)
|-- data/ # Raw and cleaned data files
|-- sql_queries/ # PostgreSQL scripts
|-- notebooks/ # Jupyter/Python analysis notebooks
|-- main.py # Script for loading and processing data
|-- requirements.txt # Python dependencies
|-- README.md # Project documentation
- Top Branches: Identify branches with highest revenue and best-selling categories.
- Customer Preferences: Most used payment methods and peak buying times.
- Profitability: Evaluate which categories and cities are most profitable.
- Add interactive dashboards using Power BI or Tableau.
- Automate the pipeline for live data ingestion.
- Integrate external data sources (e.g., customer feedback or inventory data).
- Dataset: Walmart Sales Dataset from Kaggle
- Inspired by: Real-world business analytics use cases in retail