This project uses the Olist dataset to analyze sales and customer behavior from a Brazilian e-commerce platform
The Olist Dataset is an open source database from a Brazilian e-commerce platform that connects small businesses with customers from all over the country. It contains data on customer orders, products, sellers, and reviews, offering a complete view of the shopping experience.
The following image is showing how the different tables are related.
-
Orders: Information about each order, including:
order_id
: Unique identifier for each order.order_purchase_timestamp
: The date and time when the order was placed.order_delivered_customer_date
: Delivery date of the order.
-
Order Items: Details on the products within each order:
order_id
: Order identifier linking to the Orders tableproduct_id:
Product identifierprice
: Price of each product within the orderfreight_value
: Cost of freight for each product
-
Customers: Customer demographic information:
customer_id
: Unique customer identifiercustomer_city
: City of the customercustomer_state
: State of the customer
-
Products: Product details:
product_id
: Unique identifier for each productproduct_category_name
: Category of the product
-
Reviews: Customer feedback on orders:
order_id
: Order identifier linking to the Orders tablereview_score
: Rating given by the customer (1-5)
-
Sellers: Information about the sellers on the platform:
seller_id
: Unique identifier for each sellerseller_city
, seller_state: Seller location data
The Olist dataset can be accessed on Kaggle here
In order to facilitate the analysis, the first step is clean and prepare the data to ensure that:
- The data is consistent and easy to analyze.
- Only relevant columns are retained for improved performance.
- Key fields for grouping and analysis are standardized and cleaned.
-
Orders Table:
- Changed the data type of purchase date columns from
timestamp
todate
. - Added separate columns for purchase month and purchase year to facilitate time-based analysis.
- Changed the data type of purchase date columns from
-
Order Items Table:
- Created a new column for the total value of each order (
price + freight_value
).
- Created a new column for the total value of each order (
-
Products Table:
- Created a cleaned version of the table containing:
product_id
and category names in English, replacingNULL
values with valid translations in order to ensure consistency in category names.
- Excluded irrelevant columns such as
product_name_length
,product_description_length
, and product dimensions, as they are not pertinent to this analysis. This optimizes the dataset for faster processing.
- Created a cleaned version of the table containing:
-
Customers Table:
- Added a new column for the region of the customer, based on the state. Simplified analysis by grouping states into the 5 geographical regions of Brazil.
For the full SQL queries used in this process, see Data Cleaning Queries
The exploratory analysis focuses on understanding the dataset through key metrics and visualizations. The following analyses were conducted:
-
Total Sales:
- Measured by total revenue, items sold and number of orders placed. This provides an overview of the business.
-
Average Order Value:
- Calculated the average value of each order to understand customer spending patterns.
-
Total Sales per Month and Year:
- Analyzed sales trends over time by breaking down total revenue by month and year.
-
Customer Distribution:
- Explored customer distribution by region and state to identify geographic trends.
-
Orders by Payment Type:
- Examined the distribution of orders based on the payment method used.
-
Orders by Status:
- Analyzed the count of orders by status, such as delivered, shipped, or canceled.
Each analysis provides insights into customer behavior, sales trends, and operational performance.
SQL queries used for this part of the analysis Here