Use DuckDB and SQL to explore a star-schema data warehouse built from the realistic Microsoft Contoso Sales dataset - includes a Python Jupyter notebook for visualization and presentation
This repo is an extension to the DuckDB / SQL / CSV project that only differs by adding Python. For information about installing DuckDB and getting started, see smart-sales-duckdb-sql.
We add Python so we we can present our insights in a Jupyter notebook.
- Log into GitHub. Click on this repo and Copy the template to get it into your GitHub account.
- Click on your new GitHub repo and clone it down.
- Open the repo on your machine in VS Code.
See pro-analytics-01/ for more information. See 01-machine-setup and 02-Project-Initialization. Make sure all the tools, including Python are installed. Follow the steps to create, activate, and install required packages into your local project virtual environment (named .venv). Key steps are summarized in this requirements.txt file, follow steps ABCD.
PowerShell (Windows or Core) - Steps ABC - click Yes when asked
py -m venv .venv
.\.venv\Scripts\activate
py -m pip install --upgrade pip setuptools wheel
py -m pip install --upgrade -r requirements.txt
Mac/Linux/bash/zsh/Git Bash/WSL - Steps ABC - click Yes when asked
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install --upgrade -r requirements.txt
IMPORTANT: You must also do Step D: VS Code - Select Interpreter. See requirements.txt for more.
IMPORTANT: Restart VS Code if it was open while installing DuckDB so it gets the updated system environment.
With the project repository open in VS Code, open a new terminal using the VS Code menu: Terminal / New Terminal.
By default, this opens PowerShell on Windows and zsh or bash on macOS or Linux.
If you're on Windows, you can also select Git Bash from the dropdown if it's installed.
Mac/Linux/bash/zsh/Git Bash/WSL
duckdb dw/sales.duckdb < sql/00_build_dw.sql
PowerShell
Get-Content sql/00_build_dw.sql | duckdb dw/sales.duckdb
Use your terminal to open the duckdb file.
The following command works in all terminal types (thank you DuckDB!)
duckdb dw/sales.duckdb
To exit the DuckDB CLI, either type EITHER one of these or try CTRL D or CTRL Z:
.quit
.exit
First, open the DuckDB CLI and then type the SQL command.
The following command works in all terminal types (thank you DuckDB!)
duckdb dw/sales.duckdb
Once in the CLI, at the D
prompt, type your SQL commands one at a time.
For example, try:
SHOW TABLES;
To execute a saved SQL file from a VS Code terminal (NOT the DuckDB CLI):
duckdb dw/sales.duckdb < sql/11_tables_show_list.sql
In PowerShell use:
Get-Content sql/11_tables_show_list.sql | duckdb dw/sales.duckdb
Source: Kaggle (448 MB Full Clean Dataset): https://www.kaggle.com/datasets/bhanuthakurr/cleaned-contoso-dataset
See: schema.txt to get an overview of the data warehouse.