Production-grade ETL pipeline for cardiovascular risk analytics with HIPAA-compliant data processing, automated quality validation, and enterprise BI dashboards.

Fshahnaj/CardioInsight-AI

CardioInsight-AI: Clinical Risk Analytics & Dashboard Platform 🩺📊

Enterprise-grade healthcare analytics platform for cardiovascular risk assessment, built on a Kaggle dataset of ~70,000 patient records.

CardioInsight-AI demonstrates production-style:

  • Data engineering
  • Data quality & validation
  • ML modeling
  • BI dashboarding

All in one cohesive project:

  • HIPAA-style de-identification and clinical feature engineering
  • 20+ data quality checks with JSON + HTML reports
  • dbt + DuckDB star schema and marts for analytics
  • ML pipeline (Logistic Regression + Random Forest), ROC-AUC ≈ 0.79
  • Power BI dashboard with population insights & patient-level risk drilldown

This system is modeled on workflows typical of hospital and health-system analytics teams (for example, Duke Health, UNC Health, CVS Health/Optum, and Mayo Clinic).


1. Project Overview

Business Question

Can we build an end-to-end platform that turns raw cardiovascular measurements into high-quality, explainable risk insights for clinicians and decision makers?

CardioInsight-AI answers this by:

  • Ingesting and de-identifying a publicly available cardiovascular dataset
  • Cleaning, validating, and transforming it into a star schema analytics mart
  • Training ML models to predict cardiovascular events
  • Delivering insights through a two-page Power BI report:
    • Page 1: Population Risk Overview & Segment Analysis
    • Page 2: Patient Risk Explorer & Clinical Drilldown

2. Architecture (Narrative Details)

2.1 Raw Data → De-identification (Python)

  • Read the raw Kaggle file (cardio_train.csv, saved locally as data/raw/cardio_raw_data.csv; ~70K rows)
  • Remove direct identifiers
  • Create patient_id, age_years, age_band, bmi, bmi_band
  • Engineer features:
    • Pulse pressure
    • Hypertension flag
    • Cholesterol/glucose categories
  • Output → data/lake/cardio_deid_data.csv
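
As a rough illustration of the steps above, the sketch below derives the listed fields for a single record using only the standard library. The raw column names (age in days, height in cm, weight in kg, ap_hi, ap_lo, cholesterol, gluc), the band cut-offs, the 140/90 hypertension threshold, and the helper names are assumptions made for this example; the actual logic lives in etl/hipaa_de_identification.py and may differ.

```python
from typing import Any, Dict

# Labels for the 1/2/3 category codes (assumed encoding of the raw file).
LEVELS = {1: "Normal", 2: "Above normal", 3: "Well above normal"}


def age_band(age_years: int) -> str:
    """Bucket age into illustrative decade bands."""
    if age_years < 40:
        return "<40"
    if age_years < 50:
        return "40-49"
    if age_years < 60:
        return "50-59"
    return "60+"


def bmi_band(bmi: float) -> str:
    """Bucket BMI using common WHO-style cut-offs."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def deidentify(row: Dict[str, Any], patient_id: int) -> Dict[str, Any]:
    """Build a de-identified record; the source row id is not carried over."""
    age_years = int(row["age"] / 365.25)  # raw age assumed to be in days
    bmi = round(row["weight"] / (row["height"] / 100) ** 2, 1)  # kg / m^2
    return {
        "patient_id": patient_id,  # surrogate key, unrelated to the raw id
        "age_years": age_years,
        "age_band": age_band(age_years),
        "bmi": bmi,
        "bmi_band": bmi_band(bmi),
        "ap_hi": row["ap_hi"],
        "ap_lo": row["ap_lo"],
        "pulse_pressure": row["ap_hi"] - row["ap_lo"],
        # Assumed threshold: systolic >= 140 or diastolic >= 90.
        "hypertension_flag": row["ap_hi"] >= 140 or row["ap_lo"] >= 90,
        "cholesterol_category": LEVELS[row["cholesterol"]],
        "glucose_category": LEVELS[row["gluc"]],
    }
```

Calling deidentify(raw_row, patient_id=1) on each parsed CSV row would yield records ready to be written to the data lake file.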

2.2 Data Quality & Validation

  • Script: data_quality/dq_validators.py
  • Runs:
    • Missingness checks
    • Clinical range checks (age, BP, BMI, height, weight)
    • Logical consistency (ap_hi ≥ ap_lo)
    • Uniqueness of patient_id
  • Outputs:
    • data/quality_reports/dq_report.json
    • data/quality_reports/dq_report.html
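
The check families listed above can be sketched as follows. This is a minimal, standard-library illustration and not the contents of data_quality/dq_validators.py: the RANGES bounds, the check names, and the report layout are all hypothetical.

```python
import json
from typing import Any, Dict, List

# Plausible clinical ranges per column (illustrative assumptions only).
RANGES = {
    "age_years": (18, 100),
    "ap_hi": (70, 250),
    "ap_lo": (40, 150),
    "bmi": (10, 70),
    "height": (120, 220),
    "weight": (30, 250),
}


def run_checks(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Run missingness, range, consistency, and uniqueness checks."""
    results = []

    def record(name: str, failed: int) -> None:
        results.append({"check": name, "failed_rows": failed, "passed": failed == 0})

    # Missingness: count rows where the column is absent or None.
    for col in RANGES:
        record(f"not_missing:{col}", sum(1 for r in rows if r.get(col) is None))

    # Clinical range: only evaluated where a value is present.
    for col, (lo, hi) in RANGES.items():
        record(
            f"in_range:{col}",
            sum(1 for r in rows if r.get(col) is not None and not lo <= r[col] <= hi),
        )

    # Logical consistency: systolic must not be below diastolic.
    record(
        "ap_hi_ge_ap_lo",
        sum(
            1
            for r in rows
            if r.get("ap_hi") is not None
            and r.get("ap_lo") is not None
            and r["ap_hi"] < r["ap_lo"]
        ),
    )

    # Uniqueness: number of surplus patient_id values.
    ids = [r.get("patient_id") for r in rows]
    record("unique:patient_id", len(ids) - len(set(ids)))

    return {
        "row_count": len(rows),
        "checks": results,
        "all_passed": all(c["passed"] for c in results),
    }


def to_json(report: Dict[str, Any]) -> str:
    """Serialize the report so it can be written to a .json file."""
    return json.dumps(report, indent=2)
```

Keeping each check as a named entry with a failure count makes the same structure easy to render as either JSON or an HTML table.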

2.3 Analytics Warehouse (dbt + DuckDB)

  • Warehouse: cardio_warehouse.duckdb
  • dbt models:
    • Staging model → stg_cardioinsight
    • Mart → mart_cardio_risk
  • dbt tests enforce:
    • Not-null constraints
    • Accepted values
    • Uniqueness of patient_id

2.4 Machine Learning Layer

  • Script: ml/models/ml_pipeline.py
  • Loads mart_cardio_risk
  • Performs train/test split
  • Trains two models:
    • Logistic Regression
    • Random Forest
  • Logistic Regression ROC-AUC ≈ 0.79
  • Saves model artifacts to ml/models/artifacts/
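
To make the ROC-AUC figure concrete, the sketch below computes the metric from labels and predicted probabilities in plain Python. It uses the pairwise definition: the share of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The function name is hypothetical and this is not the pipeline's code; scikit-learn's roc_auc_score computes the same quantity far more efficiently.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Read this way, an AUC near 0.79 means that for roughly 79% of such pairs the model assigns the higher score to the patient who had the cardiovascular event. The double loop is quadratic, so it is only suitable for small samples.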

2.5 Analytics & BI Layer

  • Script: data/exports/export_mart_to_csv.py
  • Exports BI-ready file:
    • data/processed/mart_cardio_risk.csv
  • Power BI consumes this CSV to power:
    • KPI tiles
    • CVD funnel
    • Segment analyses
    • ML-integrated patient risk explorer

3. System Architecture Diagram

Raw CSV Data: Kaggle dataset (~70K)
➡️ Data Lake: de-identified, cleaned data (Python ETL)
➡️ DuckDB Warehouse: star schema marts (dbt models)
➡️ ML Models: Logistic Regression & Random Forest
➡️ Insights Layer: Power BI KPIs, funnel, patient explorer

4. Tech Stack

Languages & Tools

  • Python (pandas, numpy, scikit-learn, duckdb)
  • dbt-core + dbt-duckdb
  • Power BI (DAX, M)
  • Git, virtualenv

Key Concepts

  • HIPAA de-identification
  • Data quality validation
  • Star schema modeling
  • ML classification modeling
  • BI visual analytics (clinical context)

5. Repository Structure

CardioInsight-AI/
├─ etl/
│  ├─ hipaa_de_identification.py
│  └─ build_warehouse.py
│
├─ data/
│  ├─ raw/
│  ├─ lake/
│  ├─ warehouse/
│  └─ exports/export_mart_to_csv.py
│
├─ data_quality/
│  ├─ dq_validators.py
│  └─ quality_reports/
│
├─ cardioinsight_dbt/
│  ├─ dbt_project.yml
│  ├─ models/staging/
│  │   └─ stg_cardioinsight.sql
│  └─ models/marts/
│      └─ mart_cardio_risk.sql
│
├─ ml/models/
│  ├─ ml_pipeline.py
│  └─ artifacts/
│
├─ dashboards/
│  └─ CardioInsight-AI.pbix
│
├─ requirements.txt
└─ README.md

6. How to Run the Project Locally

6.1 Prerequisites

  • Python 3.10+
  • pip (a conda or virtualenv environment is optional)
  • Power BI Desktop
  • Git (optional)

6.2 Setup

# Clone the repo
git clone https://github.com/<your-username>/CardioInsight-AI.git
cd CardioInsight-AI

# Create and activate a virtual env (recommended)
python -m venv .venv

# Windows:
.venv\Scripts\activate

# macOS/Linux:
# source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Download the Kaggle dataset (cardio_train.csv) and save it as:

data/raw/cardio_raw_data.csv

6.3 Run ETL + Data Quality

python etl/hipaa_de_identification.py
python etl/build_warehouse.py
python data_quality/dq_validators.py

After this, check:

  • data/lake/cardio_deid_data.csv
  • data/quality_reports/dq_report.json
  • data/quality_reports/dq_report.html

6.4 Run dbt Models

cd cardioinsight_dbt

dbt debug
dbt build --full-refresh

# Or run specific models:
dbt run --select stg_cardioinsight
dbt run --select mart_cardio_risk

6.5 Train ML Models

cd ..
python ml/models/ml_pipeline.py

This will:

  • Load mart_cardio_risk
  • Train Logistic Regression & Random Forest
  • Print ROC-AUC and classification metrics
  • Save model artifacts

6.6 Export Data for Power BI

python data/exports/export_mart_to_csv.py

This writes the BI-ready file:

data/processed/mart_cardio_risk.csv

🧠 Machine Learning Results

  • Logistic Regression AUC: ~0.79
  • Random Forest AUC: ~0.77
  • Best model (Logistic Regression) integrated into Power BI
  • Patient-level predictions include expected vs actual clinical values

📊 CardioInsight-AI: Power BI Dashboard

Clinical Analytics & Patient Risk Explorer

This dashboard visualizes cardiovascular risk insights using:

  • Cleaned & feature-engineered dataset
  • dbt-built clinical risk mart
  • Logistic Regression ML model
  • Patient-level ML risk drilldown
  • Population-level epidemiological patterns

📄 Page 1: Population Cardiovascular Insights

  • KPI Tiles: Total Patients, CVD Risk %, High-Risk Patients
  • CVD Risk Funnel: Population → CVD Events → Hypertension → High Cholesterol
  • BMI Band distribution
  • Age Band donut
  • Cholesterol & Glucose stacked bars
  • Hypertension distribution by age
  • Filter panel:
    • Age Band
    • BMI Band
    • Cholesterol Category
    • Glucose Category
    • Smoking Status
    • Alcohol Use
    • Activity Level

📄 Page 2: Patient Risk Explorer

  • Patient selector (drop-down)
  • Patient profile panel
  • Clinical indicators:
    • Hypertension Status
    • ML Risk Flag
    • CVD Observed
    • Pulse Pressure
  • ML Output:
    • Predicted CVD Probability (Gauge)
  • Patient vs Population Comparisons:
    • Systolic BP
    • Diastolic BP
    • BMI

🧪 ML Integration

The dashboard uses ML predictions generated by the Python pipeline:

  • Logistic Regression probability
  • High/Moderate/Low Risk classification
  • Combined with clinical thresholds (BP, cholesterol, BMI) → strong, explainable indicators
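
One way the probability-to-tier mapping and the clinical flags could be combined is sketched below. The cut-offs (0.4 and 0.7 for the tiers, 140/90 for blood pressure, code 3 for high cholesterol, BMI 30 for obesity) and the function names are assumptions chosen for illustration; the README does not state the thresholds the dashboard actually uses.

```python
from typing import List


def risk_tier(probability: float, moderate_at: float = 0.4, high_at: float = 0.7) -> str:
    """Map a predicted CVD probability to Low / Moderate / High."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_at:
        return "High"
    if probability >= moderate_at:
        return "Moderate"
    return "Low"


def clinical_flags(ap_hi: int, ap_lo: int, cholesterol_code: int, bmi: float) -> List[str]:
    """List the threshold-based indicators that apply to a patient."""
    flags = []
    if ap_hi >= 140 or ap_lo >= 90:   # assumed hypertension threshold
        flags.append("hypertension")
    if cholesterol_code >= 3:         # assumed: 3 = well above normal
        flags.append("high_cholesterol")
    if bmi >= 30:                     # assumed obesity cut-off
        flags.append("obesity")
    return flags
```

Showing the tier next to the flags that fired lets a reviewer see which clinical factors accompany a given model score, which is the explainability the dashboard aims for.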

📊 Live Dashboard

👉 View the Interactive Power BI Dashboard

🩺 Why This Project Matters

This platform demonstrates:

  • Real-world data engineering
  • Healthcare-grade data cleaning & validation
  • ML model development & deployment
  • BI storytelling with clinical insights
  • Full end-to-end architecture

Aligned with roles in:

  • Healthcare Analytics
  • Biotech
  • Data Engineering
  • Machine Learning Engineering

📬 Contact

Email: shahnajfujaila@gmail.com

LinkedIn: Fujaila-Shahnaj

Location: Raleigh–Durham–Cary, NC

Skills: Power BI • Data Engineering • ML/NLP • Python • dbt • DuckDB
