Enterprise-grade healthcare analytics platform for cardiovascular risk assessment, built on a Kaggle dataset of ~70,000 patient records.
CardioInsight-AI demonstrates production-style:
- Data engineering
- Data quality & validation
- ML modeling
- BI dashboarding
All in one cohesive project:
- HIPAA-style de-identification and clinical feature engineering
- 20+ data quality checks with JSON + HTML reports
- dbt + DuckDB star schema and marts for analytics
- ML pipeline (Logistic Regression + Random Forest) → ROC-AUC ≈ 0.79
- Power BI dashboard with population insights & patient-level risk drilldown
This system mirrors real-world workflows used in hospital analytics teams (Duke Health, UNC Health, CVS Health/Optum, Mayo Clinic, etc.).
Can we build an end-to-end platform that turns raw cardiovascular measurements into high-quality, explainable risk insights for clinicians and decision makers?
CardioInsight-AI answers this by:
- Ingesting and de-identifying a publicly available cardiovascular dataset
- Cleaning, validating, and transforming it into a star schema analytics mart
- Training ML models to predict cardiovascular events
- Delivering insights through a two-page Power BI report:
  - Page 1: Population Risk Overview & Segment Analysis
  - Page 2: Patient Risk Explorer & Clinical Drilldown
- Read `cardio_train.csv` (~70K rows)
- Remove direct identifiers
- Create `patient_id`, `age_years`, `age_band`, `bmi`, `bmi_band`
- Engineer features:
  - Pulse pressure
  - Hypertension flag
  - Cholesterol/glucose categories
- Output → `data/lake/cardio_deid_data.csv`
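The feature engineering steps above can be sketched roughly as follows. This is a minimal, standard-library illustration rather than the project's actual ETL code. It assumes the raw Kaggle columns `age` (in days), `height` (cm), `weight` (kg), `ap_hi`, `ap_lo`, `cholesterol`, and `gluc`. The band edges, the 140/90 hypertension cutoff, and the helper names `band` and `engineer_features` are hypothetical choices made for the example.

```python
# Assumed meaning of the dataset's 1/2/3 codes for cholesterol and glucose.
CATEGORY = {1: "Normal", 2: "Above normal", 3: "Well above normal"}


def band(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last edge


def engineer_features(row, patient_id):
    """Derive de-identified features for one raw record (a dict)."""
    age_years = int(row["age"] / 365.25)   # assumption: raw age is in days
    height_m = row["height"] / 100         # assumption: height is in cm
    bmi = round(row["weight"] / height_m ** 2, 1)
    return {
        "patient_id": patient_id,          # surrogate key, not the raw id
        "age_years": age_years,
        "age_band": band(age_years, [40, 50, 60],
                         ["<40", "40-49", "50-59", "60+"]),
        "bmi": bmi,
        "bmi_band": band(bmi, [18.5, 25, 30],
                         ["Underweight", "Normal", "Overweight", "Obese"]),
        "pulse_pressure": row["ap_hi"] - row["ap_lo"],
        # Hypothetical cutoff: systolic >= 140 or diastolic >= 90
        "hypertension_flag": int(row["ap_hi"] >= 140 or row["ap_lo"] >= 90),
        "cholesterol_category": CATEGORY[row["cholesterol"]],
        "glucose_category": CATEGORY[row["gluc"]],
    }
```

Applying the function per row while assigning sequential `patient_id` values drops the original identifier and keeps only derived, analysis-ready fields.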
- Script: `data_quality/dq_validators.py`
- Runs:
  - Missingness checks
  - Clinical range checks (age, BP, BMI, height, weight)
  - Logical consistency (`ap_hi ≥ ap_lo`)
  - Uniqueness of `patient_id`
- Outputs:
  - `data/quality_reports/dq_report.json`
  - `data/quality_reports/dq_report.html`
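To make the check categories above concrete, here is a small sketch of how such validators might be structured. It is not the contents of `dq_validators.py`. The `RANGES` limits, the function names `run_checks` and `to_json`, and the report layout are all assumptions for illustration.

```python
import json

# Hypothetical plausibility limits; the project's real thresholds may differ.
RANGES = {
    "age_years": (18, 100),
    "ap_hi": (70, 250),
    "ap_lo": (40, 150),
    "bmi": (10, 70),
}


def run_checks(rows):
    """Run missingness, range, logic, and uniqueness checks on dict records."""
    results = []

    def record(name, failed):
        results.append({"check": name, "failed_rows": failed,
                        "passed": failed == 0})

    for col, (lo, hi) in RANGES.items():
        missing = sum(1 for r in rows if r.get(col) is None)
        record(f"missing:{col}", missing)
        out_of_range = sum(
            1 for r in rows
            if r.get(col) is not None and not lo <= r[col] <= hi
        )
        record(f"range:{col}", out_of_range)

    # Logical consistency: systolic pressure should not be below diastolic.
    inverted = sum(
        1 for r in rows
        if r.get("ap_hi") is not None and r.get("ap_lo") is not None
        and r["ap_hi"] < r["ap_lo"]
    )
    record("logic:ap_hi>=ap_lo", inverted)

    # Uniqueness: count how many ids are repeats of an earlier id.
    ids = [r.get("patient_id") for r in rows]
    record("unique:patient_id", len(ids) - len(set(ids)))

    return {"total_rows": len(rows), "checks": results}


def to_json(report):
    """Serialize the report dict as it might be written to a JSON file."""
    return json.dumps(report, indent=2)
```

Keeping each check as a named entry with a failure count makes the same structure easy to render as both JSON and an HTML table.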
- Warehouse: `cardio_warehouse.duckdb`
- dbt models:
  - Staging model → `stg_cardioinsight`
  - Mart → `mart_cardio_risk`
- dbt tests enforce:
  - Not-null constraints
  - Accepted values
  - Uniqueness of `patient_id`
- Script: `ml/models/ml_pipeline.py`
- Loads `mart_cardio_risk`
- Performs train/test split
- Trains two models:
  - Logistic Regression
  - Random Forest
- Logistic Regression ROC-AUC ≈ 0.79
- Saves model artifacts to `ml/models/artifacts/`
- Script: `data/exports/export_mart_to_csv.py`
- Exports BI-ready file: `data/processed/mart_cardio_risk.csv`
- Power BI consumes this CSV to power:
  - KPI tiles
  - CVD funnel
  - Segment analyses
  - ML-integrated patient risk explorer
| Stage (in pipeline order) | Contents |
| --- | --- |
| Raw CSV Data | Kaggle dataset (~70K) |
| ➡️ Data Lake | De-identified, cleaned data (Python ETL) |
| ➡️ DuckDB Warehouse | Star schema marts (dbt models) |
| ➡️ ML Models | Logistic Regression & Random Forest |
| ➡️ Insights Layer | Power BI KPIs, funnel, patient explorer |
- Python (`pandas`, `numpy`, `scikit-learn`, `duckdb`)
- `dbt-core` + `dbt-duckdb`
- Power BI (DAX, M)
- Git, virtualenv
- HIPAA de-identification
- Data quality validation
- Star schema modeling
- ML classification modeling
- BI visual analytics (clinical context)
    CardioInsight-AI/
    ├─ etl/
    │  ├─ hipaa_de_identification.py
    │  └─ build_warehouse.py
    │
    ├─ data/
    │  ├─ raw/
    │  ├─ lake/
    │  ├─ warehouse/
    │  └─ exports/export_mart_to_csv.py
    │
    ├─ data_quality/
    │  ├─ dq_validators.py
    │  └─ quality_reports/
    │
    ├─ cardioinsight_dbt/
    │  ├─ dbt_project.yml
    │  ├─ models/staging/
    │  │  └─ stg_cardioinsight.sql
    │  ├─ models/marts/
    │  │  └─ mart_cardio_risk.sql
    │
    ├─ ml/models/
    │  ├─ ml_pipeline.py
    │  └─ artifacts/
    │
    ├─ dashboards/
    │  └─ CardioInsight-AI.pbix
    │
    ├─ requirements.txt
    └─ README.md
- Python 3.10+
- `pip` or `conda`
- Power BI Desktop
- Git (optional)
- (Optional) Conda / virtualenv
    # Clone the repo
    git clone https://github.com/<your-username>/CardioInsight-AI.git
    cd CardioInsight-AI

    # Create and activate a virtual env (recommended)
    python -m venv .venv
    # Windows:
    .venv\Scripts\activate
    # macOS/Linux:
    # source .venv/bin/activate

    # Install dependencies
    pip install -r requirements.txt
Download the Kaggle dataset and place it as:
data/raw/cardio_raw_data.csv
    python etl/hipaa_de_identification.py
    python etl/build_warehouse.py
    python data_quality/dq_validators.py
After this, check:
- `data/lake/cardio_deid_data.csv`
- `data/quality_reports/dq_report.json`
- `data/quality_reports/dq_report.html`
    cd cardioinsight_dbt
    dbt debug
    dbt build --full-refresh

    # Or run specific models:
    dbt run --select stg_cardioinsight
    dbt run --select mart_cardio_risk
    cd ..
    python ml/models/ml_pipeline.py
This will:
- Load `mart_cardio_risk`
- Train Logistic Regression & Random Forest
- Print ROC-AUC and classification metrics
- Save model artifacts
python data/exports/export_mart_to_csv.py
This writes the BI-ready file:
data/processed/mart_cardio_risk.csv
- Logistic Regression AUC: ~0.79
- Random Forest AUC: ~0.77
- Best model integrated into Power BI
- Patient-level predictions include expected vs actual clinical values
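For readers less familiar with the metric reported above, ROC-AUC can be read as the probability that a randomly chosen patient with CVD receives a higher predicted score than a randomly chosen patient without it. The sketch below computes it directly from that definition. It is an illustrative, quadratic-time version with a hypothetical function name, not the project's evaluation code. On a dataset of this size a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as P(score of a positive > score of a negative).

    Tied scores count as half a win. Labels are expected to be 0 or 1.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Compare every positive score against every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, an AUC of about 0.79 means the model ranks a true CVD case above a non-case in roughly four out of five such pairs.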
This dashboard visualizes cardiovascular risk insights using:
- Cleaned & feature-engineered dataset
- dbt-built clinical risk mart
- Logistic Regression ML model
- Patient-level ML risk drilldown
- Population-level epidemiological patterns
- KPI Tiles: Total Patients, CVD Risk %, High-Risk Patients
- CVD Risk Funnel: Population → CVD Events → Hypertension → High Cholesterol
- BMI Band distribution
- Age Band donut
- Cholesterol & Glucose stacked bars
- Hypertension distribution by age
- Filter panel:
  - Age Band
  - BMI Band
  - Cholesterol Category
  - Glucose Category
  - Smoking Status
  - Alcohol Use
  - Activity Level
- Patient selector (drop-down)
- Patient profile panel
- Clinical indicators:
  - Hypertension Status
  - ML Risk Flag
  - CVD Observed
  - Pulse Pressure
- ML Output:
  - Predicted CVD Probability (Gauge)
- Patient vs Population Comparisons:
  - Systolic BP
  - Diastolic BP
  - BMI
The dashboard uses ML predictions generated by the Python pipeline:
- Logistic Regression probability
- High/Moderate/Low Risk classification
- Combined with clinical thresholds (BP, cholesterol, BMI) → strong, explainable indicators
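One way the probability-to-tier mapping and the clinical threshold flags described above could be combined is sketched below. The cutoffs (0.3 and 0.6 for the tiers, 140/90 for blood pressure, code 2 or higher for cholesterol, BMI 30 for obesity) and the function names are hypothetical. The source does not state the values actually used.

```python
def risk_tier(probability, moderate_at=0.3, high_at=0.6):
    """Map a predicted CVD probability to a Low/Moderate/High tier.

    The default cutoffs are illustrative assumptions only.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_at:
        return "High"
    if probability >= moderate_at:
        return "Moderate"
    return "Low"


def clinical_flags(ap_hi, ap_lo, cholesterol_code, bmi):
    """List the threshold-based indicators that apply to one patient."""
    flags = []
    if ap_hi >= 140 or ap_lo >= 90:   # assumed hypertension cutoff
        flags.append("hypertension")
    if cholesterol_code >= 2:         # assumed: 2 and 3 mean above normal
        flags.append("elevated cholesterol")
    if bmi >= 30:                     # assumed obesity cutoff
        flags.append("obesity")
    return flags
```

Showing the tier next to the flags that apply lets a reviewer see which clinical measurements accompany a high model score, which is the explainability point made above.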
View the Interactive Power BI Dashboard
This platform demonstrates:
- Real-world data engineering
- Healthcare-grade data cleaning & validation
- ML model development & deployment
- BI storytelling with clinical insights
- Full end-to-end architecture
Aligned with roles in:
- Healthcare Analytics
- Biotech
- Data Engineering
- Machine Learning Engineering
Email:
shahnajfujaila@gmail.com
LinkedIn:
Fujaila-Shahnaj
Location: Raleigh–Durham–Cary, NC
Skills: Power BI • Data Engineering • ML/NLP • Python • dbt • DuckDB