AI-Powered Digital Forensics & Fraud Detection Platform
SatyaSetu.AI is an intelligent forensic analysis platform that combines OCR, NLP, OSINT, and Machine Learning to detect fraud patterns in digital evidence and public procurement contracts.
┌─────────────────────────────────────────────────────────────────────────────┐
│                                SatyaSetu.AI                                 │
├──────────────────────────────┬──────────────────────────────────────────────┤
│           FRONTEND           │              BACKEND (FastAPI)               │
│          (Next.js)           │                                              │
│                              │  ┌─────────────────────────────────────────┐ │
│  • Evidence Upload           │  │              AI PIPELINES               │ │
│  • Case Dashboard            │  │                                         │ │
│  • Entity Intelligence       │  │  OCR Engine (Tesseract + PyMuPDF)       │ │
│  • Fraud Prediction UI       │  │  NER (spaCy en_core_web_sm)             │ │
│  • Batch Analysis            │  │  Scam Classifier (TF-IDF + SBERT)       │ │
│  • Report Generation         │  │  URL/QR Scanner + OSINT                 │ │
│                              │  │  Fraud Predictor (XGBoost)              │ │
│                              │  │  Risk Assessor                          │ │
│                              │  └─────────────────────────────────────────┘ │
│                              │                                              │
│                              │  ┌─────────────────────────────────────────┐ │
│                              │  │                ML MODELS                │ │
│                              │  │                                         │ │
│                              │  │  • fraud_detection_model.pkl (XGBoost)  │ │
│                              │  │  • scam_classifier.pkl (Logistic Reg)   │ │
│                              │  │  • tfidf_vectorizer.pkl                 │ │
│                              │  │  • target_encoders.pkl                  │ │
│                              │  └─────────────────────────────────────────┘ │
└──────────────────────────────┴──────────────────────────────────────────────┘
| Step | Module | Description |
|---|---|---|
| 1 | Upload Evidence | Secure file upload with chain of custody logging |
| 2 | OCR Engine | Extract text from images/PDFs using Tesseract & PyMuPDF |
| 3 | NER Extraction | Identify entities (names, orgs, locations) using spaCy |
| 4 | Scam Classification | AI-powered scam detection with confidence scoring |
| 5 | URL/QR Scanning | Extract and analyze URLs with OSINT enrichment |
| 6 | Risk Assessment | Comprehensive threat scoring and recommendations |
| 7 | Report Generation | PDF forensic reports with all findings |
AI-powered corruption risk prediction for government contracts.
Purpose: Predict corruption risk in public procurement contracts
Training Data:
- Source: Digiwhist Romania 2023 (Open Contracting Data)
- Size: 1.8M+ public procurement contracts
- Target Variable: Composite Risk Indicator (CRI), a score from 0 to 1
Model Performance:
| Metric | Value |
|---|---|
| R² Score | 0.74 |
| Test RMSE | 0.098 |
| Train RMSE | 0.091 |
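
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how R² and RMSE are computed from actual and predicted CRI values. The sample numbers are made up for illustration and are not project data; the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up CRI values, for illustration only.
actual = [0.10, 0.40, 0.70, 0.20]
predicted = [0.15, 0.35, 0.60, 0.30]
print(rmse(actual, predicted), r2_score(actual, predicted))
```

Because CRI lies between 0 and 1, an RMSE near 0.1 means predictions are typically off by about a tenth of the full scale.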
Key Features Used:
| Feature | Description |
|---|---|
| `tender_finalprice` | Final awarded contract value |
| `tender_estimatedprice` | Initial estimated value |
| `tender_recordedbidscount` | Number of bidders |
| `price_efficiency` | Ratio: final_price / estimated_price |
| `is_round_1000` | Flag if final price is divisible by 1000 |
| `single_bidder_proxy` | Flag if only 1 bidder participated |
| `title_length` | Length of contract title |
| `is_medium_title` | Flag if title is 100-200 characters |
| `is_sunday` | Flag if awarded on Sunday |
| `is_december` | Flag if awarded in December |
| `buyer_encoded` | Target-encoded buyer organization |
| `winner_encoded` | Target-encoded winning company |
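
The derived features in the table can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual preprocessing: `derive_features` and its argument names are hypothetical, the 100-200 character bounds are assumed to be inclusive, and the target-encoded columns are left out because they depend on the fitted encoders.

```python
from datetime import date


def derive_features(title, estimated_price, final_price, bidders, award_date):
    """Compute the engineered features for one contract (hypothetical helper)."""
    return {
        # Guard against a missing or zero estimate.
        "price_efficiency": final_price / estimated_price if estimated_price else None,
        "is_round_1000": int(final_price % 1000 == 0),
        "single_bidder_proxy": int(bidders == 1),
        "title_length": len(title),
        # Assumption: both bounds are inclusive.
        "is_medium_title": int(100 <= len(title) <= 200),
        # date.weekday() returns 0 for Monday and 6 for Sunday.
        "is_sunday": int(award_date.weekday() == 6),
        "is_december": int(award_date.month == 12),
    }


features = derive_features(
    title="Road Construction Phase 1",
    estimated_price=500000,
    final_price=550000,
    bidders=1,
    award_date=date(2023, 12, 17),
)
print(features)
```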
Rule-Based Fraud Signals:
- 🚩 Round Number Trap: Final price divisible by 1000
- 🚩 Single Bidder: Only one bidder (bid rigging indicator)
- 🚩 Vague Title: Contract title too short (<30 chars)
- 🚩 Cost Overrun: Final price exceeds estimate by >5%
- 🚩 Sunday Award: Unusual timing (awarded on Sunday)
- 🚩 December Rush: Year-end budget spending rush
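
The six signals above translate directly into threshold checks. The sketch below is one possible reading of them; the function name and argument names are hypothetical, and the real API may evaluate the rules differently.

```python
def rule_based_flags(title, estimated_price, final_price, bidders, is_sunday, is_december):
    """Return the names of the rule-based signals a contract triggers."""
    flags = []
    if final_price % 1000 == 0:
        flags.append("Round Number Trap")
    if bidders == 1:
        flags.append("Single Bidder")
    if len(title) < 30:
        flags.append("Vague Title")
    # Overrun means the final price is more than 5% above the estimate.
    if estimated_price > 0 and final_price > estimated_price * 1.05:
        flags.append("Cost Overrun")
    if is_sunday:
        flags.append("Sunday Award")
    if is_december:
        flags.append("December Rush")
    return flags


flags = rule_based_flags(
    title="Road Construction Phase 1",
    estimated_price=500000,
    final_price=550000,
    bidders=1,
    is_sunday=False,
    is_december=True,
)
print(flags)
```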
Risk Levels:
| CRI Score | Level | Action |
|---|---|---|
| ≥ 0.70 | 🔴 CRITICAL | Immediate investigation required |
| ≥ 0.50 | 🟠 HIGH | Detailed audit recommended |
| ≥ 0.30 | 🟡 MODERATE | Enhanced monitoring advised |
| < 0.30 | 🟢 LOW | Standard oversight sufficient |
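
The table maps a predicted CRI score to a level by checking thresholds from highest to lowest. A minimal sketch, assuming the score has already been clipped to the 0-1 range (`risk_level` is a hypothetical name):

```python
def risk_level(cri):
    """Map a CRI score in [0, 1] to a (level, recommended action) pair."""
    if not 0.0 <= cri <= 1.0:
        raise ValueError("CRI score must be between 0 and 1")
    if cri >= 0.70:
        return "CRITICAL", "Immediate investigation required"
    if cri >= 0.50:
        return "HIGH", "Detailed audit recommended"
    if cri >= 0.30:
        return "MODERATE", "Enhanced monitoring advised"
    return "LOW", "Standard oversight sufficient"


print(risk_level(0.62))
```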
Purpose: Classify text evidence as potential scam/fraud
Architecture: Hybrid approach combining:
- TF-IDF Vectorizer: Text feature extraction
- Logistic Regression: Primary classifier
- Sentence-BERT: Semantic similarity matching to known scam patterns
Scam Categories Detected:
- Phishing / Fake Bank Emails
- KYC Verification Scams
- Lottery/Prize Scams
- Investment Fraud
- Tech Support Scams
- Romance Scams
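
To illustrate the "similarity matching to known scam patterns" idea without extra dependencies, the sketch below swaps Sentence-BERT embeddings for plain bag-of-words cosine similarity. It is a stand-in for the concept, not the project's classifier: the pattern phrases are invented for this example, and `closest_pattern` is a hypothetical helper.

```python
import math
import re
from collections import Counter

# Invented example phrases, keyed by category names from the list above.
KNOWN_PATTERNS = {
    "KYC Verification Scams": "your kyc is pending update your account details now to avoid suspension",
    "Lottery/Prize Scams": "congratulations you have won a lottery prize claim your reward today",
    "Tech Support Scams": "your computer is infected call our support number immediately",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def closest_pattern(text):
    """Return the best-matching category and its similarity score."""
    query = Counter(tokenize(text))
    scores = {
        label: cosine(query, Counter(tokenize(pattern)))
        for label, pattern in KNOWN_PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


print(closest_pattern("You have won a prize! Claim your lottery reward"))
```

Word-count vectors only match on shared vocabulary, which is the gap sentence embeddings are meant to close: they can also match paraphrases that use different words.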
SatyaSeth.AI/
├── backend/                        # FastAPI Backend
│   ├── app/
│   │   ├── api/                    # API Routes
│   │   │   ├── upload_evidence.py
│   │   │   ├── analyze.py
│   │   │   ├── fraud_predict.py    # 🚨 Fraud Detection API
│   │   │   ├── report.py
│   │   │   ├── batch_analyze.py
│   │   │   └── threat_hub.py
│   │   ├── pipelines/              # AI Processing Pipelines
│   │   │   ├── ocr.py              # Tesseract + PyMuPDF
│   │   │   ├── ner.py              # spaCy NER
│   │   │   ├── scam_classifier.py  # ML Scam Detection
│   │   │   ├── url_qr_scanner.py   # URL/QR Extraction
│   │   │   ├── osint_engine.py     # OSINT Lookups
│   │   │   ├── risk_assessor.py    # Risk Scoring
│   │   │   └── report_generator.py
│   │   ├── models/                 # Trained ML Models
│   │   │   ├── fraud_detection_model.pkl  # XGBoost (Romania)
│   │   │   ├── scam_classifier.pkl
│   │   │   ├── tfidf_vectorizer.pkl
│   │   │   └── target_encoders.pkl
│   │   └── main.py                 # FastAPI App Entry
│   ├── Dockerfile
│   └── requirements.txt
│
├── webapp/                         # Next.js Frontend
│   ├── src/
│   │   ├── app/
│   │   │   ├── fraud-predict/      # Fraud Prediction UI
│   │   │   ├── entities/           # Entity Intelligence
│   │   │   └── ...
│   │   └── lib/
│   │       └── api.ts              # API Client
│   └── package.json
│
└── docker-compose.yml
| Layer | Technology |
|---|---|
| Frontend | Next.js 14, React, TailwindCSS, Framer Motion |
| Backend | FastAPI, Python 3.10, Uvicorn |
| ML/AI | XGBoost, scikit-learn, spaCy, Sentence-Transformers |
| OCR | Tesseract, PyMuPDF |
| Database | File-based (JSON/Pickle for MVP) |
| Deployment | Docker, Render |
- Python 3.10+
- Node.js 18+
- Docker (optional)
cd backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download spaCy model
python -m spacy download en_core_web_sm
# Run server
uvicorn app.main:app --reload

cd webapp
# Install dependencies
npm install
# Create .env.local
echo "NEXT_PUBLIC_API_URL=http://127.0.0.1:8000/api" > .env.local
# Run development server
npm run dev

# Build and run with Docker Compose
docker-compose up --build

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/upload-evidence` | Upload evidence files |
| POST | `/api/analyze` | Full forensic analysis |
| GET | `/api/report/{file_id}` | Generate PDF report |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/fraud-predict` | Predict single contract risk |
| POST | `/api/fraud-predict/batch` | Batch contract analysis |
| GET | `/api/fraud-predict/model-info` | Model metadata |
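
For callers who prefer Python, a request to the single-contract endpoint could be built with the standard library as sketched below. The payload fields mirror the curl example in this README; the helper name `build_fraud_predict_request` is hypothetical, and the response format is not assumed here.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/api"  # local development server; adjust as needed


def build_fraud_predict_request(contract):
    """Build (but do not send) a POST request for /api/fraud-predict."""
    body = json.dumps(contract).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/fraud-predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


contract = {
    "name": "Road Construction Phase 1",
    "department": "Ministry of Transport",
    "estimated_price": 500000,
    "final_price": 550000,
    "bidders": 1,
    "award_month": 12,
    "is_sunday": False,
    "is_december": True,
}
request = build_fraud_predict_request(contract)

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```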
curl -X POST "http://localhost:8000/api/fraud-predict" \
-H "Content-Type: application/json" \
-d '{
"name": "Road Construction Phase 1",
"department": "Ministry of Transport",
"estimated_price": 500000,
"final_price": 550000,
"bidders": 1,
"award_month": 12,
"is_sunday": false,
"is_december": true
}'

The fraud detection model was trained on public procurement data:
# Training notebook: notebooks/fraud_detection_training.ipynb
import pandas as pd
from xgboost import XGBRegressor

# Data source
data = pd.read_csv("data-romania-2023.csv")  # 1.8M+ contracts
# Target variable
# Composite Risk Indicator (CRI): combines multiple red flags
# Model: XGBoost Regressor
model = XGBRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
)
# Cross-validated on Belgium data for generalization

- CORS configured for specific origins in production
- File uploads validated and sanitized
- Chain of custody logging for evidence
- No sensitive data stored in version control
Escape Da Vinci: Building AI for transparent governance
MIT License. See LICENSE for details.
- Digiwhist / OpenTender: Public procurement data
- spaCy: NLP pipeline
- XGBoost: Gradient boosting framework
- FastAPI: Modern Python web framework