🔍 SatyaSetu.AI

AI-Powered Digital Forensics & Fraud Detection Platform

SatyaSetu.AI is an intelligent forensic analysis platform that combines OCR, NLP, OSINT, and Machine Learning to detect fraud patterns in digital evidence and public procurement contracts.


🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                                SatyaSetu.AI                                 │
├──────────────────────────────┬──────────────────────────────────────────────┤
│        FRONTEND              │               BACKEND (FastAPI)              │
│        (Next.js)             │                                              │
│                              │  ┌─────────────────────────────────────────┐ │
│  • Evidence Upload           │  │           AI PIPELINES                  │ │
│  • Case Dashboard            │  │                                         │ │
│  • Entity Intelligence       │  │  📄 OCR Engine (Tesseract + PyMuPDF)    │ │
│  • Fraud Prediction UI       │  │  🧠 NER (spaCy en_core_web_sm)          │ │
│  • Batch Analysis            │  │  🎯 Scam Classifier (TF-IDF + SBERT)    │ │
│  • Report Generation         │  │  🔗 URL/QR Scanner + OSINT              │ │
│                              │  │  ⚖️ Fraud Predictor (XGBoost)           │ │
│                              │  │  📊 Risk Assessor                       │ │
│                              │  └─────────────────────────────────────────┘ │
│                              │                                              │
│                              │  ┌─────────────────────────────────────────┐ │
│                              │  │           ML MODELS                     │ │
│                              │  │                                         │ │
│                              │  │  • fraud_detection_model.pkl (XGBoost)  │ │
│                              │  │  • scam_classifier.pkl (Logistic Reg)   │ │
│                              │  │  • tfidf_vectorizer.pkl                 │ │
│                              │  │  • target_encoders.pkl                  │ │
│                              │  └─────────────────────────────────────────┘ │
└──────────────────────────────┴──────────────────────────────────────────────┘

🚀 Features

📄 Evidence Processing Pipeline

| Step | Module | Description |
|------|--------|-------------|
| 1 | Upload Evidence | Secure file upload with chain of custody logging |
| 2 | OCR Engine | Extract text from images/PDFs using Tesseract & PyMuPDF |
| 3 | NER Extraction | Identify entities (names, orgs, locations) using spaCy |
| 4 | Scam Classification | AI-powered scam detection with confidence scoring |
| 5 | URL/QR Scanning | Extract and analyze URLs with OSINT enrichment |
| 6 | Risk Assessment | Comprehensive threat scoring and recommendations |
| 7 | Report Generation | PDF forensic reports with all findings |
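
As an illustration of step 5, here is a minimal sketch of pulling URLs out of OCR text before any OSINT lookup. The regex and the `extract_urls` helper are assumptions made for this example; they are not taken from `url_qr_scanner.py`.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(text: str) -> list[str]:
    """Return unique URLs found in the text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in seen and urlparse(url).netloc:
            seen.add(url)
            urls.append(url)
    return urls


sample = "Verify your KYC at http://example.com/kyc. Or visit https://example.org/login!"
print(extract_urls(sample))
# ['http://example.com/kyc', 'https://example.org/login']
```

Deduplicating while preserving order keeps the later enrichment step from querying the same URL twice.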

⚖️ Fraud Detection (Public Procurement)

AI-powered corruption risk prediction for government contracts.


🤖 Machine Learning Models

1. 🚨 Fraud Detection Model (XGBoost)

Purpose: Predict corruption risk in public procurement contracts

Training Data:

  • Source: Digiwhist Romania 2023 (Open Contracting Data)
  • Size: 1.8M+ public procurement contracts
  • Target Variable: Composite Risk Indicator (CRI), a score from 0 to 1

Model Performance:

| Metric | Value |
|--------|-------|
| R² Score | 0.74 |
| Test RMSE | 0.098 |
| Train RMSE | 0.091 |
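
For readers unfamiliar with these metrics, the sketch below shows how RMSE and R² are computed for a regression target such as CRI. The numbers in it are made up for illustration and are not the project's predictions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up CRI values, for illustration only.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.4, 0.5, 0.8]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"R2:   {r2_score(y_true, y_pred):.2f}")
```

Since CRI lies between 0 and 1, an RMSE near 0.1 means predictions are typically off by about a tenth of the scale. The small gap between train and test RMSE reported above suggests limited overfitting.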

Key Features Used:

| Feature | Description |
|---------|-------------|
| tender_finalprice | Final awarded contract value |
| tender_estimatedprice | Initial estimated value |
| tender_recordedbidscount | Number of bidders |
| price_efficiency | Ratio: final_price / estimated_price |
| is_round_1000 | Flag if final price is divisible by 1000 |
| single_bidder_proxy | Flag if only 1 bidder participated |
| title_length | Length of contract title |
| is_medium_title | Flag if title is 100-200 characters |
| is_sunday | Flag if awarded on Sunday |
| is_december | Flag if awarded in December |
| buyer_encoded | Target-encoded buyer organization |
| winner_encoded | Target-encoded winning company |
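
The derived features above could be computed along the following lines. This is a sketch, not the training notebook's code: the `engineer_features` name, the inclusive 100-200 bounds, and the handling of a zero estimate are assumptions. `buyer_encoded` and `winner_encoded` are left out because they depend on the fitted encoders in `target_encoders.pkl`.

```python
from datetime import date


def engineer_features(title: str, estimated_price: float, final_price: float,
                      bidders: int, award_date: date) -> dict:
    """Build the non-encoded model features for one contract."""
    return {
        "tender_finalprice": final_price,
        "tender_estimatedprice": estimated_price,
        "tender_recordedbidscount": bidders,
        # Assumption: leave the ratio undefined when no estimate is available.
        "price_efficiency": final_price / estimated_price if estimated_price else None,
        "is_round_1000": int(final_price % 1000 == 0),
        "single_bidder_proxy": int(bidders == 1),
        "title_length": len(title),
        # Assumption: both bounds are inclusive.
        "is_medium_title": int(100 <= len(title) <= 200),
        "is_sunday": int(award_date.weekday() == 6),  # Monday is 0, Sunday is 6
        "is_december": int(award_date.month == 12),
    }


features = engineer_features(
    title="Road Construction Phase 1",
    estimated_price=500000,
    final_price=550000,
    bidders=1,
    award_date=date(2023, 12, 3),  # a Sunday in December
)
print(features)
```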

Rule-Based Fraud Signals:

🚩 Round Number Trap: Final price divisible by 1000
🚩 Single Bidder: Only one bidder (bid rigging indicator)
🚩 Vague Title: Contract title too short (<30 chars)
🚩 Cost Overrun: Final price exceeds estimate by >5%
🚩 Sunday Award: Unusual timing (awarded on Sunday)
🚩 December Rush: Year-end budget spending rush
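
The six signals translate directly into simple checks. The sketch below is one possible reading of them; the `fraud_signals` function and its argument names are hypothetical, and it assumes the request's `name` field is the contract title.

```python
def fraud_signals(title: str, estimated_price: float, final_price: float,
                  bidders: int, is_sunday: bool, is_december: bool) -> list[str]:
    """Return the names of all rule-based red flags raised by a contract."""
    signals = []
    if final_price % 1000 == 0:
        signals.append("Round Number Trap")
    if bidders == 1:
        signals.append("Single Bidder")
    if len(title) < 30:
        signals.append("Vague Title")
    # Overrun is measured relative to the estimate; skip it when there is none.
    if estimated_price > 0 and (final_price - estimated_price) / estimated_price > 0.05:
        signals.append("Cost Overrun")
    if is_sunday:
        signals.append("Sunday Award")
    if is_december:
        signals.append("December Rush")
    return signals


flags = fraud_signals(
    title="Road Construction Phase 1",
    estimated_price=500000,
    final_price=550000,
    bidders=1,
    is_sunday=False,
    is_december=True,
)
print(flags)
# ['Round Number Trap', 'Single Bidder', 'Vague Title', 'Cost Overrun', 'December Rush']
```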

Risk Levels:

| CRI Score | Level | Action |
|-----------|-------|--------|
| ≥ 0.70 | 🔴 CRITICAL | Immediate investigation required |
| ≥ 0.50 | 🟠 HIGH | Detailed audit recommended |
| ≥ 0.30 | 🟡 MODERATE | Enhanced monitoring advised |
| < 0.30 | 🟢 LOW | Standard oversight sufficient |
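
The thresholds are checked from the top down, so each score falls into the highest band it reaches. A minimal sketch of that mapping, with a hypothetical `risk_level` helper:

```python
def risk_level(cri: float) -> tuple[str, str]:
    """Map a CRI score in [0, 1] to its risk level and recommended action."""
    if not 0.0 <= cri <= 1.0:
        raise ValueError(f"CRI must be between 0 and 1, got {cri}")
    if cri >= 0.70:
        return ("CRITICAL", "Immediate investigation required")
    if cri >= 0.50:
        return ("HIGH", "Detailed audit recommended")
    if cri >= 0.30:
        return ("MODERATE", "Enhanced monitoring advised")
    return ("LOW", "Standard oversight sufficient")


for score in (0.82, 0.55, 0.30, 0.12):
    level, action = risk_level(score)
    print(f"{score:.2f} -> {level}: {action}")
```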

2. 🎯 Scam Classifier

Purpose: Classify text evidence as potential scam/fraud

Architecture: Hybrid approach combining:

  • TF-IDF Vectorizer: Text feature extraction
  • Logistic Regression: Primary classifier
  • Sentence-BERT: Semantic similarity matching to known scam patterns
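
The first two components can be sketched with scikit-learn as shown below. The training texts are toy examples invented for this illustration, the hyperparameters are assumptions, and the Sentence-BERT similarity step is omitted; how the project combines the two scores is not described here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up training texts (illustrative only; not the project's data).
texts = [
    "Your account is blocked, verify your KYC now at this link",
    "Congratulations, you won a lottery prize, send the processing fee",
    "Urgent: confirm your bank password to avoid suspension",
    "Meeting moved to 3 pm, see you in the conference room",
    "Please find the attached invoice for last month's services",
    "Lunch on Friday? Let me know what works for you",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = scam, 0 = legitimate

# TF-IDF turns each text into a sparse vector; logistic regression scores it.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# predict_proba returns [P(legitimate), P(scam)] for each input text.
proba = model.predict_proba(["Verify your KYC to unblock your bank account"])[0]
print(f"scam probability: {proba[1]:.2f}")
```

The scam-class probability can serve as the confidence score mentioned in the pipeline table.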

Scam Categories Detected:

  • Phishing / Fake Bank Emails
  • KYC Verification Scams
  • Lottery/Prize Scams
  • Investment Fraud
  • Tech Support Scams
  • Romance Scams

📂 Project Structure

SatyaSeth.AI/
├── backend/                    # FastAPI Backend
│   ├── app/
│   │   ├── api/               # API Routes
│   │   │   ├── upload_evidence.py
│   │   │   ├── analyze.py
│   │   │   ├── fraud_predict.py   # 🚨 Fraud Detection API
│   │   │   ├── report.py
│   │   │   ├── batch_analyze.py
│   │   │   └── threat_hub.py
│   │   ├── pipelines/         # AI Processing Pipelines
│   │   │   ├── ocr.py             # Tesseract + PyMuPDF
│   │   │   ├── ner.py             # spaCy NER
│   │   │   ├── scam_classifier.py # ML Scam Detection
│   │   │   ├── url_qr_scanner.py  # URL/QR Extraction
│   │   │   ├── osint_engine.py    # OSINT Lookups
│   │   │   ├── risk_assessor.py   # Risk Scoring
│   │   │   └── report_generator.py
│   │   ├── models/            # Trained ML Models
│   │   │   ├── fraud_detection_model.pkl  # XGBoost (Romania)
│   │   │   ├── scam_classifier.pkl
│   │   │   ├── tfidf_vectorizer.pkl
│   │   │   └── target_encoders.pkl
│   │   └── main.py            # FastAPI App Entry
│   ├── Dockerfile
│   └── requirements.txt
│
├── webapp/                    # Next.js Frontend
│   ├── src/
│   │   ├── app/
│   │   │   ├── fraud-predict/    # Fraud Prediction UI
│   │   │   ├── entities/         # Entity Intelligence
│   │   │   └── ...
│   │   └── lib/
│   │       └── api.ts            # API Client
│   └── package.json
│
└── docker-compose.yml

🛠️ Tech Stack

| Layer | Technology |
|-------|------------|
| Frontend | Next.js 14, React, TailwindCSS, Framer Motion |
| Backend | FastAPI, Python 3.10, Uvicorn |
| ML/AI | XGBoost, scikit-learn, spaCy, Sentence-Transformers |
| OCR | Tesseract, PyMuPDF |
| Database | File-based (JSON/Pickle for MVP) |
| Deployment | Docker, Render |

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • Docker (optional)

Backend Setup

cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: .\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm

# Run server
uvicorn app.main:app --reload

Frontend Setup

cd webapp

# Install dependencies
npm install

# Create .env.local
echo "NEXT_PUBLIC_API_URL=http://127.0.0.1:8000/api" > .env.local

# Run development server
npm run dev

Docker Deployment

# Build and run with Docker Compose
docker-compose up --build

📡 API Endpoints

Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/upload-evidence | Upload evidence files |
| POST | /api/analyze | Full forensic analysis |
| GET | /api/report/{file_id} | Generate PDF report |

Fraud Detection Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/fraud-predict | Predict single contract risk |
| POST | /api/fraud-predict/batch | Batch contract analysis |
| GET | /api/fraud-predict/model-info | Model metadata |

Example: Fraud Prediction Request

curl -X POST "http://localhost:8000/api/fraud-predict" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Road Construction Phase 1",
    "department": "Ministry of Transport",
    "estimated_price": 500000,
    "final_price": 550000,
    "bidders": 1,
    "award_month": 12,
    "is_sunday": false,
    "is_december": true
  }'
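
The same request can be sent from Python using only the standard library. This is a sketch: `build_request` and `predict` are hypothetical helper names, and the shape of the JSON response is not documented here, so the result is returned unparsed beyond JSON decoding.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/fraud-predict"

# Same payload as the curl example.
payload = {
    "name": "Road Construction Phase 1",
    "department": "Ministry of Transport",
    "estimated_price": 500000,
    "final_price": 550000,
    "bidders": 1,
    "award_month": 12,
    "is_sunday": False,
    "is_december": True,
}


def build_request(url: str, body: dict) -> urllib.request.Request:
    """Create a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(API_URL, payload)
# Requires the backend to be running locally:
# print(predict(request))
```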

📊 Model Training (Fraud Detection)

The fraud detection model was trained on public procurement data:

# Training notebook: notebooks/fraud_detection_training.ipynb
import pandas as pd
from xgboost import XGBRegressor

# Data source
data = pd.read_csv("data-romania-2023.csv")  # 1.8M+ contracts

# Target variable
# Composite Risk Indicator (CRI): combines multiple red flags

# Model: XGBoost Regressor
model = XGBRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8
)

# Cross-validated on Belgium data for generalization

🔒 Security Notes

  • CORS configured for specific origins in production
  • File uploads validated and sanitized
  • Chain of custody logging for evidence
  • No sensitive data stored in version control
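
Chain of custody logging usually means recording a cryptographic hash of each evidence file alongside who handled it and when. The sketch below shows one way to do that with the standard library; the record fields, the JSON Lines log format, and the function names are assumptions, not the project's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def custody_record(file_bytes: bytes, filename: str, action: str, actor: str) -> dict:
    """Describe one custody event; the SHA-256 digest fingerprints the file contents."""
    return {
        "filename": filename,
        "sha256": hashlib.sha256(file_bytes).hexdigest(),
        "action": action,
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_record(log_path: str, record: dict) -> None:
    """Append the record as one JSON line; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


record = custody_record(b"example evidence bytes", "screenshot.png", "upload", "analyst-1")
print(record["sha256"])
```

Recomputing the digest later and comparing it with the logged value shows whether the file has changed since upload.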

👥 Team

Escape Da Vinci: Building AI for transparent governance


📄 License

MIT License. See LICENSE for details.


🙏 Acknowledgments

  • Digiwhist / OpenTender: Public procurement data
  • spaCy: NLP pipeline
  • XGBoost: Gradient boosting framework
  • FastAPI: Modern Python web framework
