This project applies machine learning and natural language processing (NLP) techniques to analyze and classify corporate emails from the Enron Email Dataset.
It extracts semantic patterns, relationships, and insights from large-scale email communication data to help visualize and interpret professional correspondence.
- Frontend (Website): https://proinsight-frontend.vercel.app
- Backend API: https://proinsight-backend.onrender.com
Source: Enron Email Dataset (Kaggle)
Cleaning Process:
Raw email data was parsed using Python’s email module to extract:
- Message-ID
- Date
- From
- To
- Subject
- Body
The cleaned dataset was saved as emails_clean.csv for downstream NLP and ML analysis.
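The parsing step described above could look roughly like the sketch below. It uses only the standard-library `email` and `csv` modules. The function names `parse_email` and `write_clean_csv` and the sample message are hypothetical and are not taken from the project's scripts; the actual implementation may differ, for example by using Pandas to write `emails_clean.csv`.

```python
import csv
import io
from email import message_from_string, policy

# Header fields named in the text; "Body" is extracted separately below.
HEADER_FIELDS = ["Message-ID", "Date", "From", "To", "Subject"]


def parse_email(raw: str) -> dict:
    """Parse one raw message into a flat dict of the extracted fields."""
    msg = message_from_string(raw, policy=policy.default)
    row = {}
    for field in HEADER_FIELDS:
        value = msg[field]
        row[field] = str(value).strip() if value is not None else ""
    # Prefer the plain-text part; fall back to an empty body if none exists.
    body = msg.get_body(preferencelist=("plain",))
    row["Body"] = body.get_content().strip() if body is not None else ""
    return row


def write_clean_csv(raw_messages, out_file) -> None:
    """Write one CSV row per raw message to an open text file object."""
    writer = csv.DictWriter(out_file, fieldnames=HEADER_FIELDS + ["Body"])
    writer.writeheader()
    for raw in raw_messages:
        writer.writerow(parse_email(raw))


# Hypothetical sample message, standing in for one file of the raw dataset.
SAMPLE = (
    "Message-ID: <example-1@example.com>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly forecast\n"
    "\n"
    "Here is our forecast.\n"
)

if __name__ == "__main__":
    buffer = io.StringIO()
    write_clean_csv([SAMPLE], buffer)
    print(buffer.getvalue())
```

To produce a file on disk, pass a handle opened with `open("emails_clean.csv", "w", newline="", encoding="utf-8")` in place of the in-memory buffer.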
- Data Cleaning: Removal of stopwords, punctuation, and non-ASCII characters.
- Tokenization & Lemmatization: Performed using spaCy.
- Feature Extraction: TF-IDF vectorization and word frequency analysis.
- Network Analysis: Constructed sender–receiver communication graphs using NetworkX.
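
As an illustration of the network analysis step, here is a minimal sketch of a sender-receiver graph built with NetworkX. It assumes a directed graph whose edge weights count messages, which is one reasonable design but not necessarily the one used in this project. The function name `build_communication_graph` and the address pairs are hypothetical.

```python
import networkx as nx


def build_communication_graph(pairs):
    """Build a directed graph; each edge weight counts emails sender -> receiver."""
    graph = nx.DiGraph()
    for sender, receiver in pairs:
        if graph.has_edge(sender, receiver):
            graph[sender][receiver]["weight"] += 1
        else:
            graph.add_edge(sender, receiver, weight=1)
    return graph


# Hypothetical (From, To) pairs, standing in for rows of emails_clean.csv.
PAIRS = [
    ("alice@example.com", "bob@example.com"),
    ("alice@example.com", "bob@example.com"),
    ("bob@example.com", "carol@example.com"),
]

graph = build_communication_graph(PAIRS)

# Rank senders by the total number of emails they sent (weighted out-degree).
top_senders = sorted(
    graph.out_degree(weight="weight"), key=lambda item: item[1], reverse=True
)

if __name__ == "__main__":
    for address, sent in top_senders:
        print(address, sent)
```

A directed graph keeps "who wrote to whom" distinct from "who received from whom", so sending and receiving activity can be ranked separately with `out_degree` and `in_degree`.
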
- Data Parsing & Cleaning — Extracts and structures raw email data.
- Exploratory Data Analysis (EDA) — Analyzes communication frequency, sentiment, and relationships.
- Feature Engineering — Uses TF-IDF and embeddings for semantic representation.
- Classification / Clustering — Identifies thematic or behavioral patterns in email content.
- Visualization — Builds network graphs using NetworkX and Matplotlib.
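
The feature engineering and clustering stages can be sketched with scikit-learn as follows. This is an assumption-laden example: the source does not say which clustering algorithm is used, so k-means is swapped in here purely for illustration, and the four short documents are invented stand-ins for cleaned email bodies.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned email bodies covering two rough themes.
DOCS = [
    "gas pipeline capacity and gas trading prices",
    "power trading desk prices for gas",
    "lunch meeting on friday",
    "friday team lunch plans",
]

# Turn each document into a TF-IDF vector, dropping common English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(DOCS)

# Group the vectors into two clusters; the seed is fixed for repeatable runs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

if __name__ == "__main__":
    for label, doc in zip(labels, DOCS):
        print(label, doc)
```

The number of clusters is a parameter that has to be chosen for the real dataset; two is used here only because the toy data has two obvious themes.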
The project integrates Google’s Gemini API for:
- Text summarization
- Semantic similarity comparison
- Context-aware keyword extraction
- Insight generation on communication trends
- Languages & Libraries: Python, Pandas, NumPy
- NLP Tools: spaCy, TextBlob
- ML Framework: scikit-learn
- Visualization: Matplotlib, NetworkX
- API: Gemini API
- Frontend: React (Vite + Tailwind + shadcn/ui)
- Backend: FastAPI (deployed on Render)
git clone https://github.com/nikitagrover19/ProInsight-ML.git
cd ProInsight-ML
cd scripts