End-to-End MLOps Project with ETL Pipeline - Building a Network Security Project Using a Phishing Dataset
This project demonstrates an end-to-end MLOps pipeline using a phishing dataset. The pipeline encompasses data ingestion, validation, transformation, model training, evaluation, and deployment. The goal is to build a robust machine learning model to detect phishing websites within a complete MLOps setup, so the model can be maintained, versioned, and deployed to production with ease.
The pipeline integrates tools and frameworks such as MLflow, DagsHub, MongoDB, FastAPI, Docker, and GitHub Actions to automate everything from data ingestion to model deployment, including continuous integration/continuous deployment (CI/CD). The focus of this project is the full MLOps lifecycle rather than any single model.
- constants/: Holds configuration files used throughout the pipeline.
- entity/: Contains definitions for core entities like models and data schemas.
- logging/: Custom logging setup for tracking pipeline and model training activities.
- exception/: Custom exceptions for handling errors in the pipeline (a minimal sketch of these two modules follows this list).
- pipeline/: Orchestrates the flow of the ML pipeline, from ETL to model training and inference.
- utils/: General utility functions used across the project.
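To make the roles of the logging/ and exception/ modules concrete, here is a minimal sketch of how a shared logger and a project-specific exception might look. The module layout, class name, and format string are illustrative assumptions, not the repository's actual code:

```python
import logging
import sys

# logging/: one shared configuration used by every pipeline stage
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("network_security")

# exception/: attach file and line context to any error raised in the pipeline
class NetworkSecurityException(Exception):
    def __init__(self, error: Exception):
        _, _, tb = sys.exc_info()
        self.file_name = tb.tb_frame.f_code.co_filename if tb else "<unknown>"
        self.line_number = tb.tb_lineno if tb else -1
        super().__init__(f"Error in [{self.file_name}] "
                         f"at line [{self.line_number}]: {error}")

# usage: wrap errors inside an except block so sys.exc_info() is populated
try:
    1 / 0
except Exception as e:
    logger.error(NetworkSecurityException(e))
```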
1. Data Ingestion → 2. Validation → 3. Transformation → 4. Model Training → 5. Deployment with FastAPI → 6. Monitoring with MLflow
- Data Ingestion: Load raw phishing data from CSV files and store it in MongoDB for easy access and scalability (a minimal sketch follows this list).
- Data Validation: Ensure the data's integrity by checking for missing values, duplicates, and outliers (also sketched below).
- Data Transformation: Clean the data and apply feature engineering techniques (e.g., scaling, encoding, or feature extraction).
- Model Training: Train machine learning models using different algorithms and tune hyperparameters.
- Model Evaluation: Evaluate the trained models on performance metrics such as accuracy, precision, recall, and F1 score.
- Deployment: Expose the trained model through a FastAPI application to make predictions on new phishing data.
- MLOps Automation: Track experiments and models with MLflow, manage the pipeline with DagsHub, and Dockerize the application for portability.
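As a concrete illustration of the ingestion step, here is a minimal sketch of loading the raw CSV into MongoDB with pandas and pymongo. The connection string, database name, collection name, and file path are placeholders, not the project's actual configuration:

```python
import pandas as pd
from pymongo import MongoClient

# placeholder connection details; real values would live in constants/
MONGO_URI = "mongodb://localhost:27017"
DB_NAME = "network_security"
COLLECTION = "phishing_data"

def ingest_csv_to_mongo(csv_path: str) -> int:
    """Read the raw phishing CSV and bulk-insert its rows into MongoDB."""
    df = pd.read_csv(csv_path)
    records = df.to_dict(orient="records")  # one document per row
    client = MongoClient(MONGO_URI)
    result = client[DB_NAME][COLLECTION].insert_many(records)
    return len(result.inserted_ids)

if __name__ == "__main__":
    inserted = ingest_csv_to_mongo("data/phishing.csv")  # hypothetical path
    print(f"Inserted {inserted} records")
```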
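The validation and transformation steps can likewise be sketched with plain pandas and scikit-learn. The target column name ("Result") and the 3-sigma outlier rule are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Basic integrity checks: missing values, duplicates, crude outliers."""
    assert not df.isnull().any().any(), "missing values found"
    df = df.drop_duplicates()
    # flag numeric outliers beyond 3 standard deviations (a crude rule of thumb)
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std()
    print(f"rows with |z| > 3 in any column: {(z.abs() > 3).any(axis=1).sum()}")
    return df

def transform(df: pd.DataFrame, target: str = "Result"):
    """Split features/target and scale features; 'Result' is an assumed label column."""
    X = df.drop(columns=[target])
    y = df[target]
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance
    return X_scaled, y
```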
During model development, I performed hyperparameter tuning to optimize performance: various hyperparameter configurations were tested for each algorithm to find the best-performing model.
Here are the test accuracies for the different models tried (a tuning sketch follows the list):
- AdaBoost: 92.51%
- Random Forest: 89.08%
- Decision Tree: 86.35%
- Gradient Boosting: 82.43%
- Logistic Regression: 79.43%
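The comparison above came from trying several algorithms with tuned hyperparameters. A minimal sketch of that kind of search using scikit-learn's GridSearchCV, shown for two of the candidates; the parameter grids are illustrative, not the exact grids used:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_and_compare(X, y):
    """Grid-search each candidate model and report held-out test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    candidates = {
        "AdaBoost": (AdaBoostClassifier(),
                     {"n_estimators": [50, 100, 200], "learning_rate": [0.5, 1.0]}),
        "Random Forest": (RandomForestClassifier(),
                          {"n_estimators": [100, 200], "max_depth": [None, 10]}),
    }
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=5, scoring="accuracy", n_jobs=-1)
        search.fit(X_tr, y_tr)
        acc = accuracy_score(y_te, search.predict(X_te))
        print(f"{name}: test accuracy {acc:.4f} (best params: {search.best_params_})")
```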
Once the project is set up and running, the FastAPI application should be live. Below is an example of the output you would see when accessing the API.
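Here is a minimal sketch of the serving layer: a FastAPI app that loads the trained model and exposes a prediction route. The artifact path, request schema, and label meanings are illustrative assumptions:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Network Security - Phishing Detector")

# hypothetical artifact path; the real path would come from constants/
with open("artifacts/model.pkl", "rb") as f:
    model = pickle.load(f)

class Features(BaseModel):
    # illustrative schema; the real one would list the dataset's feature columns
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": int(prediction)}  # e.g., 1 = phishing, -1 = legitimate

# run with: uvicorn app:app --reload
```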
DagsHub is used for tracking experiments and managing model versioning.
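The integration works by pointing MLflow's tracking URI at the MLflow server that DagsHub hosts for the repository, then logging parameters and metrics per run. A short sketch; the repository URL is a placeholder and the logged values echo the AdaBoost result above:

```python
import mlflow

# placeholder DagsHub endpoint; the real URI is the repo's ".mlflow" URL
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("phishing-detection")

with mlflow.start_run(run_name="adaboost"):
    mlflow.log_params({"n_estimators": 100, "learning_rate": 1.0})
    mlflow.log_metric("accuracy", 0.9251)  # value from the comparison above
    # mlflow.sklearn.log_model(model, "model")  # persist the fitted estimator
```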
After training and fine-tuning the model, I evaluated its performance on a separate test dataset. Below is the Prediction Report in the form of an HTML table.
This project has been an exciting journey through the entire Machine Learning lifecycle. From data ingestion and cleaning to model training, evaluation, and deployment, I’ve built a robust end-to-end MLOps pipeline. The key goal of this project was to demonstrate how to structure and automate a complete machine learning workflow, making it ready for production.
By experimenting with multiple models and fine-tuning hyperparameters, I identified AdaBoost as the best-performing model for this task, achieving an accuracy of 92.51%. I then containerized the entire workflow with Docker, integrated CI/CD through GitHub Actions, and tracked model experiments with MLflow.
This pipeline not only automates the machine learning lifecycle but also makes it easier to deploy, monitor, and update the models as new data becomes available.