Skip to content

wandabwa2004/DS_Projects

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

94 Commits
 
 

Repository files navigation

Artificial Intelligence(AI),Generative AI (Gen AI), Data Science, Engineering and Analytics Projects and Solutions

Senior Data Scientist with extensive experience in the financial sector, specializing in developing innovative AI and data-driven solutions. My portfolio showcases a range of self-driven projects that demonstrate my passion for pushing the boundaries of technology and solving complex problems.

My work spans multiple domains:

  1. Artificial Intelligence: Custom AI solutions leveraging deep learning and machine learning algorithms to solve complex business challenges.
  2. Generative AI and LLMs: Cutting-edge projects exploring the capabilities of large language models, text-to-image generation, and other generative technologies to create innovative applications.
  3. Data Science: End-to-end data science solutions including predictive modeling, statistical analysis, and advanced feature engineering techniques.
  4. Engineering: Robust data engineering pipelines and scalable architectures that support production-ready AI systems.
  5. Analytics: In-depth analytical projects showcasing business intelligence, data visualization, and actionable insights extraction.

Each project in this portfolio represents independent work completed alongside my professional role in financial services, reflecting my commitment to continuous learning and technical excellence. These solutions demonstrate my ability to translate complex technical concepts into practical, value-driving applications.

Python version GitHub Medium Data  Science  Blog

Table of Contents

1. Large Language Models (LLMs), Generative AI (Gen AI) and AI Agents

2. Python

3. R

4. SAS

5. Visualizations (PowerBI, Tableau, ScatterText)

6. Model Deployment

Projects

Large Language Models (LLMs), Generative AI (Gen AI) and Agent

Predictive Maintenance Optimization and Advisory Agentic System - LangChain and GPT-4

Link to the repo and code:

This repository contains a predictive maintenance workflow of agents for an electricity utility company that:

  1. Ingests equipment data (e.g., transformers, poles, insulators, etc.)
  2. Analyzes it locally (summary statistics, risk measures).
  3. Runs an optimization step (Mixed-Integer Programming, MIP) to decide which equipment to maintain under a given budget.
  4. Summarizes the large maintenance schedule to avoid token-limit issues.
  5. Generates high-level financial and risk recommendations using LLM-based agents.
  6. Combines these advisors’ outputs into a coherent executive summary via a communicator agent.

Research Agent with with OpenAI

Link to the repo and code:

This project implements a Research Agent that combines state-of-the-art GPT models, with search and summarization capabilities.

Research Agent with with Llama

Link to the repo and code:

This project implements a Research Agent that combines state-of-the-art language models, like LLaMA and Falcon, with search and summarization capabilities. The agent can:

  • Perform web searches using DuckDuckGo.
  • Summarize long texts with advanced LLMs.
  • Automatically use GPU if available or fallback to CPU for processing.

Job and CV/Resume Matching System with Open AI

Link to the repo and code:

A multi-step AI application that matches one or more CVs against a Job Description using an LLM-based approach. Users can copy-paste or upload a job description (PDF, DOCX, or TXT), then upload multiple CVs, and the system will rank them based on 0–10 match scores and provide explanations of the strengths and weaknesses for each CV.

Features

  • Flexible Job Description Input Either copy-paste directly into a text area or upload a file in PDF, DOCX, or TXT.
  • Multiple CV Uploads Drag and drop (or browse) multiple CV files of various formats.
  • AI-Generated Matching & Explanation The system uses a multi-step logic:
    • Summarize the job description.
    • Summarize each CV.
    • Compute an embedding-based similarity.
    • Generate a final match score (0–10) and bullet-point explanation.
    • Interactive Web Interface Built with Streamlit, ensuring a simple, browser-based UI.

RAG: An AI-Powered Document-based Question-Answering (QA) System - GPT Driven

Link to the repo and code:

A secure, privacy-focused Retrieval-Augmented Generation (RAG) system designed for local document-based Question-Answering (QA). This application enables users to upload documents, extract relevant text, and retrieve answers to queries using OpenAI's GPT models—all while keeping data local.

Features:

  1. Document Upload: Upload and process documents in supported formats (PDFs, Word files, and plain text files).
  2. Text Extraction: Automatically extract and preprocess text from uploaded documents.
  3. Vector Store for Retrieval: Stores document content as embeddings and retrieves the most relevant sections for user queries.
  4. Interactive Q&A: Provides accurate answers to user queries based on document context.
  5. Streamlit Interface: A user-friendly interface for managing documents and querying.
  6. Local Processing: Ensures maximum privacy by processing data locally.

RAG: An AI-Powered Document-based Question-Answering (QA) System - LLama] Link to the repo and code:

The Kenyan Constitution Chatbot is an AI-powered application that allows users to upload the Constitution of Kenya 2010 PDF file, parse it, and ask specific questions about its content. It leverages AI models and document embeddings to provide concise, accurate, and context-aware answers.

Features 🛠️ Document Parsing: Converts uploaded PDFs into machine-readable text. 🔍 Intelligent Search: Uses vector search and document splitting to ensure efficient querying. 📊 AI-Generated Answers: Provides precise, formatted answers to questions based on the document. 🚀 Fast Processing: Optimized response time for queries. 💬 Interactive Chat: Users can input queries and receive context-rich answers.

Technology Stack

  • Frontend: Streamlit
  • Document Parsing: LlamaParse API
  • Vector Store: Qdrant
  • Embeddings Model: FastEmbed (BAAI/bge-base-en-v1.5)
  • Language Model: Groq API (llama3-70b-8192)
  • Document Loader: UnstructuredMarkdownLoader
  • Compression: Flashrank Rerank

KnowItAll: Your AI-Powered Assistant

Link to the repo and code:

I detail how I developed a multi-functional chatbot named KnowItAll, powered by OpenAI’s GPT-4 API, to handle diverse tasks such as answering questions, generating content, translating text, and writing code. While GPT-4 inherently possesses the ability to generate responses based on contextual understanding, this alone does not make it an intuitive assistant for specific use cases. To address this, KnowItAll leverages carefully crafted system and feature-specific prompts to align the AI's responses more closely with user expectations. By tailoring the input prompts to predefined user needs—such as content generation or code writing—the chatbot effectively serves as a highly capable and adaptive virtual assistant. This approach ensures that the chatbot not only understands explicit instructions but also delivers responses optimized for each feature, enhancing user interaction and satisfaction.

Python

Transforming Customer Experiences: How a Finetuned Llama 2 Model Can Empower Product FAQs:

Link to Article: with all code

I detail how I finetuned an open-source Llama 2 model with Safaricom’s product and service-related FAQ question and answer pairs. Models such as Llama 2 possess the capability to predict the subsequent token within a sequence. However, this predictive ability alone does not render them highly effective virtual assistants, as they do not inherently respond to explicit instructions. To bridge this gap, a technique known as instruction tuning is applied to align their responses more closely with human expectations. I made use of Supervised Fine-Tuning (SFT) with the FAQ pairs. In this approach, models are subjected to training on a dataset consisting of paired instructions and corresponding responses, as in our case with the Question-Answer pairs. The goal is to optimize the internal model parameters within the LLM to minimize the disparity between the generated answers and the ground-truth responses, which serve as reference labels.

Link to Article: with all code

This is part one of a two-part series where I build a scraper to get most FAQs about Safaricom products to use later on in fine-tuning an open-source Llama 2 Large Language Model on the data and eventually developing a chatbot that users could interact with the fine-tuned model.

I used the following Python packages:

  1. BeautifulSoup: to parse HTML and XML documents, making it easier to extract information from web pages.
  2. Selenium: to automate interactions with the website. It’s particularly useful for scraping dynamic content and interacting with JavaScript-driven pages.
  3. Pandas: to manipulate and store the data.
  4. Random: to add random delays between requests to avoid overloading the server.

I was able to scrape and store 1759 non-null product-related FAQs and their answers here https://www.safaricom.co.ke/media-center-landing/frequently-asked-questions

Link to Article: with all code

The Hansard and Audio Services Directorate within the Kenyan Parliament is responsible for recording and producing verbatim reports of parliamentary proceedings and committee deliberations. With a curiosity to understand the topics discussed by members of parliament over time, I sought to explore the Hansard reports for specific sessions or sittings. However, due to the length of these reports and the challenge of identifying relevant dates, this endeavor proved to be time-consuming and potentially unproductive.

This led me to explore effective methods of querying PDF documents and obtaining insightful information on specific topics. After considering various options, I decided to leverage Large Language Models (LLMs), with OpenAI being my preferred choice.

The analysis follows the below : -

  1. Sourcing data: Extracting PDFs from the official website, as they are publicly available.
  2. PDFs Validity Check
  3. Setting up dependencies: Configuring the necessary software libraries and tools.
  4. Querying the Hansard reports for 2018: Although there is no particular significance attributed to the 2018 reports, I chose this subset for demonstration purposes.
  5. Summarizing PDFs

Link to Notebook
Link to Article

Power outages are a prevalent challenge encountered by utility companies, highlighting the need for a thorough analysis of historical data to understand patterns and trends. I analysed the historical outage data for Ausgrid, Australia’s largest electricity distributor, which services 1.7 million customers across Sydney, the Hunter Valley, and the Central Coast.

In summary:

  1. The analysis shows that equipment faults have consistently been a major reason for power outages in the Ausgrid network.
  2. The year 2020 recorded the highest number of outages during the period covered by the dataset.
  3. Based on the data, Gosford, Hornsby, and Wyong were the locations most affected by power outages. Gosford’s cause of outages is mostly environmental-related factors.
  4. Power outages were found to be more prevalent in the afternoons, with a peak at around 6 PM across all days of the week. As such, Ausgrid’s rostering for the afternoon shift should consider having more workers on standby. Mornings and late evenings are usually quieter periods in terms of power outages.
  5. The analysis indicates that power outages are more prevalent on Saturdays. Therefore, there is a need for better workforce planning on this day.

Link to Notebook

This is a prediction problem based on a time-series dataset of online sales of a UK-based store. The company sells unique all-occasion giftware. Wholesalers make up a high number of their customers. The sales data is from 01/12/2009 to 09/12/2011. The problem here is to predict the sales for the next 22 days based on this historical data as the owner is interested in knowing the expected revenue at this time to be sure of the sports car he buys his partner for Christmas.

Dataset Dataset has 1067371 sales records. Each record is identified by 8 attributes i.e. Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID and Country . Individual descriptions are found here https://www.kaggle.com/mashlyn/online-retail-ii-uci#

Dataset name: online_retail_II.csv and can be found here https://www.kaggle.com/mashlyn/online-retail-ii-uci. I could not directly upload it here due to the 25MB size limitation.

What the Notebook Covers:

  1. Ingesting the dataset
  2. Perform Exploratory Data Analysis (EDA). This includes operations related to: - a) Total daily, weekly, and monthly sales volumes. b) Last months’ revenue share by product and by customer. c) Weighted average monthly sale price by volume
  3. Data Cleaning and Encoding
  4. Data Modelling (Using Facebook's Prophet)in relation to time series-based revenue prediction.

Sales Data  Timeseries Modelling

Link to Notebook

EDA and analytics on a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Tokyo in 2020. The data was scraped from www.sports-reference.com.

Objective: To visualize how Olympics has evolved over time with special emphasis on African countries that began participating quite many years after the competitions began. This is achieved by merging and visualizing output from the above datasets.

Olympics Data  Analytics

Link

Social network mining for users within Kenya's Deputy President's Twitter account. Three significant weaknesses are in this network setup: -

  • Isolated users — Isolates in the network, more so around @WilliamSRuto’s cluster are many. This means that they are likely to miss out what for example @MbuiMumbi or @oleitumbi disseminates, unless is re-shared or by @WilliamSRuto which may not always be the case. This is depicted by the low Reciprocated Vertex Pair Ratio.
  • Weak inter and intra cluster edges — Connections between clusters are weak, less for G1 to G5. This means content in the clusters is less likely to reach all users in it. The situation is even worse for inter-cluster connections.
  • Influence isolation — @oleitumbi is the only user of influence in this collection period. The user is a prime target for account suspension e.g. if someone reports of any policy violations. This is depicted by the low graph density value.

DP Ruto's  Twitter Social Network Engine

Repository

Tool designed and developed using Python and Streamlit to help you upload files to an online Sharepoint location. This works with Sharepoint 365 but can be modified to fit earlier SharePoint versions. Current functionality includes:

  • Specifying the folder path to the files to be uploaded (Source URL).
  • Summary information of the files to be uploaded.
  • Specification of Sharepoint login and related upload details.
  • Creation of a folder based on the todays date format in the base URL that is user specified.
  • Upload of the files matching the specified extension (currently .xlsx) to the new folder in the base URL. File format can be changed

Sharepoint Uploader

Notes on Usage

  • A deployed version of the app can be found here https://sharepointuploader.herokuapp.com/. The app can also be cloned and run locally using streamlit: streamlit run SharepointUploader.py. When doing this, ensure you have the required modules listed in the requirements file.
  • Make sure the account details for accessing Sharepoint on your domain are valid. Normally, the username is your domain specific email and password.

Bugs, Enhancements and Comments

All comments, bug reports and enhancement requests are welcome. To do so, please submit a new issue and I will work hard on improving the app.

Future Functionality

Future functionality will likely include:

  • Option to specify file formats to be uploaded in a folder with mixed file types.
  • Email trigger to the username once the files are all uploaded.

Repository | Notebook | nbviewer

  • Descriptive and Predictive Analytics for a Synthetic dataset on Financial crimes.
  • The dataset https://www.kaggle.com/ntnu-testimon/paysim1/download is a synthetic one i.e. simulated using PaySim based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The dataset is scaled down 1/4 of the original dataset.
  • Used sweetviz package for Exploratory Data Analysis (EDA).
  • Identified the most probable fraud indicators.
  • XGBoost and RandomForest Classifiers with Area under the precision-recall curve (AUPRC) as the metric for the skewed dataset.

Correlation Plot for Different Factors in Financial Crime

Conclusion:

  1. Fraud detection is a difficult process. This is especially compounded by the lack of integral data in the area.
  2. Tree based algorithms worked better in detection of fraud. This is partly attributed to the nature of data.

In this project, I setup a tweets collection framework for tweets belonging to five politicians in Kenya. I analyzed the tweet sentiments/emotions over time, packaged the same in a Streamlit App and hosted the same on Heroku.

Code | Deployed App

Repository | Notebook | nbviewer | Blog Article

  • Analytics of my body, activity and sleep data during the COVID-19 lockdown.
  • Identification of important factors that necessitated weight loss during the lockdown time. Fitbit Data Analytics

Repository | Notebook | nbviewer

  • Collection of streaming tweets from Auckland and Wellington, New Zealand's largest cities during COVID-19 Lockdown period.
  • Descriptive analytics on the tweeting patterns by users from the two cities. Aucklanders seemed to work more than tweet.
  • Topics of discussion were semantically identical across the cities. Visualized by PyLDAVis.

Tweeting  Patterns during  COVID-19 Lockdown

Repository | Notebook | nbviewer

The objective of the competition was to create a machine learning model to help Kenyan non-profit organization Local Ocean Conservation anticipate the number of turtles they will rescue from each of their rescue sites as part of their By-Catch Release Programme https://zindi.africa/competitions/sea-turtle-rescue-forecast-challenge.

  • Descriptive analytics and EDA for the dataset. Included encoding etc for better modelling.
  • 1.1897261428182493 RMSE as the measurement metric.

Turtles Capture and  Release Programme

Repository | Notebook | nbviewer | Blog Article

  • Collection of tweets from the different cities via geocoding.
  • Translation via GoogleTranslate Python library for modelling.
  • Descriptive analytics of the datasets per country.
  • Modelling via BERT and batch sentiment prediction per tweet and grouped by cities.
  • The highest probability to a sentiment was assumed to be the true sentiment of the tweet.

BERT Happiness Index

Article

Code Snippets with:-

  • Descriptive Analytics of Trip Advisor reviews for Museum of New Zealand (Te Papa Tongarewa)
  • Sentiment Analysis of the reviews.

Sentiment Over Time

Article

The project was an anlytical piece about what Kenyans really discuss online. Data in form of tweets was from January to December 2019.

Questions of Interest:-

  1. Are we able to deduce the nature of Kenyans based on their daily chatter? Do they talk about substantive issues?
  2. Are they topically consistent in their talk over time?

Nairobi City from the Space Station

R

Repository | Code

  • Descriptive Analytics of tweets geolocated to New Zealand.
  • Emotions Analysis in the collected tweets

Emotions Distribution

Repository | Code Blog Article

  • Descriptive Analytics of Uhuru Kenyatta's State of the Nation speeches from 2014 - 2019.
  • The language in Uhuru Kenyatta's State of the Nation addresses from 2014-2019 never changed.
  • His 2015's speech was the most difficult in terms of readability i.e. needed someone at postgraduate/ advanced undergraduate level to read and probably understand.
  • His 2019 speech was the most positive of all his speeches

Polarity in the speeches over time

SAS

Repository | Code | Dataset | Analytics Output

Visualizations (PowerBI, Tableau, ScatterText)

Code

  • COVID-19 Numbers by 10/03/2020 in PowerBI.

COVID-19 Numbers as at 10/03/2020

Repository

Used Scattertext package in Python to interactively visualize reviews related to the Museum of New Zealand.

Visualization of Terms Code | nbviewer

[Visualization of Topics Code] (https://lnkd.in/gzvNBwF) | nbviewer

Terms visualizations  in ScatterText

Model Deployment

Code | Deployed App

Political Sentiment Analyzer

Data Science Foundations

About

Data Science Portfolio

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published