This repository contains the implementation of LogicGPT, a version of Mistral 7B fine-tuned on tasks such as logical reasoning, code generation, and reading comprehension. Below are the steps and details regarding model selection, quantization, fine-tuning, evaluation, API creation, containerization, and the CI/CD pipeline.
The model selected for this project is Mistral 7B, chosen for the following reasons:
- Quantization support: The model can be quantized while retaining good performance, in particular its ability to keep track of a conversation and apply it to the current context, which enables effective CPU deployment for inference.
- Resource efficiency: The model can be run on consumer hardware with a GPU, and in this project the final inference is executed on a CPU, making it accessible to a wider audience.
- Faster and better inference: Mistral 7B uses an attention mechanism called Grouped-Query Attention, along with more attention heads, a larger context length, and a key-value cache, to boost inference performance.
Quantization was performed to enable faster inference on my local machine, which has no GPU. I used the q4_k_m quantization format, which offers a good quality/size trade-off and produces a medium-sized model. The quantization process is outlined in the notebook finetuning.ipynb.
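To make the CPU-inference point concrete, below is a minimal sketch of loading the resulting Q4_K_M GGUF file with llama-cpp-python and generating a completion without a GPU. The model path, context size, and sampling parameters are illustrative assumptions; the actual loading code lives in the API and the notebooks.

```python
# Minimal sketch: CPU inference with the quantized Q4_K_M GGUF model via llama-cpp-python.
# The model path, context size, and sampling parameters are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="model_files/unsloth.Q4_K_M.gguf",  # quantized model produced in finetuning.ipynb
    n_ctx=2048,    # context window (assumed)
    n_threads=8,   # CPU threads; adjust to your machine
)

output = llm(
    "Q: If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?\nA:",
    max_tokens=64,
    temperature=0.2,
)
print(output["choices"][0]["text"])
```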
For fine-tuning, I used the Open-Platypus dataset. It contains questions from diverse domains such as math, science, reading comprehension, coding, and logic.
The number of training question-answer pairs from each data source can be seen in the plot below. The data sources are far from uniformly distributed.
The dataset was chosen for several reasons:
- LLM performance on logic and math: LLMs often struggle with logical reasoning and math tasks, so I wanted to see if I could improve this through fine-tuning.
- Preprocessing: The dataset underwent extensive preprocessing by the authors, including:
  - Removal of sentence pairs with over 80% similarity, calculated using Sentence Transformers and keyword search.
  - Exclusion of questions resembling those in Hugging Face benchmark test sets.
- Variety of sources: This allows for evaluating performance changes across a variety of tasks.
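For reference, the dataset can be inspected directly from the Hugging Face Hub. The snippet below is a small sketch assuming the `garage-bAInd/Open-Platypus` dataset ID and a `data_source` column; the exact loading and prompt-formatting code used for training is in finetuning.ipynb.

```python
# Sketch: inspect the Open-Platypus dataset and its per-source distribution.
# Assumes the Hugging Face dataset ID "garage-bAInd/Open-Platypus" and a
# "data_source" column; check dataset.column_names if these differ.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("garage-bAInd/Open-Platypus", split="train")
print(dataset.column_names)

# Count how many training examples come from each source (cf. the plot above).
source_counts = Counter(dataset["data_source"])
for source, count in source_counts.most_common():
    print(f"{source}: {count}")
```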
The fine-tuning process involved:
- Unsloth, a framework that facilitates fast and efficient fine-tuning.
- SFTTrainer from Hugging Face's TRL library, which simplifies supervised fine-tuning tasks.
I used the LoRA (Low-Rank Adaptation) method for fine-tuning, setting the rank parameter to 16. Higher values can be used to improve performance further. The fine-tuning process is available in finetuning.ipynb, which was run on Google Colab using a free NVIDIA T4 GPU.
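In outline, the setup described above looks roughly like the sketch below. Only the LoRA rank of 16 is taken from the text; the base checkpoint, target modules, sequence length, data field, and training arguments are assumptions modelled on typical Unsloth examples (and the trl API varies slightly between versions), so finetuning.ipynb remains the authoritative version.

```python
# Rough sketch of the Unsloth + LoRA + SFTTrainer setup described above.
# Only r=16 comes from the text; everything else is an illustrative assumption.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048  # assumed

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # assumed base checkpoint
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank 16 as stated above, higher ranks may improve quality.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

dataset = load_dataset("garage-bAInd/Open-Platypus", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes the notebook adds a formatted "text" column
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```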
To evaluate both the original and fine-tuned Mistral 7B models, I used the lm-evaluation-harness framework.
The fine-tuned model was evaluated as follows:
- I launched the server with the GGUF model using llama-cpp-python:

      python -m llama_cpp.server --model model/unsloth.Q4_K_M.gguf

- I ran the evaluation with lm-evaluation-harness, passing the local server address of the model:

      lm_eval --model gguf --model_args base_url=http://localhost:8000 --tasks <task-name> --limit <number-of-questions> --log_samples --output_path '<results-output-path>'
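Before running lm_eval, it can help to confirm the local server is actually responding. The snippet below is a quick sanity check against the OpenAI-compatible completions endpoint that the llama-cpp-python server exposes; the prompt and parameters are arbitrary.

```python
# Quick sanity check that the local llama-cpp-python server is up
# before pointing lm-evaluation-harness at it.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "2 + 2 =",  # arbitrary test prompt
        "max_tokens": 8,
        "temperature": 0.0,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```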
For evaluating the original Mistral 7B model, I used Google Colab to speed up the process. The corresponding evaluation code is available in evaluation/original_model_evaluation.ipynb.
I evaluated both models on tasks similar to those in the fine-tuning dataset, including:
- Science questions from SciQ.
- Code-to-text tasks (i.e. generating a natural-language description of a code snippet) from CodeXGLUE.
- Numerical calculations from Arithmetic.
- Reading comprehension from MC_TACO.
- Logical reasoning from LogiQA.
The metrics used for evaluation were:
- Smoothed BLEU-4 for the CodeXGLUE task (natural-language generation), which measures n-gram overlap between the expected and generated outputs without penalizing differences in overall ordering beyond the n-gram level.
- Accuracy for all other datasets (multiple-choice or discrete outputs).
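As a concrete illustration of the metric (not the official CodeXGLUE scoring script), a smoothed BLEU-4 score can be computed with NLTK along these lines:

```python
# Illustrative smoothed BLEU-4 computation with NLTK.
# This sketches the metric; it is not the official CodeXGLUE scoring script.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the sum of two integers".split()
hypothesis = "return the sum of two numbers".split()

score = sentence_bleu(
    [reference],                       # list of reference token lists
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),  # equal weight on 1- to 4-grams (BLEU-4)
    smoothing_function=SmoothingFunction().method4,  # one common smoothing variant
)
print(f"Smoothed BLEU-4: {score:.3f}")
```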
The results, based on 20 questions per programming language for the CodeXGLUE dataset and 50 questions per question type for the other datasets, can be found in the evaluation folder and are visualized in evaluation/visualization.ipynb.
As the accuracy plot above shows, the fine-tuned model matches or outperforms the original model on many tasks. However, on some tasks the original performed slightly better (e.g. mc_taco and arithmetic_2dm) or substantially better (e.g. logiqa). One possible explanation is that training on many task types at once hurt performance on individual tasks, pointing to generalization issues or overfitting to the training mix.
For code-related tasks, the fine-tuned model showed significant improvements, particularly in the Go language, with around 5x better performance compared to the original model. However, performance on PHP was comparatively lower, though still superior to the original.
The API was built using FastAPI and loads the GGUF model with llama.cpp. Due to size constraints (over 2GB), the GGUF model must be downloaded from Google Drive before running the API.
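The repository's application code defines the real endpoints; the sketch below only illustrates the general pattern of serving a GGUF model with FastAPI and llama-cpp-python. The `/generate` endpoint name, request schema, and parameters are hypothetical.

```python
# Sketch of serving a GGUF model with FastAPI and llama-cpp-python.
# The /generate endpoint, request schema, and parameters are hypothetical;
# see the repository's API code for the real implementation.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="model_files/unsloth.Q4_K_M.gguf", n_ctx=2048)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
def generate(request: GenerateRequest) -> dict:
    result = llm(request.prompt, max_tokens=request.max_tokens)
    return {"completion": result["choices"][0]["text"]}
```

Once the container is up, the interactive documentation at http://localhost:8000/docs (see the steps below) can be used to send requests.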
To run the API locally:
- Add your Docker Hub credentials and the latest Docker image commit ID to the `.env` file:

      DOCKERHUB_USERNAME=dockerhub_username
      DOCKERHUB_PASSWORD=dockerhub_password
      DOCKER_IMAGE_TAG=tag

- Log in to Docker from the command prompt with `docker login`.

- Download the model:

      python -m pip install gdown
      gdown 1WGmDmHXTCmIqYHL-Jla3GUgtz_yFgzk1 -O ./model_files/unsloth.Q4_K_M.gguf
      chmod -R 755 ./model_files

- Two options are possible for using the API:

  - Start the API to use it directly:

        docker compose -f compose.yaml up --build

    The API will be available at http://localhost:8000/docs.

  - Alternatively, run the tests:

        docker-compose -f compose_test.yaml up --build

- Stop the API after use:

      docker compose down
A screenshot of the API returning an output for a request can be seen in the image below.
The repository includes a Dockerfile with all the necessary dependencies for running the application. This setup is used for both local development and deployment in the CI/CD pipeline.
The repository also includes a CI/CD pipeline in .github/workflows/ci.yaml
that performs the following tasks:
- Lints the Python code.
- Builds the Docker image.
- Downloads the model.
- Starts the API and runs tests.
As can be seen on the GitHub Actions page for one of the latest commits, all the jobs run successfully.
If no changes were made to the Dockerfile, the Docker image build (which takes around 10 minutes) can be skipped. To do that, make the following changes to .github/workflows/ci.yaml:
- Set `TO_BUILD_DOCKER` to `false` (line 17).
- Specify the `PREV_IMAGE_TAG` of the latest Docker image (line 18); no change is needed if the Dockerfile was not modified recently.
- Comment out line 81 (`needs: docker_build` in the `test_api_with_model` job), since the `docker_build` job is not run.
Undo the above steps to rebuild the Docker image and push it to Docker Hub.