World's first bi-directional brainrot translator.
Youwei Zhen 2024
Imagine collecting thousands of brainrotted messages on TikTok? Who wants that?? Data collection for Gen+ Translator is instead inspired by self-instruct.
Gen+ Translator can be considered a distilled model. The definitions of the slang words were taken from List of Generation Z slang - Wikipedia. Using mistral-nemo, an LLM run locally, exactly 6,000 translation examples were generated, covering the most common day-to-day topics (see topics.json). Each example also lists the slang words it uses.
These 6,000 examples were then split in half: 3,000 English-to-slang and 3,000 slang-to-English.
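To give a feel for the generation step, here is a hypothetical sketch of how one prompt might combine a topic with a few slang definitions. The function name and prompt wording are illustrative; the actual templates live in data_generation.py.

```python
import random

# Hypothetical helper; the real prompt templates are in data_generation.py.
def build_prompt(topic: str, slang_defs: dict[str, str]) -> str:
    """Ask the local LLM for one English/slang sentence pair about a topic."""
    picked = random.sample(sorted(slang_defs), k=min(3, len(slang_defs)))
    defs = "\n".join(f"- {word}: {slang_defs[word]}" for word in picked)
    return (
        f"Topic: {topic}\n"
        f"Slang words and definitions:\n{defs}\n"
        "Write one plain-English sentence about the topic, then rewrite it "
        "using the slang words above, and list the slang words you used."
    )

print(build_prompt("school", {"rizz": "charisma", "mid": "mediocre", "bet": "agreed"}))
```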
Ver. 1 Gen+ Translator is fine-tuned from gpt2-large and trained on 2x NVIDIA RTX 3090. The model was fine-tuned with PEFT and LoRA to cut compute and memory costs.
Ver. 2 (current ver.) Gen+ Translator is fine-tuned from a GPTQ-quantized Llama-2-7B and trained on 2x NVIDIA RTX 3090, again with PEFT and LoRA.
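For reference, attaching a LoRA adapter with PEFT looks roughly like this. This is a minimal sketch for the Ver. 1 base model; the hyperparameter values are illustrative, not the exact ones in finetune.py.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2-large")
lora = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small LoRA weights train
```

Only the adapter weights receive gradients, which is what keeps the memory footprint small enough for the two RTX 3090s.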
- Create a Python virtual environment (optional):
python -m venv venv
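source venv/bin/activate <- activate it (on Windows: venv\Scripts\activate)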
- Install the required dependencies:
pip install -r requirements.txt
- Change parameters in the .env (yes, I know I committed the .env, because it is being used as a config); a sketch of how the scripts can read it follows the variable list:
API_ENDPOINT="http://localhost:11434/api/generate" <- Ollama is used to run the model locally
MODEL_NAME="mistral-nemo:latest"
FINETUNE_MODEL="gpt2-large"
DEVICE="cuda:0"
- Generate the data. Running data_generation.py will use the local LLM to generate the 6,000 examples with 10 threads; to change these parameters, edit the file. A sketch of a single request is shown after the command.
python data_generation.py
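A minimal sketch of what one worker request can look like against Ollama's /api/generate endpoint (the helper name and demo prompt are illustrative; the real script fans this out across 10 threads):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

API_ENDPOINT = "http://localhost:11434/api/generate"  # from .env
MODEL_NAME = "mistral-nemo:latest"                    # from .env

def generate_one(prompt: str) -> str:
    """Send a single non-streaming generation request to Ollama."""
    resp = requests.post(
        API_ENDPOINT,
        json={"model": MODEL_NAME, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # Ollama returns the completion here

if __name__ == "__main__":
    prompts = ["Rewrite 'that party was great' using Gen Z slang."]  # demo
    with ThreadPoolExecutor(max_workers=10) as pool:
        print(list(pool.map(generate_one, prompts)))
```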
- Run finetune.py. If CUDA runs out of memory, adjust the training parameters inside the file; common memory-saving options are sketched below.
python finetune.py
The fine-tuned PEFT adapter will be saved inside ./adapter
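If you do hit out-of-memory errors, these are the usual transformers knobs to reach for. The argument names are standard TrainingArguments options; whether finetune.py exposes exactly these values is an assumption, so treat them as starting points.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./adapter",
    per_device_train_batch_size=1,  # smaller batches use less VRAM
    gradient_accumulation_steps=8,  # keeps the effective batch size at 8
    gradient_checkpointing=True,    # recompute activations to save memory
    fp16=True,                      # half-precision training
)
```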
- Load the model (a sketch of what these scripts do follows the commands):
python en-to-slang.py <- loads the English-to-slang translator
python slang-to-en.py <- loads the slang-to-English translator
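Under the hood, both loader scripts presumably amount to attaching the saved adapter to the base model and generating. A rough sketch; the prompt format and device are illustrative:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2-large"  # Ver. 2 uses the GPTQ-quantized Llama-2-7B instead
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(base, "./adapter").to("cuda:0")

prompt = "Translate to Gen Z slang: that party was amazing.\n"  # illustrative
inputs = tok(prompt, return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(out[0], skip_special_tokens=True))
```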