Team Members -
- Ankit Shibusam - ashibusa@andrew.cmu.edu
- Atharva Anand Joshi - atharvaa@andrew.cmu.edu
- Ketan Ramaneti - kramanet@andrew.cmu.edu

Setup -
- Install the required dependencies on your local machine:
$ pip install torch numpy transformers datasets tiktoken wandb tqdm pytorch-ignite
- First, generate the train and val data by running data/openwebtext/prepare.py. This script fetches the OpenWebText dataset and performs a train-val split, followed by sub-word level tokenization using tiktoken. Finally, it saves the processed train and val data in the data/ folder (a sketch of these steps is given below).
$ python3 data/openwebtext/prepare.py
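
The actual implementation is in data/openwebtext/prepare.py; the snippet below is only a minimal sketch of those steps. The split ratio, output file names, and the uint16 .bin format are assumptions borrowed from nanoGPT-style data pipelines, not guaranteed to match the script.

```python
import numpy as np
import tiktoken
from datasets import load_dataset

# Minimal sketch of the prepare step; the split ratio and output paths
# below are illustrative assumptions, not the script's actual values.
enc = tiktoken.get_encoding("gpt2")            # GPT-2 BPE sub-word tokenizer

dataset = load_dataset("openwebtext", split="train")
split = dataset.train_test_split(test_size=0.0005, seed=2357)  # assumed ratio

def tokenize(example):
    ids = enc.encode_ordinary(example["text"])  # encode, ignoring special tokens
    ids.append(enc.eot_token)                   # delimit documents with <|endoftext|>
    return {"ids": ids}

for name, subset in [("train", split["train"]), ("val", split["test"])]:
    tokenized = subset.map(tokenize, remove_columns=["text"])
    ids = np.concatenate([np.asarray(ex["ids"], dtype=np.uint16) for ex in tokenized])
    ids.tofile(f"data/{name}.bin")              # assumed output location
```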
- You can run the pretraining by simply running the train script. The training configuration is set in config.py; an illustrative sketch of typical fields is given below.
$ python3 train.py
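
The field names below are assumptions modeled on nanoGPT-style configs and are shown only to illustrate the kind of variables config.py controls; check the file itself for the real names.

```python
# config.py - illustrative values; the actual field names in this repository
# may differ (these follow nanoGPT conventions).
out_dir = "out"                 # where checkpoints are written
dataset = "openwebtext"         # which prepared dataset to train on
batch_size = 12
block_size = 1024               # context length in tokens
n_layer, n_head, n_embd = 12, 12, 768   # GPT-2 small-sized model
learning_rate = 6e-4
max_iters = 600000
wandb_log = False               # set True to log metrics to Weights & Biases
```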
- For the fine-tuning tasks, set up the data by running the commands below:
$ python3 data/cnn_dailymail/prepare.py
$ python3 data/squad/prepare.py
- Set the right file names and the required config variables in finetune_config.py and config.py. The other fields can be left untouched, but the file paths will need to be modified (an illustrative sketch is given below).
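
As an illustration of the kind of fields to check (the actual variable names in finetune_config.py may differ), a hypothetical configuration might look like:

```python
# finetune_config.py - hypothetical field names, for illustration only.
init_from = "resume"                   # start from a pretrained checkpoint
ckpt_path = "out/ckpt.pt"              # assumed path to the pretrained checkpoint
train_data = "data/squad/train.bin"    # assumed path to prepared fine-tuning data
val_data = "data/squad/val.bin"
learning_rate = 3e-5                   # typically smaller than the pretraining LR
```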
- The trained model checkpoints can be downloaded from this directory - https://drive.google.com/drive/folders/13nobcjJdx2svWk4mJ8Xj_gO3p9V9I4AZ?usp=sharing (a loading sketch is given below).
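
A minimal sketch for loading and inspecting a downloaded checkpoint; the dictionary keys are assumptions based on nanoGPT's checkpoint format and may differ here:

```python
import torch

# Load a downloaded checkpoint on CPU; the key names below are assumed.
ckpt = torch.load("ckpt.pt", map_location="cpu")
print(ckpt.keys())  # nanoGPT saves e.g. "model", "optimizer", "model_args", "iter_num"
# model.load_state_dict(ckpt["model"])  # once the GPT model object is constructed
```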

Future Work -
- Set up support for distributed training.
- Write code for sequential unfreezing (a sketch of the idea is given below).
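
For reference, a minimal sketch of the sequential (gradual) unfreezing idea: freeze the whole network, then unfreeze transformer blocks one group at a time from the top down during fine-tuning. The attribute names (transformer.h, lm_head) follow nanoGPT's GPT class and are assumptions here.

```python
# Sketch of sequential unfreezing; attribute names are assumed from nanoGPT.
def freeze_all_but_head(model):
    for p in model.parameters():
        p.requires_grad = False
    for p in model.lm_head.parameters():    # keep the output head trainable
        p.requires_grad = True

def unfreeze_top_blocks(model, n_unfrozen):
    blocks = model.transformer.h            # assumed list of transformer blocks
    n = min(n_unfrozen, len(blocks))
    for block in blocks[len(blocks) - n:]:  # unfreeze from the output side down
        for p in block.parameters():
            p.requires_grad = True

# Hypothetical usage in the fine-tuning loop: one more block per epoch.
# freeze_all_but_head(model)
# for epoch in range(num_epochs):
#     unfreeze_top_blocks(model, n_unfrozen=epoch + 1)
#     train_one_epoch(model)                # hypothetical helper
```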

References -
- NanoGPT - https://github.com/karpathy/nanoGPT/tree/master. This repository was referred to while creating the LLM model.