Emergent Abilities in Reduced-Scale Generative Language Models

This repository contains code to filter data from existing corpora based on a child-directed vocabulary and to train small language models on the filtered data.

Installation

git clone git@github.com:text-machine-lab/mini_gpt.git
cd mini_gpt
pip install -r requirements.txt

Usage

The tokenizer used for filtering comes from filter_vocab_cpp.
Compile the C++-based filtration code and copy the resulting object file to the src directory.

The vocabulary used to simplify the pre-training data is in data/AOChildes_word_frequency.csv. This vocabulary is based on child-directed speech transcripts, which can be found here.
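
As an illustration, the vocabulary file can be loaded with pandas to build a word set for later filtering. This is a minimal sketch, assuming the words are stored in the first column of the CSV; the actual column layout is not documented here.

import pandas as pd

# Load the AOChildes word-frequency table (assumes words are in the first column).
vocab_df = pd.read_csv("data/AOChildes_word_frequency.csv")
child_vocab = set(vocab_df.iloc[:, 0].astype(str).str.lower())
print(f"Loaded {len(child_vocab)} child-directed vocabulary words")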

Downloading the SlimPajama dataset using git lfs:

git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

Chunks 1-10 are downloaded when this command is run.

To gather the unfiltered dataset:

python SlimPajama_unfiltered.py

For vocabulary filtering, run the following command for each chunk:

python SlimPajama_filtering.py --chunk_id 1
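
Conceptually, the filtering keeps a document only when its words fall within the child-directed vocabulary. The sketch below is an illustrative Python approximation, not the C++ filtration code: the regex word split, the keep_document name, the tiny example vocabulary, and the coverage threshold are all assumptions.

import re

# Tiny example vocabulary; in practice this comes from AOChildes_word_frequency.csv.
child_vocab = {"the", "cat", "sat", "on", "mat"}

def keep_document(text, child_vocab, min_coverage=1.0):
    # Crude word split; the real filtration code uses its own tokenizer.
    words = [w.lower() for w in re.findall(r"[a-zA-Z']+", text)]
    if not words:
        return False
    in_vocab = sum(w in child_vocab for w in words)
    # Keep the document only if enough of its words are in the child vocabulary.
    return in_vocab / len(words) >= min_coverage

print(keep_document("The cat sat on the mat.", child_vocab))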

Pre-training data:

The pre-training data, which consists of the vocabulary-filtered SlimPajama dataset, can be found here: 22B and 2.1B.

To train a BPE tokenizer:

python create_tokenizer.py --dataset_path ./dataset \
    --vocab_size 15_000 \
    --save_dir ./tokenizer
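
For reference, training a BPE tokenizer with the Hugging Face tokenizers library looks roughly like the sketch below. It only approximates what create_tokenizer.py does; the file layout of ./dataset and the special tokens are assumptions.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair-encoding tokenizer trained on plain-text files.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=15_000,
    special_tokens=["<unk>", "<pad>", "<eos>"],  # assumed special tokens
)
tokenizer.train(files=["./dataset/train.txt"], trainer=trainer)  # assumed file name
tokenizer.save("./tokenizer/tokenizer.json")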

To pre-train a language model with distributed training:

python -u -m accelerate.commands.launch main.py \
     --lr 2.8e-3 --num_warmup_steps 1000 --num_layers 8 \
     --hidden_size 32 --use_tokenizer filtered \
     --chkpt_dir ../models/SlimPajama_Nov23_context128_vocab_21k/filtered/hidden_32_num_layer_8_int_128 \
     --int_size 128 --rope_theta 20
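
For orientation, the flags above describe a very small decoder-only model. A minimal sketch of an equivalent configuration, assuming a LLaMA-style config from a recent version of Hugging Face transformers, is shown below; main.py may build the model differently, and the vocabulary size and number of attention heads here are assumptions.

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=21_000,            # assumed from "vocab_21k" in the checkpoint path
    hidden_size=32,               # --hidden_size 32
    intermediate_size=128,        # --int_size 128
    num_hidden_layers=8,          # --num_layers 8
    num_attention_heads=4,        # assumed
    max_position_embeddings=128,  # "context128" in the checkpoint path
    rope_theta=20.0,              # --rope_theta 20
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")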

Notebooks

To create the minigpt dataset, count the number of tokens in the filtered dataset, and analyze the filtered dataset, use the notebook notebooks/2.0-dataset-statistics.ipynb.

To apply position interpolation to pre-trained models, use the notebook notebooks/1.0-rope-pi.ipynb.
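
The idea behind position interpolation is to rescale position indices so that a longer context reuses the rotary-angle range seen during pre-training. The sketch below illustrates this on rotary embeddings; it is not taken from the notebook, and the dimensions, theta, and scale factor are illustrative assumptions.

import torch

def rope_angles(positions, dim, theta=20.0, scale=1.0):
    # Standard RoPE frequencies; scale > 1 interpolates positions back into the
    # range the model saw during pre-training.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = (positions.float() / scale)[:, None] * inv_freq[None, :]
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (seq_len, dim); rotate channel pairs by the per-position angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Extend a model trained on 128 positions to 256 positions with scale factor 2.
positions = torch.arange(256)
cos, sin = rope_angles(positions, dim=8, theta=20.0, scale=2.0)
q_rotated = apply_rope(torch.randn(256, 8), cos, sin)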

To filter downstream evaluation datasets based on the AO-Childes vocabulary, use the notebook notebooks/4.0-dataset_filtering.ipynb.

To get generations from the pre-trained and baseline models, use notebooks/3.0-model-generations.ipynb.
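
Outside the notebook, sampling from a trained checkpoint could look roughly like the sketch below, assuming the checkpoint is saved in a Hugging Face-compatible format; the paths are placeholders, and the loading code in the notebook may differ.

from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

# Placeholder paths; adjust to a real checkpoint and tokenizer file.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./tokenizer/tokenizer.json")
model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")

inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(inputs["input_ids"], max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))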

Citation

@misc{muckatira2024emergent,
      title={Emergent Abilities in Reduced-Scale Generative Language Models},
      author={Sherin Muckatira and Vijeta Deshpande and Vladislav Lialin and Anna Rumshisky},
      year={2024},
      eprint={2404.02204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
