[ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models”

YuchuanTian/RethinkTinyLM

[ICML'24] Rethinking Optimization and Architecture for Tiny Language Models

This is the official implementation of Rethinking Optimization and Architecture for Tiny Language Models, an empirical investigation into how to construct powerful tiny language models.

Four strategies are proposed to improve performance:

  • 🎯 Compact Tokenizer: efficient coverage of corpus;
  • 🔍 Architecture Tweak: better depth and width tradeoffs;
  • 🎁 Parameter Inheritance: powerful knowledge from larger LLMs;
  • 🔥 Multiple-Round Training: memory reinforcement of tiny models.

Based on the observations above, PanGu-π-1B Pro and PanGu-π-1.5B Pro are trained on 1.6T multilingual corpora. Model configurations are shown as follows:

Benchmark Results

Training

This repository is modified from the InternEvo training framework.

To set up the code:

  1. Clone the InternEvo repository and configure the runtime environment.
  2. Copy the configuration file configs/LLM1B.py to the InternEvo/configs/ directory.
  3. Copy the start script src/start_finetune.py to the InternEvo root directory.

Follow the InternEvo usage guide to prepare the data and train models (https://github.com/InternLM/InternEvo/blob/develop/doc/en/usage.md).

The model's depth, width, and expanding rate can be easily adjusted in the config.

Compact Tokenizer

The compact tokenizer is constructed by removing low-frequency tokens from the vocabulary. To prune the tokenizer, follow these steps:

  1. Count the frequency of the tokens in the data cached with the original large tokenizer.
    • python src/step1_token_frequency_stat.py --src cached_data_dir --dst tmp_stat_files_dir. The script counts the frequency of all tokens in the cached_data_dir folder and generates a corresponding JSON file in the tmp_stat_files_dir folder.
    • python src/step2_token_frequency_stat_combie.py --src tmp_stat_files_dir --dst total_token_freq.json. The script combines all JSON files in the tmp_stat_files_dir folder and writes the token frequencies to total_token_freq.json.
  2. Add the special tokens first, then add the tokens with the highest frequency to the new tokenizer.
    • python src/step3_generate_new_tokenizer.py --origin_tokenizer_dir origin_tokenzier --vocab_num compact_tokenizer_size --output new_tokenizer_dir --token_freq_file total_token_freq.json. This script generates a new tokenizer with compact_tokenizer_size tokens in the new_tokenizer_dir folder.
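
The pruning logic in the steps above can be sketched in plain Python. This is only an illustration of the idea (count per shard, merge, keep special tokens plus the most frequent tokens), not the code of the scripts in src/. The function names, the toy token IDs, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def count_token_frequencies(token_id_sequences):
    """Count how often each token ID occurs in one shard of tokenized data."""
    counts = Counter()
    for sequence in token_id_sequences:
        counts.update(sequence)
    return counts


def merge_counts(partial_counts):
    """Merge per-shard counters into one global frequency table."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def select_compact_vocab(total_counts, special_token_ids, vocab_num):
    """Keep the special tokens first, then the most frequent remaining tokens."""
    kept = list(dict.fromkeys(special_token_ids))  # de-duplicate, keep order
    special = set(kept)
    # Assumption: ties in frequency are broken by the smaller original ID.
    ranked = sorted(total_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for token_id, _ in ranked:
        if len(kept) >= vocab_num:
            break
        if token_id not in special:
            kept.append(token_id)
    return kept[:vocab_num]


def build_id_remap(kept_ids):
    """Map original token IDs to contiguous IDs in the compact vocabulary."""
    return {old_id: new_id for new_id, old_id in enumerate(kept_ids)}


# Toy data: two shards of already-tokenized sequences (hypothetical IDs).
shards = [[[5, 7, 7, 9], [7, 5]], [[9, 9, 9, 3]]]
total = merge_counts(count_token_frequencies(shard) for shard in shards)
kept = select_compact_vocab(total, special_token_ids=[0, 1], vocab_num=4)
remap = build_id_remap(kept)
print(kept)   # [0, 1, 9, 7]
print(remap)  # {0: 0, 1: 1, 9: 2, 7: 3}
```

If the counts are stored as JSON between steps, as the scripts do, note that JSON object keys are strings, so token IDs would need to be converted back to integers after loading.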

Parameter Inheritance

To pretrain by inheriting parameters from a large model, you can use the following command:

python start_finetune.py --config ./configs/LLM1B.py

Note that MODEL_ONLY_FOLDER must point to the checkpoint pruned from a large model.

If you want to train from scratch, you need to set load_given_ckpt=False in the config.

Multiple-Round Training

To select a certain proportion of challenging examples from the previous epoch, follow these steps:

  1. Compute the batch-wise loss $L=\{l_1,l_2,\cdots,l_N\}$ using the frozen pre-trained model from the previous epoch, where $N$ is the total number of batches. For instance, a dataset containing 150B tokens yields approximately 75000 batches at a batch size of 2M tokens.
  2. Calculate the sampling probability $p_i = \exp(l_i) \bigg/ {\sum \limits_{j=1}^N \exp(l_j)}$.
  3. Sample $N_0$ batches out of $N$ according to the sampling probability $\boldsymbol{p}$, i.e., filtered = torch.multinomial(p, N_0, replacement=False)
  4. Concatenate all the filtered batches to create the training dataset for the subsequent epoch.
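
The sampling in steps 2 and 3 can be sketched without any dependencies. The example below is an assumption-laden illustration, not the repository's code: it computes the softmax probabilities in a numerically stable way and then draws $N_0$ distinct batch indices with the Gumbel top-k trick, a swapped-in technique that should follow the same distribution as sequential sampling without replacement from $\boldsymbol{p}$ (the role played by torch.multinomial above). The function names and the toy losses are hypothetical.

```python
import math
import random


def sampling_probabilities(losses):
    """Softmax over batch losses: p_i = exp(l_i) / sum_j exp(l_j)."""
    shift = max(losses)  # subtract the max so exp() cannot overflow
    exps = [math.exp(loss - shift) for loss in losses]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


def sample_hard_batches(losses, n_keep, seed=None):
    """Draw n_keep distinct batch indices, favoring high-loss batches.

    Gumbel top-k: perturb each loss with independent Gumbel(0, 1) noise and
    keep the n_keep largest perturbed values.
    """
    if not 0 <= n_keep <= len(losses):
        raise ValueError("n_keep must be between 0 and the number of batches")
    rng = random.Random(seed)
    perturbed = []
    for index, loss in enumerate(losses):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        perturbed.append((loss + gumbel, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:n_keep]]


# Batch count from the example in step 1: 150B tokens at 2M tokens per batch.
num_batches = 150_000_000_000 // 2_000_000
print(num_batches)  # 75000

# Toy losses for five batches (hypothetical values).
toy_losses = [2.1, 2.4, 3.0, 1.9, 2.7]
probs = sampling_probabilities(toy_losses)
filtered = sample_hard_batches(toy_losses, n_keep=2, seed=0)
```

The indices in filtered identify the batches to concatenate in step 4. Working directly on the losses avoids materializing exp(l_i) for large losses, which is the main reason for choosing the Gumbel formulation in this sketch.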

Inference

Convert the model weights to Hugging Face format using the script tools/transformers/convert2hf.py.

python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizer_path/

The converted model can then be loaded for inference with Hugging Face Transformers.
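
A minimal sketch of running generation on the converted checkpoint, assuming the transformers and torch packages are installed and that the model directory is the --tgt_folder from the conversion command. The helper name generate_text is hypothetical, and trust_remote_code=True is an assumption that only matters if the converted checkpoint ships its own modeling code.

```python
def generate_text(model_dir, prompt, max_new_tokens=64):
    """Load a converted checkpoint and generate a continuation of `prompt`."""
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the checkpoint may contain custom code, hence
    # trust_remote_code=True. Only enable it for checkpoints you trust.
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
    # Decode the full sequence (prompt plus continuation).
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, generate_text("hf_ckpt/", "Hello") would load the checkpoint written by the command above and return the decoded text.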

Acknowledgements

Citation

@article{tang2024rethinking,
  title={Rethinking Optimization and Architecture for Tiny Language Models},
  author={Tang, Yehui and Liu, Fangcheng and Ni, Yunsheng and Tian, Yuchuan and Bai, Zheyuan and Hu, Yi-Qi and Liu, Sichao and Jui, Shangling and Han, Kai and Wang, Yunhe},
  journal={arXiv preprint arXiv:2402.02791},
  year={2024}
}

@article{wang2023pangu,
  title={PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation},
  author={Wang, Yunhe and Chen, Hanting and Tang, Yehui and Guo, Tianyu and Han, Kai and Nie, Ying and Wang, Xutao and Hu, Hailin and Bai, Zheyuan and Wang, Yun and others},
  journal={arXiv preprint arXiv:2312.17276},
  year={2023}
}
