🤗 [Model Download] • 📈 [Analysis and Results] • 📗 Pretraining Dataset
This repository contains the training code for CrystalCoder, a 7B-parameter language model pretrained on code and natural language.
The training of Phase 1-3 is completed using Cerebras's Model Zoo on Cerebras CS-2 hardware.
To launch the training, you can use the following command:
git clone https://github.com/Cerebras/modelzoo
cd modelzoo
python modelzoo/transformers/pytorch/gpt3/run.py CSX \
--mode train \
--num_csx 16 \
--num_workers_per_csx 1 \ # only needed for rel-1.9, not needed for rel-2.0
--params <path to params.yaml> \
--python_paths <path to modelzoo> \
--mount_dirs <mounts needed for modelzoo and data sets> \
--model_dir <path to model dir> \
--checkpoint_path <path to initalization checkpoint> \
In the script, you need to specify different --params
for different phases. The params files we used for each phase are in the params
folder.
The --checkpoint_path
is optional, and you can use it to specify the initialization checkpoint for the training. For example, for phase 2, you need to use the last checkpoint of phase 1 as the initialization checkpoint.
The processed dataset for each phase is available at CrystalCoderDatasets.
If you plan to process the dataset yourself from scratch, you can refer to our CrystalCoder data preprocessing code.
The code for instruction tuning is in the finetune
folder.
This is a modified version of Megatron-LM with the addition of muP and instruction dataset support to train CrystalChat.
To launch the distributed training, run rank_cmd_phase2.sh
on each node.
To convert the checkpoint from Huggingface format to Megatron-LM you can use the following command:
python Megatron-LM/tools/checkpoint/util.py --model-type GPT --loader crystalcoder_hf --saver megatron --target-tensor-parallel-size 2 --target-pipeline-parallel-size 4 --load-dir path/to/CrystalCoder_phase2_checkpoint_214387_to_hf --save-dir checkpoints/meg/phase2_tp2_pp4_dev
After the training, to convert the checkpoint from Megatron-LM to Huggingface format you can use the following command:
python Megatron-LM/tools/checkpoint/util.py --model-type GPT --loader megatron --saver crystalcoder_hf --load-dir checkpoints/to/megatron/checkpoint --save-dir checkpoints/to/hf/model --max-queue-size=5 --hf-config-path CrystalCoder
Go to README.old.md for the original README.
BibTeX:
@misc{liu2023llm360,
title={LLM360: Towards Fully Transparent Open-Source LLMs},
author={Zhengzhong Liu and Aurick Qiao and Willie Neiswanger and Hongyi Wang and Bowen Tan and Tianhua Tao and Junbo Li and Yuqi Wang and Suqi Sun and Omkar Pangarkar and Richard Fan and Yi Gu and Victor Miller and Yonghao Zhuang and Guowei He and Haonan Li and Fajri Koto and Liping Tang and Nikhil Ranjan and Zhiqiang Shen and Xuguang Ren and Roberto Iriondo and Cun Mu and Zhiting Hu and Mark Schulze and Preslav Nakov and Tim Baldwin and Eric P. Xing},
year={2023},
eprint={2312.06550},
archivePrefix={arXiv},
primaryClass={cs.CL}
}