This directory contains the code for continued pretraining.
Create a conda environment:
conda create -n trainenv python=3.10
Activate the environment:
conda activate trainenv
Install PyTorch from the official PyTorch site, choosing the build that matches your CUDA version.
Install the required packages:
pip install -r train/requirements.txt
Step 1: Modify `in_dirs` in `scripts/get_files_to-be-tokenized.py`. Each element of `in_dirs` should be the path to a directory that contains `.jsonl` files with the texts to be trained on under the key `"text"`. Then run:
python train/scripts/get_files_to-be-tokenized.py
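For concreteness, here is a minimal sketch of what the input data is expected to look like and how `in_dirs` might be set; the paths and example texts below are placeholders, not values from this repository:

```python
import json
import os

# Illustrative only: each .jsonl file holds one JSON object per line,
# with the raw training text stored under the key "text".
corpus_dir = "/data/corpus_a"  # placeholder path
os.makedirs(corpus_dir, exist_ok=True)
with open(os.path.join(corpus_dir, "part-000.jsonl"), "w", encoding="utf-8") as f:
    f.write(json.dumps({"text": "First training document ..."}) + "\n")
    f.write(json.dumps({"text": "Second training document ..."}) + "\n")

# in_dirs in scripts/get_files_to-be-tokenized.py then points at such directories:
in_dirs = ["/data/corpus_a", "/data/corpus_b"]  # placeholder paths
```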
Step 2: You can modify `out_dir` in `scripts/tokenize_data.py` to change the output directory. Then run:
python train/scripts/tokenize_data.py
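As a rough mental model, this step turns each `.jsonl` document into a row of token indexes stored in parquet. Below is a hedged sketch of that transformation, assuming a Hugging Face tokenizer and pandas/pyarrow; the tokenizer name, column name, and paths are assumptions, not the script's actual values:

```python
import os

import pandas as pd
from transformers import AutoTokenizer

# Illustrative sketch only -- the real scripts/tokenize_data.py may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
out_dir = "tokenized"  # corresponds to out_dir in scripts/tokenize_data.py
os.makedirs(out_dir, exist_ok=True)

texts = ["First training document ...", "Second training document ..."]
input_ids = [tokenizer(t)["input_ids"] for t in texts]

# One row per document; each row holds that document's token indexes.
pd.DataFrame({"input_ids": input_ids}).to_parquet(os.path.join(out_dir, "part-000.parquet"))
```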
Step 3: You can modify `in_dirs` in `scripts/get_files_to-be-trained.py` to change the paths to the directories that contain the tokenized parquet files. Each element should be the path to a directory that contains the parquet files created in Step 2. Then run:
python train/scripts/get_files_to-be-trained.py
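To sanity-check that the directories you list really contain the parquet files from Step 2, a quick hedged check (the paths and the column name are placeholders):

```python
import glob
import os

import pandas as pd

# Placeholder paths -- point these at the directories written in Step 2.
in_dirs = ["tokenized"]

for d in in_dirs:
    files = sorted(glob.glob(os.path.join(d, "*.parquet")))
    print(f"{d}: {len(files)} parquet file(s)")
    if files:
        # Each row should hold one document's token indexes (e.g. under "input_ids").
        print(pd.read_parquet(files[0]).head())
```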
Step 4: You can modify `max_len` in `scripts/group_text.py` to change the context length; it should match the context length you plan to use during training. Then run:
python train/scripts/group_text.py
This outputs a single parquet file containing token indexes grouped into blocks of the given context length.
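Conceptually, the grouping concatenates the per-document token streams and cuts them into fixed-length blocks. The sketch below illustrates that idea only; it is not the actual code of `scripts/group_text.py`, and the `max_len` default is an assumption:

```python
from itertools import chain

def group_tokens(docs, max_len=4096):
    """Concatenate per-document token lists and split the stream into
    blocks of exactly max_len tokens, dropping the incomplete tail.
    Illustrative sketch only."""
    stream = list(chain.from_iterable(docs))
    n_blocks = len(stream) // max_len
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_blocks)]

# Toy example with max_len=8: two short "documents" yield one full block.
print(group_tokens([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12]], max_len=8))
# -> [[1, 2, 3, 4, 5, 6, 7, 8]]
```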
Step 5: You can modify the training configs in `scripts/train.py`. The example script should be run on a cluster of 8 nodes with 4 A800 GPUs per node. The cluster we used adopts an all-reduce DDP setup, with `$WORLD_SIZE`, `$RANK`, and `$MASTER_PORT` configured automatically. You may need to modify the script so that it runs on your cluster. Run:
bash train/scripts/train.sh
This script trains the model for 3 epochs with a global batch size of 512 (8 nodes x 4 GPUs per node x 4 `per_device_train_batch_size` x 4 `gradient_accumulation_steps`).
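If your cluster does not export `$WORLD_SIZE`, `$RANK`, and `$MASTER_PORT` automatically, they are the standard inputs to `torch.distributed` process-group initialization. The sketch below shows how a launcher typically consumes them; it is a hedged illustration, not the wiring used in `train.sh`/`train.py`, and the fallback values are placeholders:

```python
import os

import torch.distributed as dist

# These variables are normally exported by the scheduler or by torchrun;
# the fallbacks below are placeholders for a single-process test run.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# "nccl" assumes one GPU per process; use "gloo" for a CPU-only smoke test.
dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)

# Effective global batch size for the example configuration:
# 8 nodes x 4 GPUs per node x 4 per_device_train_batch_size x 4 gradient_accumulation_steps = 512
```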