Training Details
The training scripts are in the scripts directory:
- First stage: PT (Continue PreTraining)
run_pt.sh
- Second stage: SFT (Supervised Fine-tuning)
run_sft.sh
- Third stage: RM (Reward Model)
run_rm.sh
- Fourth stage: RL (Reinforcement Learning), i.e. reinforcement learning from human feedback
run_rl.sh
- To train on a single GPU, set nproc_per_node to 1, or drop the torchrun command and run the Python script directly, for example:
python scripts/run_supervised_finetuning.py
- The default pretrained model is LLaMA. The training code is also compatible with other GPT-style models such as ChatGLM-6B and BLOOM; just adjust model_name_or_path.
- To specify the datasets, use --train_file_dir for the training data directory and --validation_file_dir for the validation data directory. If they are not specified, the HF datasets dataset given by --dataset_name is used by default. See the dataset format for the field format of the training set. It is recommended to add some general dialogue data to the domain training set. For dataset links, see 📚 Dataset.
- If the runtime environment supports deepspeed, add --deepspeed deepspeed_config.json
- If the GPU supports int8, add --load_in_8bit True to train with 8-bit quantization, which can significantly reduce memory usage
- To debug, use --max_train_samples and --max_eval_samples to cap the number of training and validation samples so you can quickly verify that the code runs. For real training, remove these two parameters or set them to -1
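The options above can also be assembled programmatically. The following is a minimal sketch and not part of the repository: build_train_command is a hypothetical helper, and it assumes that model_name_or_path is passed on the command line as --model_name_or_path like the other flags. It only builds the argument list and does not launch training.

```python
from typing import List, Optional


def build_train_command(
    script: str = "scripts/run_supervised_finetuning.py",
    nproc_per_node: int = 1,
    model_name_or_path: Optional[str] = None,
    train_file_dir: Optional[str] = None,
    validation_file_dir: Optional[str] = None,
    deepspeed_config: Optional[str] = None,
    load_in_8bit: bool = False,
    debug_samples: int = -1,
) -> List[str]:
    """Hypothetical helper: assemble a launch command from the options above."""
    # A single GPU can run the script directly; several GPUs go through torchrun.
    if nproc_per_node == 1:
        cmd = ["python", script]
    else:
        cmd = ["torchrun", "--nproc_per_node", str(nproc_per_node), script]

    # Assumption: the model path is exposed as --model_name_or_path.
    if model_name_or_path:
        cmd += ["--model_name_or_path", model_name_or_path]
    if train_file_dir:
        cmd += ["--train_file_dir", train_file_dir]
    if validation_file_dir:
        cmd += ["--validation_file_dir", validation_file_dir]
    if deepspeed_config:
        cmd += ["--deepspeed", deepspeed_config]
    if load_in_8bit:
        cmd += ["--load_in_8bit", "True"]
    # Cap the sample counts only when debugging; -1 leaves the datasets untouched.
    if debug_samples > 0:
        cmd += [
            "--max_train_samples", str(debug_samples),
            "--max_eval_samples", str(debug_samples),
        ]
    return cmd
```

For a quick smoke test on one GPU, " ".join(build_train_command(debug_samples=1000)) yields a plain python command with both sample caps set; leaving debug_samples at -1 omits them for a full run.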
By default, LoRA training is used at every stage to reduce memory requirements. After each stage, the LoRA (peft adapter) weights need to be merged into the base model, and the next stage's model_name_or_path should point to the merged model folder. Use the following command to merge:
python scripts/merge_peft_adapter.py \
--base_model_name_or_path base_model_dir \
--peft_model_path lora_model_dir \
--output_dir outputs-merged
- This script requires peft>=0.3.0
- The merged weights are saved in the output_dir directory, and can be directly loaded later by from_pretrained
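For readers curious about what a merge of this kind involves, here is a minimal sketch using the public peft and transformers APIs. It is not the contents of merge_peft_adapter.py: merge_lora_into_base is a hypothetical function, and it assumes a causal language model loadable through AutoModelForCausalLM, float16 weights, and a tokenizer stored alongside the base model.

```python
def merge_lora_into_base(base_model_dir: str, lora_model_dir: str, output_dir: str) -> None:
    """Sketch of a LoRA merge; assumes a causal LM and peft>=0.3.0."""
    # Imported lazily so that defining the function does not require the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model without quantization (assumption: float16 fits the hardware).
    base = AutoModelForCausalLM.from_pretrained(base_model_dir, torch_dtype=torch.float16)

    # Attach the adapter, then fold the LoRA deltas into the base weights.
    model = PeftModel.from_pretrained(base, lora_model_dir)
    merged = model.merge_and_unload()

    # Save a plain model directory that from_pretrained can load without peft.
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model_dir).save_pretrained(output_dir)
```

Calling merge_lora_into_base("base_model_dir", "lora_model_dir", "outputs-merged") mirrors the arguments of the command above; the resulting folder is what the next stage would receive as model_name_or_path.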
The training logs and models are saved in the output_dir directory, and the file structure in the directory is as follows:
output_dir/
|-- adapter_config.json
|-- adapter_model.bin
|-- checkpoint-24000
| |-- adapter_config.json
| |-- adapter_model.bin
| |-- trainer_state.json
| `-- training_args.bin
|-- train_results.txt
|-- eval_results.txt
|-- special_tokens_map.json
|-- tokenizer_config.json
|-- training_args.bin
|-- logs
| |-- 1685436851.18595
| | `-- events.out.tfevents.1685436851.ts-89f5028ad154472e99e7bcf2c9bf2343-launcher.82684.1
`-- config.json
- trainer_state.json records the changes in loss and learning_rate
- The files in the logs directory can be visualized with tensorboard. The command to start tensorboard is as follows:
tensorboard --logdir output_dir/logs --host 0.0.0.0 --port 8008
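The loss and learning rate can also be read straight from trainer_state.json without tensorboard. The sketch below assumes the usual Hugging Face Trainer layout, in which the file holds a log_history list whose training entries carry step, loss, and learning_rate; read_loss_curve is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import List, Tuple


def read_loss_curve(trainer_state_path: str) -> List[Tuple[int, float, float]]:
    """Return (step, loss, learning_rate) for each training log entry."""
    state = json.loads(Path(trainer_state_path).read_text(encoding="utf-8"))
    curve = []
    for entry in state.get("log_history", []):
        # Evaluation entries carry eval_* keys instead of "loss"; skip them.
        if "loss" in entry and "learning_rate" in entry:
            curve.append((entry["step"], entry["loss"], entry["learning_rate"]))
    return curve
```

With the directory layout shown above, the path would be something like output_dir/checkpoint-24000/trainer_state.json.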
For the deepspeed parameter configuration in deepspeed_config.json, refer to:
- https://www.deepspeed.ai/docs/config-json/
- https://huggingface.co/docs/accelerate/usage_guides/deepspeed
- https://github.com/huggingface/transformers/blob/main/tests/deepspeed
If GPU memory is sufficient, prefer ZeRO stage 2; the corresponding configuration file is deepspeed_config.json. If GPU memory is insufficient, use ZeRO stage 3, which partitions the model parameters across GPUs. This significantly reduces GPU memory usage, but training is much slower.
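To make the difference between the two stages concrete, the sketch below builds a minimal config dictionary for either one. It is an illustration only and not the repository's deepspeed_config.json: make_zero_config is a hypothetical helper, and the "auto" values assume the script is launched through the Hugging Face Trainer integration, which fills them in from its own arguments.

```python
import json


def make_zero_config(stage: int) -> dict:
    """Build a minimal DeepSpeed config dict for ZeRO stage 2 or 3."""
    if stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stage 2 and 3")
    config = {
        # "auto" defers these values to the Hugging Face Trainer arguments.
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "fp16": {"enabled": "auto"},
        "zero_optimization": {"stage": stage},
    }
    if stage == 3:
        # Stage 3 shards the parameters, so gather full 16-bit weights when saving.
        config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = True
    return config


def write_zero_config(stage: int, path: str) -> None:
    """Write the config to a JSON file that can be passed to --deepspeed."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(make_zero_config(stage), f, indent=2)
```

Only the zero_optimization block differs between the two variants, so switching stages is a matter of passing a different file to --deepspeed.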
As a multi-node example, take two machines with 8 GPUs each and run the following on each machine, passing its node rank as the first argument:
node_rank=$1
echo ${node_rank}
master_addr="10.111.112.223"
torchrun --nproc_per_node 8 --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} scripts/run_supervised_finetuning.py ...
- node_rank is the rank of the node: set it to 0 on the first (master) machine and to 1 on the second machine
- nnodes is the number of nodes (machines)
- master_addr is the IP address of the master machine
- master_port is the port used to communicate with the master machine
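The parameters above determine how many processes take part in training and which ranks each machine hosts. The arithmetic can be sketched as follows, assuming a static launch like the one shown, with the same number of processes on every node; both function names are hypothetical.

```python
from typing import List


def world_size(nnodes: int = 2, nproc_per_node: int = 8) -> int:
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node


def global_ranks(node_rank: int, nproc_per_node: int = 8) -> List[int]:
    """Global ranks of the processes started on the node with this node_rank."""
    # Each node owns a contiguous block of ranks, ordered by node_rank.
    start = node_rank * nproc_per_node
    return list(range(start, start + nproc_per_node))
```

For the two-machine example, the world size is 16; the master machine (node_rank 0) hosts ranks 0 to 7 and the second machine (node_rank 1) hosts ranks 8 to 15.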