Motivated by the need for more effective and efficient training, we propose the Code Adaptive Compute-efficient Tuning (CodeACT) framework. CodeACT introduces the Complexity and Diversity Aware Sampling (CDAS) method to select high-quality training data based on complexity and diversity, and the Dynamic Pack padding strategy to reduce computational resource usage by minimizing padding tokens during training.
Experimental results demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves an 8.6% performance increase on HumanEval, reduces training time by 78%, and decreases peak GPU memory usage by 27%.
An overview of our proposed CDAS method, including three steps from top to bottom.
- Step 1: Clustering the EVOL-Instruct dataset to form multiple clusters.
- Step 2: Computing the Instruction-Following Difficulty score by comparing the model's perplexity with and without instructions.
- Step 3: Sampling the top m% instances from each re-ranked cluster to form a high-complexity sub-dataset that preserves data diversity.
Finally, we use the selected data for fine-tuning to obtain CodeACT-Coder.
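To make the three steps concrete, the sketch below shows the selection logic in plain Python. It is an illustration rather than the repository code: the k-means clustering and the exact IFD formula (here assumed to be the perplexity of the answer conditioned on the instruction divided by its unconditioned perplexity) are our reading of the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cdas_select(embeddings, ppl_with_instr, ppl_without_instr,
                n_clusters=100, ratio=0.4):
    """Sketch of CDAS: cluster for diversity, rank by IFD for complexity."""
    # Step 1: cluster instruction embeddings to preserve data diversity.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    # Step 2: Instruction-Following Difficulty (IFD). A higher ratio means the
    # instruction helps less, i.e. the sample is harder and more informative.
    ifd = np.asarray(ppl_with_instr) / np.asarray(ppl_without_instr)

    # Step 3: within each cluster, keep the top `ratio` fraction by IFD.
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(-ifd[idx])]              # re-rank by descending IFD
        selected.extend(idx[: max(1, int(len(idx) * ratio))].tolist())
    return sorted(selected)
```

In the repository this logic is split across CDAS/calculate_ifd.py, CDAS/get_embedding.py, and CDAS/select_data.py, whose usage is described below.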
Illustration of different padding strategies, where the blank squares represent padding tokens.
- Top: Traditional padding strategy aligns samples to the model's maximum input length, resulting in high computational resource consumption.
- Middle: Dynamic padding strategy reduces the number of padding tokens by aligning samples to the length of the longest sample in each batch.
- Bottom: Our proposed Dynamic Pack strategy sorts samples by length and concatenates multiple samples within a batch, further optimizing the utilization of the model's maximum input length and reducing padding tokens.
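The packing step can be summarized with a short sketch (illustrative only; the actual implementation lives in the training code and may differ in detail, e.g. in how attention masks are handled):

```python
def dynamic_pack(lengths, max_seq_len=2048):
    """Greedily pack length-sorted samples so that each packed sequence stays
    within max_seq_len, reducing the number of padding tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])  # sort by length
    packs, current, used = [], [], 0
    for i in order:
        if current and used + lengths[i] > max_seq_len:
            packs.append(current)      # close the current pack and start a new one
            current, used = [], 0
        current.append(i)
        used += lengths[i]
    if current:
        packs.append(current)
    return packs  # each pack is a list of sample indices to concatenate
```

Samples longer than max_seq_len would form single-sample packs here and are assumed to be truncated elsewhere.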
The CodeACT framework has been tested for its performance and efficiency on the OSS-Instruct and EVOL-Instruct datasets using the CodeLlama and DeepSeek-Coder models. The results show that CodeACT effectively optimizes computational resources while maintaining or improving model performance, highlighting its potential for efficient and effective model training.
The open-source models trained with CodeACT show significant performance improvements. Notably, CodeACT-DS-6.7B, trained on fewer data samples, surpasses models trained on larger datasets, showcasing the effectiveness of the CDAS method for data selection. These results indicate that CodeACT not only enhances model performance but also bridges the gap between open-source and closed-source models, positioning it as a valuable tool for advancing the capabilities of Code LLMs with optimized data selection and training processes.
- Python 3.11.9
- PyTorch 2.3.0
- CUDA 12.1
conda create -n codeact python=3.11
conda activate codeact
pip install -r requirements.txt
Before installing the following packages, please review the installation instructions and requirements on their respective repositories to ensure compatibility with your environment.
pip install deepspeed==0.14.2
pip install flash-attn==2.5.9.post1 --no-build-isolation
- EVOL-Instruct: An open-source implementation of Evol-Instruct as described in the WizardCoder Paper.
- OSS-Instruct: This dataset is generated by gpt-3.5-turbo-1106. More details can be found in Magicoder Paper.
Our proposed CDAS is rooted in the necessity of considering both the complexity and diversity of data to enhance model training efficiency and effectiveness.
To efficiently handle large datasets, we support processing data in chunks. Each chunk is defined by the `index` and `nums` parameters, which determine the range of data to be processed.
Note:
- For the EVOL-Instruct dataset, use `instruction` for `question_column_name` and `output` for `answer_column_name`.
- For the OSS-Instruct dataset, use `problem` for `question_column_name` and `solution` for `answer_column_name`.
# Calculate IFD scores for a single chunk with indices from 0 to 10,000
python CDAS/calculate_ifd.py \
--data_path ${data_path} \
--question_column ${question_column_name} \
--answer_column ${answer_column_name} \
--save_dir ${save_dir} \
--model_path ${model_path} \
--gpu ${gpu_id} \
--index 0 \
--nums 10000
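For large datasets, the chunks can be processed in parallel, one per GPU. The loop below is a hypothetical example; it assumes that `--index` selects the chunk number and that each chunk covers `--nums` consecutive samples:

```bash
# Hypothetical: score 4 chunks of 10,000 samples each, one chunk per GPU
for i in 0 1 2 3; do
    python CDAS/calculate_ifd.py \
        --data_path ${data_path} \
        --question_column ${question_column_name} \
        --answer_column ${answer_column_name} \
        --save_dir ${save_dir} \
        --model_path ${model_path} \
        --gpu ${i} \
        --index ${i} \
        --nums 10000 &
done
wait
```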
After processing all chunks, merge the results using the following command:
# Merge all processed chunks
python CDAS/calculate_ifd.py \
--save_dir ${save_dir} \
--num_split ${num} \
--merge
2. Get embeddings using the all-mpnet-base-v2 model
python CDAS/get_embedding.py \
--data_path ${data_path} \
--save_dir ${save_dir} \
--batch_size 512 \
--model_path ${embedding_model_path} \
--gpu ${gpu_id}
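Conceptually, this step embeds every instruction and saves the resulting matrix for the selection step. A minimal standalone sketch, assuming the model is loaded through the sentence-transformers library and the data is a JSON list with an instruction field (the actual script may load and save differently):

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical equivalent of CDAS/get_embedding.py for an EVOL-Instruct-style file
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda")

with open("data.json") as f:                         # placeholder path
    instructions = [ex["instruction"] for ex in json.load(f)]

embeddings = model.encode(instructions, batch_size=512, show_progress_bar=True)
np.save("embeddings.npy", np.asarray(embeddings))    # later passed via --npy_path
```

The selection step then combines these embeddings with the IFD scores to keep the top fraction of samples per cluster (a ratio of 0.4 keeps 40% of the data):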
python CDAS/select_data.py \
--data_path ${data_path} \
--npy_path ${npy_path} \
--save_path ${save_path} \
--ratio 0.4
Parameters:
- model_path: Path to the pretrained model.
- data_path: Path to the JSON training data.
- model_type: The type of model being used. Supported values are `codellama` and `deepseek`.
- pad_mode: The padding mode to use. Supported values are `dynamic_pad` and `dynamic_pack`.
- output_dir: Path to save training outputs.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
accelerate launch --config_file "configs/fsdp_config.yaml" train.py \
--model_name_or_path ${model_path} \
--use_fast_tokenizer True \
--use_flash_attn True \
--data_path ${data_path} \
--model_type ${model_type} \
--max_seq_len 2048 \
--pad_mode ${pad_mode} \
--output_dir ${output_dir} \
--overwrite_output_dir True \
--gradient_checkpointing True \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine" \
--warmup_steps 15 \
--save_strategy "no" \
--bf16 True \
--tf32 True \
--logging_strategy "steps" \
--logging_steps 50
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# LoRA
torchrun --nproc_per_node 8 train.py \
--model_name_or_path ${model_path} \
--use_fast_tokenizer True \
--use_flash_attn True \
--use_peft_lora True \
--data_path ${data_path} \
--model_type ${model_type} \
--max_seq_len 2048 \
--pad_mode ${pad_mode} \
--output_dir ${output_dir} \
--overwrite_output_dir True \
--gradient_checkpointing True \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine" \
--warmup_steps 15 \
--save_strategy "no" \
--bf16 True \
--tf32 True \
--logging_strategy "steps" \
--logging_steps 50
# QLoRA
torchrun --nproc_per_node 8 train.py \
--model_name_or_path ${model_path} \
--use_fast_tokenizer True \
--use_flash_attn True \
--use_peft_lora True \
--use_4bit_quantization \
--data_path ${data_path} \
--model_type ${model_type} \
--max_seq_len 2048 \
--pad_mode ${pad_mode} \
--output_dir ${output_dir} \
--overwrite_output_dir True \
--gradient_checkpointing True \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine" \
--warmup_steps 15 \
--save_strategy "no" \
--bf16 True \
--tf32 True \
--logging_strategy "steps" \
--logging_steps 50
This section details the evaluation of the model's performance on the HumanEval(+) and MBPP(+) benchmarks.
In this step, the model generates code samples via greedy decoding: we feed it input prompts and collect the generated outputs.
# Base Model
python benchmark/generate.py \
--dataset [humaneval|mbpp] \
--model_path ${model_path} \
--samples_dir ${samples_dir} \
--gpu ${gpu_id} \
--max_new_tokens 1024
# Instruction Model
python benchmark/generate.py \
--dataset [humaneval|mbpp] \
--model_path ${model_path} \
--samples_dir ${samples_dir} \
--gpu ${gpu_id} \
--max_new_tokens 1024 \
--instruct
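For reference, greedy decoding simply means generating without sampling. A minimal sketch using the Hugging Face transformers API (illustrative only, separate from benchmark/generate.py; the checkpoint name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "deepseek-ai/deepseek-coder-6.7b-base"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=1024)  # greedy
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```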
After generating the code samples, the next step is to sanitize them. This involves cleaning up the generated code to ensure it is in a valid and executable format. Sanitization helps to remove any potential errors or inconsistencies that might affect the evaluation results.
python benchmark/sanitize.py --sample_file ${sample_file}
The final step is to evaluate the model's performance based on the sanitized samples.
python benchmark/evaluate.py --dataset [humaneval|mbpp] --samples ${sample_file}
This command will compute the metrics and provide a detailed report on the model's performance.
If you find CodeACT helpful, please cite it as follows:
@misc{lv2024codeactcodeadaptivecomputeefficient,
title={CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs},
author={Weijie Lv and Xuan Xia and Sheng-Jun Huang},
year={2024},
eprint={2408.02193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.02193},
}