CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

Paper: https://arxiv.org/abs/2408.02193

Contents

  • Overview
  • Components
  • Results
  • Usage
  • Citation

Overview

Motivated by the need for more effective and efficient training, we propose the Code Adaptive Compute-efficient Tuning (CodeACT) framework. CodeACT introduces the Complexity and Diversity Aware Sampling (CDAS) method to select high-quality training data based on complexity and diversity, and the Dynamic Pack padding strategy to reduce computational resource usage by minimizing padding tokens during training.

Experimental results demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves an 8.6% performance increase on HumanEval, reduces training time by 78%, and decreases peak GPU memory usage by 27%.

Components

Complexity and Diversity Aware Sampling

An overview of our proposed CDAS method, comprising three steps from top to bottom:

  • Step 1: Clustering the EVOL-Instruct dataset to form multiple clusters.
  • Step 2: Computing the Instruction-Following Difficulty (IFD) score by comparing the model's perplexity on each response with and without its instruction (see the sketch below).
  • Step 3: Sampling the top m% instances from each re-ranked cluster to form a high-complexity sub-dataset that preserves data diversity.

Finally, we use the selected data for fine-tuning to obtain CodeACT-Coder.
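
For intuition, the IFD score in Step 2 can be viewed as the ratio between the model's perplexity on a response when conditioned on its instruction and its perplexity on the response alone. The following is a minimal Python sketch of that idea, not the released implementation (which lives in CDAS/calculate_ifd.py); the model path is a placeholder.

# Sketch of the IFD score: perplexity of the answer with vs. without the instruction.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "deepseek-ai/deepseek-coder-6.7b-base"  # placeholder: any causal Code LLM
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda().eval()

def answer_perplexity(instruction: str, answer: str) -> float:
    """Perplexity of the answer tokens, optionally conditioned on an instruction."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids.cuda() if instruction else None
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
    input_ids = answer_ids if prompt_ids is None else torch.cat([prompt_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    if prompt_ids is not None:
        labels[:, : prompt_ids.shape[1]] = -100  # score only the answer tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean cross-entropy over answer tokens
    return math.exp(loss.item())

def ifd_score(instruction: str, answer: str) -> float:
    # Higher IFD means the instruction helps less, i.e. the sample is more complex.
    return answer_perplexity(instruction, answer) / answer_perplexity("", answer)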

Dynamic Pack

Illustration of different padding strategies, where the blank squares represent padding tokens:

  • Top: Traditional padding strategy aligns samples to the model's maximum input length, resulting in high computational resource consumption.
  • Middle: Dynamic padding strategy reduces the number of padding tokens by aligning samples to the length of the longest sample in each batch.
  • Bottom: Our proposed Dynamic Pack strategy sorts samples by length and concatenates multiple samples within a batch, further improving utilization of the model's maximum input length and reducing padding tokens (see the sketch after this list).
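
The sketch below illustrates the Dynamic Pack idea with a simple greedy pass over length-sorted samples; the exact packing heuristic and sort order used in the repository are selected through the dynamic_pack pad_mode, so treat this as illustrative only.

# Illustrative greedy packing of length-sorted, tokenized samples.
def dynamic_pack(samples: list[list[int]], max_seq_len: int = 2048) -> list[list[int]]:
    """Sort samples by length, then concatenate them greedily so each packed
    sequence stays within max_seq_len, minimizing padding tokens."""
    packed, current = [], []
    for sample in sorted(samples, key=len):
        if len(current) + len(sample) <= max_seq_len:
            current += sample        # still fits: append to the current pack
        else:
            packed.append(current)   # flush the pack and start a new one
            current = list(sample)
    if current:
        packed.append(current)
    return packed

# Example: with max_seq_len = 8, samples of lengths [3, 5, 2, 4] pack into three
# rows of lengths [5, 4, 5] instead of four rows each padded to length 8.
packs = dynamic_pack([[1] * 3, [2] * 5, [3] * 2, [4] * 4], max_seq_len=8)
print([len(p) for p in packs])  # -> [5, 4, 5]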

Results

RQ1: How does the CodeACT framework perform across different datasets and models?

We test the CodeACT framework's performance and efficiency on the OSS-Instruct and EVOL-Instruct datasets using the CodeLlama and DeepSeek-Coder models. The results show that CodeACT reduces computational cost (training time and peak GPU memory) while maintaining or improving model performance, highlighting its potential for efficient and effective model training.

Experiment 1

RQ2: How does the performance of models trained with CodeACT compare to other models?

The open-source models trained with CodeACT show significant performance improvements. Notably, CodeACT-DS-6.7B, trained on fewer samples, surpasses models trained on larger datasets, demonstrating the effectiveness of the CDAS data selection method. These results indicate that CodeACT not only enhances model performance but also narrows the gap between open-source and closed-source models, making it a valuable tool for advancing Code LLMs through optimized data selection and training.

Experiment 2

Usage

Installation

1. Install required packages

  • Python 3.11.9
  • PyTorch 2.3.0
  • CUDA 12.1
conda create -n codeact python=3.11
conda activate codeact
pip install -r requirements.txt

2. Install additional packages for training

Before installing the following packages, please review the installation instructions and requirements on their respective repositories to ensure compatibility with your environment.

pip install deepspeed==0.14.2
pip install flash-attn==2.5.9.post1 --no-build-isolation

Datasets

Data Selection

Our proposed CDAS method is built on the premise that both the complexity and diversity of the data must be considered to make model training more efficient and effective.

1. Calculate IFD scores

To efficiently handle large datasets, we support processing data in chunks. Each chunk is defined by the index and nums parameters, which determine the range of data to be processed.

Note:

  • For EVOL-Instruct dataset, use instruction for question_column_name and output for answer_column_name.
  • For OSS-Instruct dataset, use problem for question_column_name and solution for answer_column_name.
# Calculate IFD scores for a single chunk with indices from 0 to 10,000
python CDAS/calculate_ifd.py \
--data_path ${data_path} \
--question_column ${question_column_name} \
--answer_column ${answer_column_name} \
--save_dir ${save_dir} \
--model_path ${model_path} \
--gpu ${gpu_id} \
--index 0 \
--nums 10000

After processing all chunks, merge the results using the following command:

# Merge all processed chunks
python CDAS/calculate_ifd.py \
--save_dir ${save_dir} \
--num_split ${num} \
--merge

2. Get embeddings using the all-mpnet-base-v2 model

python CDAS/get_embedding.py \
--data_path ${data_path} \
--save_dir ${save_dir} \
--batch_size 512 \
--model_path ${embedding_model_path} \
--gpu ${gpu_id}
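
Conceptually, this step encodes each instruction with a sentence-embedding model and stores the vectors as a NumPy array that the selection step consumes via --npy_path. The sketch below assumes sentence-transformers and placeholder paths; the released script is CDAS/get_embedding.py.

# Sketch of the embedding step with all-mpnet-base-v2 (paths and data are placeholders).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
instructions = [
    "Write a function to reverse a string.",
    "Implement binary search over a sorted list.",
]  # placeholder instructions loaded from the dataset in practice
embeddings = model.encode(instructions, batch_size=512, show_progress_bar=True)
np.save("outputs/embeddings.npy", embeddings)  # consumed later via --npy_path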

3. Sample data

python CDAS/select_data.py \
--data_path ${data_path} \
--npy_path ${npy_path} \
--save_path ${save_path} \
--ratio 0.4
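
Putting the pieces together, data selection clusters the instruction embeddings (Step 1), re-ranks each cluster by IFD score, and keeps the top portion of each cluster (Step 3). The sketch below illustrates this flow under assumed file layouts, field names, and cluster count; the released implementation is CDAS/select_data.py.

# Illustrative CDAS selection: cluster embeddings, keep the top 40% of each cluster by IFD.
import json
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.load("outputs/embeddings.npy")        # from the embedding step
with open("outputs/data_with_ifd.json") as f:          # assumption: each item carries an "ifd" field
    data = json.load(f)
ratio, n_clusters = 0.4, 100                           # assumption: 100 clusters

labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

selected = []
for c in range(n_clusters):
    cluster_idx = np.where(labels == c)[0]
    # Re-rank the cluster by IFD and keep the highest-complexity `ratio` of samples.
    ranked = sorted(cluster_idx, key=lambda i: data[i]["ifd"], reverse=True)
    selected.extend(ranked[: max(1, int(len(ranked) * ratio))])

with open("outputs/selected_data.json", "w") as f:
    json.dump([data[int(i)] for i in selected], f, indent=2)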

Training

Parameters:

  • model_path: Path to the pretrained models.
  • data_path: Path to the json training data.
  • model_type: The type of model being used. Supported values are codellama and deepseek.
  • pad_mode: The padding mode to use. Supported values are dynamic_pad and dynamic_pack.
  • output_dir: Path to save training files.

1. Fully Sharded Data Parallel (FSDP)

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

accelerate launch --config_file "configs/fsdp_config.yaml"  train.py \
--model_name_or_path ${model_path} \
--use_fast_tokenizer True \
--use_flash_attn True \
--data_path  ${data_path} \
--model_type ${model_type} \
--max_seq_len 2048 \
--pad_mode ${pad_mode} \
--output_dir ${output_dir} \
--overwrite_output_dir True \
--gradient_checkpointing True \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine" \
--warmup_steps 15 \
--save_strategy "no" \
--bf16 True \
--tf32 True \
--logging_strategy "steps" \
--logging_steps 50
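
With 8 GPUs, a per-device batch size of 16, and 4 gradient accumulation steps, this corresponds to an effective batch size of 16 × 4 × 8 = 512 sequences per optimizer update.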

2. Parameter Efficient FineTuning (PEFT)

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# LoRA
torchrun --nproc_per_node 8 train.py \
--model_name_or_path ${model_path} \
--use_fast_tokenizer True \
--use_flash_attn True \
--use_peft_lora True \
--data_path  ${data_path} \
--model_type ${model_type} \
--max_seq_len 2048 \
--pad_mode ${pad_mode} \
--output_dir ${output_dir} \
--overwrite_output_dir True \
--gradient_checkpointing True \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine" \
--warmup_steps 15 \
--save_strategy "no" \
--bf16 True \
--tf32 True \
--logging_strategy "steps" \
--logging_steps 50

# QLoRA
torchrun --nproc_per_node 8 train.py \
--model_name_or_path ${model_path} \
--use_fast_tokenizer True \
--use_flash_attn True \
--use_peft_lora True \
--use_4bit_quantization \
--data_path  ${data_path} \
--model_type ${model_type} \
--max_seq_len 2048 \
--pad_mode ${pad_mode} \
--output_dir ${output_dir} \
--overwrite_output_dir True \
--gradient_checkpointing True \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine" \
--warmup_steps 15 \
--save_strategy "no" \
--bf16 True \
--tf32 True \
--logging_strategy "steps" \
--logging_steps 50
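
The only difference from the LoRA command is --use_4bit_quantization, which loads the frozen base model in 4-bit precision so the LoRA adapters can be trained with a substantially smaller memory footprint.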

Evaluation

This section details the evaluation of the model's performance on the HumanEval(+) and MBPP(+) benchmarks.

1. Generate samples

In this step, we use the model to generate code samples with greedy decoding: we feed the model input prompts and collect the generated outputs.

# Base Model
python benchmark/generate.py \
--dataset [humaneval|mbpp] \
--model_path ${model_path} \
--samples_dir ${samples_dir} \
--gpu ${gpu_id} \
--max_new_tokens 1024

# Instruction Model
python benchmark/generate.py \
--dataset [humaneval|mbpp] \
--model_path ${model_path} \
--samples_dir ${samples_dir} \
--gpu ${gpu_id} \
--max_new_tokens 1024 \
--instruct
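
For reference, greedy decoding in this step corresponds to calling generate with sampling disabled, roughly as in the minimal sketch below; the model path and prompt are placeholders, and benchmark/generate.py handles the benchmark-specific prompt formatting.

# Minimal greedy-decoding sketch with transformers (placeholder model path and prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "outputs/CodeACT-DS-6.7B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda().eval()

prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)  # greedy decoding
completion = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(completion)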

2. Sanitize samples

After generating the code samples, the next step is to sanitize them: the generated code is cleaned up into a valid, executable format, removing artifacts and inconsistencies that could skew the evaluation results.

python benchmark/sanitize.py --sample_file ${sample_file}

3. Evaluate performance

The final step is to evaluate the model's performance based on the sanitized samples.

python benchmark/evaluate.py --dataset [humaneval|mbpp] --samples ${sample_file}

This command will compute the metrics and provide a detailed report on the model's performance.
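
With greedy decoding, the reported metric is typically pass@1, i.e. the fraction of problems whose single generated solution passes all test cases.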

Citation

If you find CodeACT helpful, please cite it as follows:

@misc{lv2024codeactcodeadaptivecomputeefficient,
      title={CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs}, 
      author={Weijie Lv and Xuan Xia and Sheng-Jun Huang},
      year={2024},
      eprint={2408.02193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.02193}, 
}
