CodeExecutor

This repo provides the code for reproducing the experiments in Code Execution with Pre-trained Language Models. CodeExecutor is a pre-trained model that learns to predict the execution trace of a program, trained with a code execution pre-training task and curriculum learning.

The pre-trained checkpoint of CodeExecutor is available on Huggingface.
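
If you only want to try the released weights, they can be loaded with the transformers library. Below is a minimal sketch, assuming the checkpoint name microsoft/codeexecutor used by the inference scripts in this repo and that the generic Auto classes resolve it; the run.py scripts wrap the model with their own handling of the <line>/<state> trace markers.

from transformers import AutoTokenizer, AutoModel

# Load the released CodeExecutor checkpoint from the Hugging Face Hub.
# (Checkpoint name taken from the inference scripts below; the exact model
# class used by run.py may differ from the generic AutoModel.)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codeexecutor")
model = AutoModel.from_pretrained("microsoft/codeexecutor")

code = "s = ['x', 'y', 'z']"
inputs = tokenizer(code, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings for the code tokens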

Our dataset is available on Zenodo.

1. Dependencies

  • pip install torch
  • pip install transformers
  • pip install python-Levenshtein
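
torch and transformers cover the model code; python-Levenshtein is presumably used when comparing generated traces against gold traces. A quick, hypothetical example of what the package provides:

import Levenshtein  # installed as python-Levenshtein

a = "s : [ x , y ]"
b = "s : [ x , y , z ]"
print(Levenshtein.distance(a, b))  # raw edit distance between two trace strings
print(Levenshtein.ratio(a, b))     # normalized similarity in [0, 1]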

2. Data

The Python Code Execution datasets are a series of datasets that follow an easy-to-hard paradigm: the SingleLine dataset, the Tutorial dataset, and the CodeNetMut dataset. We provide the test sets of all three on Zenodo.

Demo data (simplified version):

{
    "id": 0,  
    "code": "s = ['x', 'y', 'z']",  
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],  
    "trace": ["<line> <0> <state> s : [ x , y , z ] </state>"],
    "trace_tokens": ["<line>", "<0>", "<state>", "s", ":", "[", "x", ",", "y", ",", "z", "]", "</state>"]

}

We also construct a new dataset for the zero-shot code-to-code search task by collecting 9,987 Python functions from CodeNet. Each function solves one of 48 problems.

Demo data (simplified version):

{
    "id": 0,  
    "code_id": "s204511158", 
    "problem_id": 340, # solve which problem
    "original_code": "s = list(input())", # code without providing the test case
    "code": "s = ['x', 'y', 'z']",  # code provided with a test case
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],  
    "trace": ["<line> <0> <state> s : [ x , y , z ] </state>"],
    "trace_tokens": ["<line>", "<0>", "<state>", "s", ":", "[", "x", ",", "y", ",", "z", "]", "</state>"]
}
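
The trace format is line-oriented: each <line> token is followed by the executed line id, and each <state> ... </state> span lists the variable values after that line. Below is a minimal sketch of reading records and splitting a trace into (line, state) pairs; it assumes the test files are JSON Lines with one record per line, which may differ from how the loaders in run.py read them.

import json

def parse_trace(trace_tokens):
    """Split flat trace tokens into (line_id, state_string) pairs."""
    steps = []
    i = 0
    while i < len(trace_tokens):
        if trace_tokens[i] == "<line>":
            line_id = trace_tokens[i + 1]
            # everything between <state> and the matching </state>
            j = trace_tokens.index("</state>", i)
            state = " ".join(trace_tokens[i + 3:j])
            steps.append((line_id, state))
            i = j + 1
        else:
            i += 1
    return steps

with open("../data/codenetmut_test.json") as f:  # adjust the path as needed
    record = json.loads(f.readline())
    for line_id, state in parse_trace(record["trace_tokens"]):
        print(line_id, "->", state)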

3. Pre-training

# prepare model checkpoint and datasets
cd pretrain
bash run.sh

A demo bash script (run.sh) is shown:

# Change the arguments as required:
#   output_dir: the output directory to save checkpoints and training outputs
#   data_cache_dir: the output directory to save the data cache
#   train_data_path / another_train_data_path / third_train_data_path:
#       the pre-training files for the curriculum stages (CodeNetMut, Tutorial, SingleLine)
#   eval_data_path: the path of the test file
#   model_name_or_path: the checkpoint to initialize from (here, the previous curriculum stage)

PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=${PER_NODE_GPU} run.py \
    --output_dir ../saved_models/pretrain_codeexecutor_stage_3 \
    --data_cache_dir ../saved_models/pretrain_codeexecutor_stage_3 \
    --train_data_path /drive/pretrain_codenetmut.json \
    --another_train_data_path /drive/pretrain_tutorial.json \
    --third_train_data_path /drive/single_line_hard_3_million.json \
    --eval_data_path ../data/codenetmut_test.json \
    --model_name_or_path ../saved_models/pretrain_codeexecutor_stage_2 \
    --block_size 1024 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 4e-4 \
    --node_index=0 \
    --gpu_per_node $PER_NODE_GPU \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123

4. Inference

Please download the datasets first, then unzip them and move them to ./data.

# prepare model checkpoint and datasets
cd inference
bash run.sh

A demo bash script (run.sh) is shown:

# Change the arguments as required:
#   prefix: dataset type (codenet/tutorial/singleline)
#   output_dir: the output directory to save inference results
#   data_cache_dir: the output directory to save the data cache 
#   eval_data_path: the path of the test file
#   model_name_or_path: the path of the model to be evaluated

CUDA_VISIBLE_DEVICES=0 python run.py \
    --prefix codenet \
    --output_dir ../../saved_models/inference \
    --data_cache_dir ../../saved_models/inference \
    --eval_data_path ../data/codenetmut_test.json \
    --model_name_or_path microsoft/codeexecutor \
    --block_size 1024 \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 16 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --node_index 0 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123456
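
The predicted traces are written under output_dir. As a rough sanity check, they can be compared against the gold traces with python-Levenshtein; the sketch below assumes a predictions file with one generated trace per line in test-set order (the file name and format here are hypothetical; the official metrics are computed inside run.py).

import json
import Levenshtein

pred_path = "../../saved_models/inference/preds.txt"  # hypothetical output location
gold_path = "../data/codenetmut_test.json"

with open(pred_path) as f:
    preds = [line.strip() for line in f]

exact, ratios = 0, []
with open(gold_path) as f:
    for pred, line in zip(preds, f):
        gold = " ".join(json.loads(line)["trace_tokens"])
        exact += int(pred == gold)
        ratios.append(Levenshtein.ratio(pred, gold))

print(f"exact match: {exact}/{len(ratios)}, mean similarity: {sum(ratios) / len(ratios):.3f}")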

5. Downstream tasks

We apply CodeExecutor to code intelligence tasks such as zero-shot code-to-code search. Here we provide example code in which the baseline model is UniXcoder.

First, generate traces for the code-to-code search test set. We provide the prediction file code_to_code_search_preds.txt on Zenodo.

Alternatively, use the following script to generate the prediction file (it will be saved to ../saved_models/code_to_code_search/preds.txt).

# prepare model checkpoint and datasets
cd inference

CUDA_VISIBLE_DEVICES=0 python run.py \
    --prefix codenet \
    --output_dir ../saved_models/code_to_code_search \
    --data_cache_dir ../saved_models/code_to_code_search \
    --eval_data_path ../data/code_to_code_search_test.json \
    --model_name_or_path microsoft/codeexecutor \
    --block_size 1024 \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 16 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --node_index 0 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123456

Second, use the program outputs extracted from the execution traces generated by CodeExecutor to facilitate the code-to-code search task.

cd downstream
bash run.sh

A demo bash script (run.sh) is shown:

# Change the arguments as required:
#   trace_file: the path to the prediction file either downloaded or generated in the last step

source_lang=python
target_lang=python
python run.py \
    --model_name_or_path microsoft/unixcoder-base  \
    --query_data_file ../data/code_to_code_search_test.json \
    --candidate_data_file ../data/code_to_code_search_test.json \
    --trace_file ../data/code_to_code_search_preds.txt \
    --query_lang ${source_lang} \
    --candidate_lang ${target_lang} \
    --code_length 512 \
    --eval_batch_size 256 
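
For intuition, zero-shot code-to-code search amounts to embedding the query and candidate functions and ranking candidates by similarity, so that functions solving the same CodeNet problem rank highest. The sketch below illustrates that setup with mean-pooled UniXcoder embeddings and cosine similarity; it is only an illustration, and the trace-augmented scoring implemented in run.py may differ.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")
model.eval()

def embed(code):
    """Mean-pooled UniXcoder embedding for a code snippet (illustrative pooling choice)."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

query = "s = list(input())"
candidates = ["s = list(input())\nprint(len(s))",
              "print(sum(map(int, input().split())))"]

q = embed(query)
scores = [torch.nn.functional.cosine_similarity(q, embed(c), dim=0).item() for c in candidates]
# Rank candidates by similarity; the real evaluation groups functions by problem_id.
for score, cand in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {cand!r}")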

Reference

If you use this code or CodeExecutor, please consider citing us.

@article{liu2023code,
  title={Code Execution with Pre-trained Language Models},
  author={Liu, Chenxiao and Lu, Shuai and Chen, Weizhu and Jiang, Daxin and Svyatkovskiy, Alexey and Fu, Shengyu and Sundaresan, Neel and Duan, Nan},
  journal={arXiv preprint arXiv:2305.05383},
  year={2023}
}