This section is based on our setup experience with Red Hat Enterprise Linux 7.5 and CentOS Linux release 7.5.1804. Please run on a Linux OS. It is strongly recommended to run with at least one GPU.
Before setting up the project, please make sure you have the following prerequisite installations and setups in place.
Ensure you have Anaconda3 installed. If not, install Python 3.7 from Anaconda with the following steps:
- Install the list of dependencies described here.
- Download the installer here. For example, you can use the `wget` command: `wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh`, then type `chmod +x Anaconda3-2021.05-Linux-x86_64.sh` and run `bash Anaconda3-2021.05-Linux-x86_64.sh` to complete the installation.
- You may need to add the Anaconda directory to the `PATH` environment variable (e.g., you can add `export PATH="/path_to_anaconda/anaconda3/bin:$PATH"` to the `.bashrc` file). A consolidated sketch of these steps follows this list.
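Put together, the Anaconda installation amounts to the short sketch below; the installer version is the one referenced above, and the install prefix in the `PATH` line is a placeholder you should adjust to your system:

```bash
# Download and run the Anaconda3 installer referenced above
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
chmod +x Anaconda3-2021.05-Linux-x86_64.sh
bash Anaconda3-2021.05-Linux-x86_64.sh   # follow the interactive prompts

# Optionally add Anaconda to your PATH (adjust /path_to_anaconda to your install location)
echo 'export PATH="/path_to_anaconda/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```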
- If you are using an HPC cluster, run the following command to enable Python 3.7 with CUDA: `module load cuda/9.2 anaconda3/5.0.1-cuda92`
- If you are using a local machine and already have Anaconda set up, run `conda env create -f evil_env_gpu.yml`. Upon completion, activate it using `conda activate evil_env`. Alternatively to our environment file, you can run `conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia` followed by `pip3 install -r requirements_gpu.txt --user`. After this, you can move to the Install Natural Language tools section (a condensed sketch of the local setup follows this list).
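On a local GPU machine, the environment setup above condenses into a few commands; the environment name `evil_env` and the file names are the ones referenced in the steps, so nothing here is new beyond combining them:

```bash
# Option A: recreate the provided GPU environment and activate it
conda env create -f evil_env_gpu.yml
conda activate evil_env

# Option B: build an equivalent environment manually instead of using the .yml file
# conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia
# pip3 install -r requirements_gpu.txt --user
```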
- Move to the EVIL main directory.
- It is recommended that you use a virtual environment (a Conda environment) for the dependency setup. If you do not wish to do so, simply run `pip3 install -r requirements.txt --user`.
- Import our saved conda environment using the command `conda env create -f evil_env.yml` and activate it using `source activate evil_env` or `conda activate evil_env`.
- Alternatively, you can create an Anaconda Python 3.7 virtual environment using the command `conda create -n yourenvname python=3.7 anaconda`. Activate the environment by typing `source activate yourenvname`.
- Run `pip3 install -r requirements.txt --user` to install the dependencies.
- Install the nltk tokenizers and corpora: run `python -m nltk.downloader`, then type `d` (Download), type `all` at the Identifier prompt, and type `q` at the end of the installation (a non-interactive alternative is sketched after this list).
- Install the spaCy language model with the following command: `python -m spacy download en_core_web_lg`
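If you prefer a non-interactive setup of the natural language tools, the nltk corpora can be fetched directly from the command line; the sketch below assumes the standard command-line form of `nltk.downloader` and mirrors the interactive steps above:

```bash
# Download all nltk tokenizers and corpora without the interactive prompt
python -m nltk.downloader all

# Install the spaCy English language model used by the project
python -m spacy download en_core_web_lg
```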
This section briefly describes how to replicate the experiments mentioned in the paper. If you are using an Anaconda environment, please ensure that your conda environment is activated before running any of the bash commands below.
To launch the fine-tuning and evaluation processes of CodeBERT, the basic command template is as follows:
`bash CodeBERT_Launch.sh [DEVICE] [DATASET] [PREPROCESSING]`
Device Options:
- `0`: Local machine
- `1`: HPC with a SLURM scheduler
- `2`: HPC with a TORQUE scheduler
Dataset Options:
- Python Encoder Dataset
- Assembly Decoder Dataset
Preprocessing Options:
- Raw corpus counts
- Preprocessing without the Intent Parser (IP)
- Preprocessing with the Intent Parser (IP)
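For example, a hypothetical local run might look like the command below; the literal values accepted for `[DATASET]` and `[PREPROCESSING]` are defined in `CodeBERT_Launch.sh`, so the `encoder` and `2` arguments here are only illustrative assumptions:

```bash
# Hypothetical invocation: device 0 = local machine; the dataset and
# preprocessing arguments are assumed values -- check CodeBERT_Launch.sh
# for the exact options it accepts.
bash CodeBERT_Launch.sh 0 encoder 2
```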
Local machine:
- From the EVIL home directory, run `bash CodeBERT_Launch.sh 0 [DATASET] [PREPROCESSING]`
HPC with a SLURM scheduler:
- Navigate to `EVIL/model/fine_tune.slurm` and add your GPU queue name under the TODO comment.
- From the EVIL home directory, run `bash CodeBERT_Launch.sh 1 [DATASET] [PREPROCESSING]`
- When the job is complete, from the EVIL home directory, run `bash evaluate.sh`
- Note: if your cluster jobs do not connect to the internet, you may want to run the bash script on the head node using the local machine option, `bash CodeBERT_Launch.sh 0 [DATASET] [PREPROCESSING]`, to download the models, and terminate it before it reaches the training portion (you'll see a progress bar right before training starts).
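Once the SLURM job is submitted, it can be tracked with the standard SLURM client tools; the log files are named with the job id (see the notes at the end of this section), so the wildcard below is only a convenience:

```bash
# Check the status of your submitted SLURM jobs
squeue -u "$USER"

# Job logs are written under model/job_logs/, named with the job id
ls model/job_logs/
tail -f model/job_logs/*<job_id>*   # replace <job_id> with your actual job id
```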
HPC with a TORQUE scheduler:
- Navigate to `EVIL/model/fine_tune.pbs` and add your GPU queue name under the TODO comment.
- From the EVIL home directory, run `bash CodeBERT_Launch.sh 2 [DATASET] [PREPROCESSING]`
- When the job is complete, from the EVIL home directory, run `bash evaluate.sh`
- Note: if your cluster jobs do not connect to the internet, you may want to run the bash script on the head node using the local machine option, `bash CodeBERT_Launch.sh 0 [DATASET] [PREPROCESSING]`, to download the models, and terminate it before it reaches the training portion (you'll see a progress bar right before training starts).
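The equivalent check on a TORQUE cluster uses the standard `qstat` client, with the logs again landing in `model/job_logs/`:

```bash
# Check the status of your submitted TORQUE jobs
qstat -u "$USER"

# Job logs are written under model/job_logs/, named with the job id
ls model/job_logs/
```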
The final evaluation results will appear on your console if you are running on your local machine, and in the specified logging output directory if a job was submitted. The predicted output will be generated in the subdirectory `model/eval/[encoder/decoder]_test_output.json`.
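To take a quick look at the predictions, any JSON pretty-printer works; the sketch below assumes you ran the encoder dataset, so adjust the file name to `decoder_test_output.json` otherwise:

```bash
# Pretty-print the first part of the generated predictions (encoder dataset assumed)
python -m json.tool model/eval/encoder_test_output.json | head -n 40
```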
To launch the training and evaluation of the Seq2Seq model mentioned in the paper, also ensure that the conda environment is active. The basic command template is as follows:
`bash Seq2Seq_Launch.sh [DATASET] [PREPROCESSING]`
The dataset and preprocessing options are the same as those of CodeBERT.
The final evaluation results will appear on your console if you are running on your local machine, and in the specified logging output directory `seq2seq/logs`. The predicted output will be generated in the subdirectory `seq2seq/archive/id-[timestamp]/answer_[encoder/decoder].txt`.
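Since the archive directory name includes a run timestamp, a glob is a convenient way to locate the generated answers after a Seq2Seq run:

```bash
# List the archived Seq2Seq runs and peek at the generated answers
ls seq2seq/archive/
head seq2seq/archive/id-*/answer_*.txt
```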
Additional notes:
- Run `bash utils/test_split.sh` for details on the different preprocessing options.
- If you chose to submit a job, the logs will be stored in `model/job_logs/`, named with the job id.
- Run `bash utils/test_split.sh [DATASET] 0` for raw corpus token counts.