This project is the PyTorch implementation for the paper An Empirical Study of the Imbalance Issue in Software Vulnerability Detection.
- Dataset
- Source code for CodeBERT
- Source code for GraphCodeBERT
Python==3.7
pytorch==1.7.1
torchvision==0.8.2
tree-sitter==0.20.1
transformers==4.24.0
tqdm
numpy
All datasets provide function-level source code and come from three open-source repositories:
- CodeXGLUE provides the devign dataset.
- Devign provides the ffmpeg and qemu datasets.
- Lin2018 provides the Asterisk, FFmpeg, LibPNG, LibTIFF, Pidgin, and VLC datasets.
Each dataset includes training, validation, and test sets (*_train.jsonl, *_valid.jsonl, *_test.jsonl).
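Since the study is about class imbalance, it can help to inspect the label distribution of a split before training. The following is a minimal sketch, assuming each line of a `.jsonl` file is a JSON object whose label is stored under a `target` key. That key name is an assumption borrowed from the CodeXGLUE defect-detection format and may differ for other sources; `count_labels` and `imbalance_ratio` are hypothetical helpers, not part of this repository.

```python
import json
from collections import Counter


def count_labels(path, label_key="target"):
    """Count how many examples carry each label in a JSON Lines file.

    label_key is an assumption: change it to the field your split uses.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            counts[record[label_key]] += 1
    return counts


def imbalance_ratio(counts):
    """Return the majority-class size divided by the minority-class size."""
    return max(counts.values()) / min(counts.values())
```

Calling `count_labels` on a training split and passing the result to `imbalance_ratio` gives a quick indication of how skewed the split is.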
GraphCodeBERT relies on tree-sitter to parse code snippets and extract variable names, so the parser must be built first. Build it with the following commands:
cd graphcoderbert/python_parser/parser_folder
bash build.sh
CodeBERT and GraphCodeBERT use the same commands for training, evaluation, and testing. The commands below use CodeBERT as an example.
python run.py \
--do_train \
--training standard \
--data_root devign \
--project_name qemu \
--epochs 50 \
--evaluate_during_training \
--seed 123456
python run.py \
--do_eval \
--training standard \
--data_root devign \
--project_name qemu
python run.py \
--do_test \
--training standard \
--data_root devign \
--project_name qemu
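To compare several imbalance solutions on the same project, the commands above can be generated in a loop. This is a minimal sketch that uses only the flags shown in this README; `build_command` is a hypothetical helper, and the script only prints the commands rather than running them.

```python
# Strategy names taken from the --training choices documented in this README.
STRATEGIES = ["standard", "weight", "cbl", "augmentation",
              "down", "focal", "over", "threshold"]


def build_command(mode, training, data_root, project_name, extra=()):
    """Assemble an argument list for run.py.

    mode is "train", "eval", or "test" and maps to --do_train,
    --do_eval, or --do_test respectively.
    """
    if mode not in ("train", "eval", "test"):
        raise ValueError("unknown mode: {}".format(mode))
    return ["python", "run.py", "--do_{}".format(mode),
            "--training", training,
            "--data_root", data_root,
            "--project_name", project_name] + list(extra)


if __name__ == "__main__":
    train_flags = ["--epochs", "50", "--evaluate_during_training",
                   "--seed", "123456"]
    for strategy in STRATEGIES:
        cmd = build_command("train", strategy, "devign", "qemu", train_flags)
        # To execute instead of print: subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```

Building an argument list rather than a single string avoids shell quoting problems if the list is later handed to `subprocess.run`.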
Parameter setting:
- --training: the solution used to address the imbalance issue.
  - Choices:
    - standard: use the default setting of CodeBERT and GraphCodeBERT.
    - weight: use the mean false error loss.
    - cbl: use the class-balanced loss.
    - augmentation: use the adversarial attack-based augmentation (re-sampled data are created in the dataset folder; you can also generate them with the code in dataset/function-level/identifyP/augment.py).
    - down: use random down-sampling.
    - focal: use the focal loss.
    - over: use random over-sampling (re-sampled data are created in the dataset folder; you can also generate them with the code in dataset/function-level/identifyP/augment_du.py).
    - threshold: use threshold-moving.
- --data_root: the source of the data.
  - Choices: codexglue, devign, lin2018
- --project_name: the name of the dataset.
  - Choices: please check the names in dataset/function-level/ for each source.
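To make three of the `--training` choices concrete, here is a minimal pure-Python sketch of the commonly published formulations of the focal loss, class-balanced weighting, and threshold-moving for a binary task. It is an illustration under stated assumptions, not the code used in this repository: the function names are hypothetical, the positive class is assumed to mean "vulnerable", and the default values of `gamma`, `alpha`, and `beta` are illustrative rather than the settings used in the paper.

```python
import math


def focal_loss(p_vulnerable, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p_vulnerable: predicted probability of the positive class.
    label: 1 for the positive class, 0 otherwise.
    """
    p_t = p_vulnerable if label == 1 else 1.0 - p_vulnerable
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights (1 - beta) / (1 - beta**n), where n is the class size.

    The weights are rescaled so that they sum to the number of classes.
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def predict_with_threshold(p_vulnerable, threshold=0.5):
    """Threshold-moving: predict the positive class when p >= threshold."""
    return 1 if p_vulnerable >= threshold else 0
```

The focal loss shrinks the contribution of examples the model already classifies well, class-balanced weighting gives rarer classes larger loss weights, and threshold-moving leaves training untouched and shifts the decision boundary at prediction time.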