Yuejun-GUO/vulnerability-detection

 
 

software-vulnerability-detection-imbalance

This project is the PyTorch implementation of the paper An Empirical Study of the Imbalance Issue in Software Vulnerability Detection.

Project Overview

  1. Dataset
  2. Source code for CodeBERT
  3. Source code for GraphCodeBERT

Environment

 Python==3.7
 pytorch==1.7.1
 torchvision==0.8.2
 tree-sitter==0.20.1
 transformers==4.24.0
 tqdm
 numpy

Dataset

All datasets provide function-level source code and come from three open-source repositories:

CodeXGLUE provides the devign dataset.

Devign provides the ffmpeg and qemu datasets.

Lin2018 provides the Asterisk, FFmpeg, LibPNG, LibTIFF, Pidgin, and VLC datasets.

Each dataset includes training, validation, and test sets (*_train.jsonl, *_valid.jsonl, *_test.jsonl).
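Because the splits are JSON Lines files, the class balance of a split can be checked before training. The following is a minimal sketch, not code from this repository. It assumes each line is a JSON object whose label is stored under a `target` key (as in the CodeXGLUE defect-detection format); the key name and the helper names `load_jsonl`, `label_distribution`, and `imbalance_ratio` are assumptions for illustration, so adjust them to the actual files.

```python
import json
from collections import Counter


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def label_distribution(records, label_key="target"):
    """Count records per label value. `label_key` is an assumed field name."""
    return Counter(record[label_key] for record in records)


def imbalance_ratio(distribution):
    """Ratio of the largest class count to the smallest class count."""
    counts = list(distribution.values())
    return max(counts) / min(counts)
```

Calling `imbalance_ratio(label_distribution(load_jsonl(path)))` on a training split gives a quick indication of how skewed that split is.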

Run

GraphCodeBERT relies on tree-sitter to parse code snippets and extract variable names, so the parser must be built first. Build tree-sitter using the following commands:

cd graphcoderbert/python_parser/parser_folder
bash build.sh

CodeBERT and GraphCodeBERT use the same commands for training, validation, and testing. We use CodeBERT as an example.

Fine-tuning

python run.py \
    --do_train \
    --training standard \
    --data_root devign \
    --project_name qemu \
    --epochs 50 \
    --evaluate_during_training \
    --seed 123456

Validation

python run.py \
    --do_eval \
    --training standard \
    --data_root devign \
    --project_name qemu

Test

python run.py \
    --do_test \
    --training standard \
    --data_root devign \
    --project_name qemu
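
To repeat the three steps above for several settings, the command lines can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository. It uses only the flags shown in the commands above, and the function names `build_command` and `format_command` are illustrative assumptions. It only builds and prints the commands; running them is left to the caller.

```python
import shlex


def build_command(mode, training="standard", data_root="devign",
                  project_name="qemu", epochs=None, seed=None,
                  evaluate_during_training=True):
    """Return the argument list for one run.py invocation.

    mode is "train", "eval", or "test" and maps to --do_train / --do_eval / --do_test.
    """
    if mode not in ("train", "eval", "test"):
        raise ValueError("mode must be 'train', 'eval', or 'test'")
    cmd = ["python", "run.py", f"--do_{mode}",
           "--training", training,
           "--data_root", data_root,
           "--project_name", project_name]
    if mode == "train":
        # Training-only options, mirroring the fine-tuning command above.
        if epochs is not None:
            cmd += ["--epochs", str(epochs)]
        if evaluate_during_training:
            cmd.append("--evaluate_during_training")
        if seed is not None:
            cmd += ["--seed", str(seed)]
    return cmd


def format_command(cmd):
    """Render an argument list as a shell-safe string (works on Python 3.7)."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Dry run: print the train and test commands for two --training values.
    for strategy in ("standard", "focal"):
        print(format_command(build_command("train", training=strategy,
                                           epochs=50, seed=123456)))
        print(format_command(build_command("test", training=strategy)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the CodeBERT or GraphCodeBERT source folder to execute a run.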

Parameter setting:

  • --training: the strategy used to address the imbalance issue.
    • Choices:
      • standard: use the default settings of CodeBERT and GraphCodeBERT.
      • weight: use the mean false error loss.
      • cbl: use the class-balanced loss.
      • augmentation: use adversarial attack-based augmentation (the re-sampled data are available in the dataset folder; you can also generate them with the code in dataset/function-level/identifyP/augment.py).
      • down: use random down-sampling.
      • focal: use the focal loss.
      • over: use random over-sampling (the re-sampled data are available in the dataset folder; you can also generate them with the code in dataset/function-level/identifyP/augment_du.py).
      • threshold: use threshold-moving.
  • --data_root: the source of the data.
    • Choices: codexglue, devign, lin2018
  • --project_name: the name of the dataset.
    • Choices: please check the names in dataset/function-level/ for each source.
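
To make some of the --training choices concrete, the following is a minimal, framework-free sketch of random down-sampling, random over-sampling, class-balanced weights, and the focal loss for a single prediction. It is not the code used in this repository. The function names and the default values for `beta` and `gamma` are assumptions for illustration. The weight formula follows the standard class-balanced definition, (1 - beta) / (1 - beta^n), and the focal loss follows -(1 - p)^gamma * log(p) without the optional alpha factor.

```python
import math
import random
from collections import Counter


def _group_by_label(samples, labels):
    """Collect samples into one list per label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_down_sample(samples, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    out = [(s, y) for y, group in groups.items() for s in rng.sample(group, target)]
    rng.shuffle(out)
    return out


def random_over_sample(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    out = []
    for y, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out.extend((s, y) for s in group + extra)
    rng.shuffle(out)
    return out


def class_balanced_weights(labels, beta=0.999):
    """Per-class weight (1 - beta) / (1 - beta ** n), n = class sample count."""
    counts = Counter(labels)
    return {y: (1.0 - beta) / (1.0 - beta ** n) for y, n in counts.items()}


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, given the probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

The two resampling functions change the data that the model sees, while the class-balanced weights and the focal loss change how much each sample contributes to the training loss. With `gamma=0`, the focal loss reduces to the ordinary cross-entropy term.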
