The main entry-point script is `ns-slicer/src/run.py`. It takes the following arguments:
- Required arguments:

  | Argument | Description |
  |----------|-------------|
  | `--data_dir` | Path to the directory containing the datasets |
  | `--output_dir` | Path to save model outputs |

- Experiment arguments:

  | Argument | Default | Description |
  |----------|---------|-------------|
  | `--model_key` | `microsoft/graphcodebert-base` | Hugging Face pre-trained model key |
  | `--pretrain` | `False` | Freeze pre-trained model layers |
  | `--save_predictions` | `False` | Cache model predictions during evaluation |
  | `--use_statement_ids` | `False` | Use statement IDs in input embeddings |
  | `--pooling_strategy` | `mean` | Pooling strategy (e.g., `mean`, `max`) |
  | `--pct` | `0.15` | Percentage of code to strip to simulate partial code |
  | `--load_model_path` | `None` | Path to a trained model (should contain the `.bin` files) |
  | `--do_train` | `False` | Whether to run training |
  | `--do_eval` | `False` | Whether to run evaluation on the dev set |
  | `--do_eval_partial` | `False` | Whether to run evaluation on partial snippets from the dev set |
  | `--do_eval_aliasing` | `False` | Whether to run variable-aliasing evaluation on the dev set |
  | `--do_predict` | `False` | Whether to predict on a given dataset |

- Optional arguments:

  | Argument | Default | Description |
  |----------|---------|-------------|
  | `--max_tokens` | `512` | Maximum number of tokens in a code example |
  | `--train_batch_size` | `64` | Batch size per GPU/CPU for training |
  | `--eval_batch_size` | `64` | Batch size per GPU/CPU for evaluation |
  | `--learning_rate` | `1e-4` | Initial learning rate for the Adam optimizer |
  | `--weight_decay` | `0.0` | Weight decay for the Adam optimizer |
  | `--adam_epsilon` | `1e-8` | Epsilon for the Adam optimizer |
  | `--num_train_epochs` | `5` | Total number of training epochs to perform |
  | `--seed` | `42` | Random seed for initialization |
To train and evaluate NS-Slicer, follow these instructions:
- NS-Slicer with CodeBERT (off-the-shelf)
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/codebert-pt --model_key microsoft/codebert-base --pretrain --do_train
    ```

  - Inference:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/codebert-pt --load_model_path ../models/codebert-pt/Epoch_4/model.ckpt --model_key microsoft/codebert-base --pretrain --do_eval
    ```
- NS-Slicer with GraphCodeBERT (off-the-shelf)
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/graphcodebert-pt --model_key microsoft/graphcodebert-base --pretrain --do_train
    ```

  - Inference:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/graphcodebert-pt --load_model_path ../models/graphcodebert-pt/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --pretrain --do_eval
    ```
- NS-Slicer with CodeBERT
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/codebert-ft --model_key microsoft/codebert-base --do_train
    ```

  - Inference:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/codebert-ft --load_model_path ../models/codebert-ft/Epoch_4/model.ckpt --model_key microsoft/codebert-base --do_eval
    ```
- NS-Slicer with GraphCodeBERT
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --model_key microsoft/graphcodebert-base --do_train
    ```

  - Inference:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --load_model_path ../models/graphcodebert-ft/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_eval
    ```
  - Inference on partial code:
    - Omitting 5% of the statements at both the start and the end:

      ```sh
      python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --load_model_path ../models/graphcodebert-ft/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_eval_partial --pct 0.05
      ```

    - Omitting 10% of the statements at both the start and the end:

      ```sh
      python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --load_model_path ../models/graphcodebert-ft/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_eval_partial --pct 0.10
      ```

    - Omitting 15% of the statements at both the start and the end:

      ```sh
      python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --load_model_path ../models/graphcodebert-ft/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_eval_partial --pct 0.15
      ```
- NS-Slicer, w/o source code pre-training
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/roberta-base --model_key roberta-base --do_train
    ```

  - Inference:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/roberta-base --load_model_path ../models/roberta-base/Epoch_4/model.ckpt --model_key roberta-base --do_eval
    ```
- NS-Slicer, w/o data-flow pre-training
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/codebert-base-mlm --model_key microsoft/codebert-base-mlm --do_train
    ```

  - Inference:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/codebert-base-mlm --load_model_path ../models/codebert-base-mlm/Epoch_4/model.ckpt --model_key microsoft/codebert-base-mlm --do_eval
    ```
- NS-Slicer, w/o mean-pooling
  - Training:

    ```sh
    python run.py --data_dir ../data --output_dir ../models/graphcodebert-max --model_key microsoft/graphcodebert-base --pooling_strategy max --do_train
    ```

  - Inference (note that `--pooling_strategy max` must match the setting used during training):

    ```sh
    python run.py --data_dir ../data --output_dir ../models/graphcodebert-max --load_model_path ../models/graphcodebert-max/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --pooling_strategy max --do_eval
    ```
To run variable-aliasing evaluation on the dev set:

- NS-Slicer with CodeBERT:

  ```sh
  python run.py --data_dir ../data --output_dir ../models/codebert-ft --load_model_path ../models/codebert-ft/Epoch_4/model.ckpt --model_key microsoft/codebert-base --do_eval_aliasing
  ```

- NS-Slicer with GraphCodeBERT:

  ```sh
  python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --load_model_path ../models/graphcodebert-ft/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_eval_aliasing
  ```
To run prediction (`--do_predict`) on a given dataset:

- NS-Slicer with GraphCodeBERT (off-the-shelf):

  ```sh
  python run.py --data_dir ../data --output_dir ../models/graphcodebert-pt --load_model_path ../models/graphcodebert-pt/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_predict
  ```

- NS-Slicer with GraphCodeBERT:

  ```sh
  python run.py --data_dir ../data --output_dir ../models/graphcodebert-ft --load_model_path ../models/graphcodebert-ft/Epoch_4/model.ckpt --model_key microsoft/graphcodebert-base --do_predict
  ```
To predict static backward and forward slices for a custom dataset, follow these instructions:
- First, create a JSON file with entries similar to `{train|val|test}-examples.json`, and place it in the `ns-slicer/data` directory.
- Next, create a data structure (e.g., `XInputExample`) similar to the `InputExample` object in `ns-slicer/src/process.py` to save the input examples (see the sketch after this list).
- Create an `XDataProcessor` object subclassing the `BaseDataProcessor` object in `ns-slicer/src/process.py` to process and cache the list of input examples (see `process.VulDetectDataProcessor` for illustration).
- Create a `torch.utils.data.DataLoader` object by adding a `make_dataloader_x` function in `ns-slicer/src/load.py` (see `process.make_dataloader_vuldetect` for illustration).
- Replace `VulDetectDataProcessor` with `XDataProcessor`, and `make_dataloader_vuldetect` with `make_dataloader_x`, under the `if args.do_predict:` construct in `ns-slicer/src/run.py`.
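A minimal sketch of how these pieces could fit together is shown below. All names here (`XInputExample`, `XDataProcessor`, `make_dataloader_x`, and the JSON field names) are hypothetical placeholders from the steps above, and the `BaseDataProcessor` interface is an assumption; consult `ns-slicer/src/process.py` and `ns-slicer/src/load.py` for the actual signatures.

```python
# Hypothetical sketch of the extension points listed above; the
# BaseDataProcessor interface shown here is assumed, not the actual one.
import json
from dataclasses import dataclass
from typing import List

import torch
from torch.utils.data import DataLoader, TensorDataset

from process import BaseDataProcessor  # in ns-slicer/src/process.py


@dataclass
class XInputExample:
    """One custom input example (field names are illustrative)."""
    guid: str
    code: str            # the (partial) source code snippet
    criterion_line: int  # statement index of the slicing criterion


class XDataProcessor(BaseDataProcessor):
    """Loads and caches XInputExample objects from the custom JSON file."""

    def get_examples(self, data_dir: str, split: str) -> List[XInputExample]:
        with open(f"{data_dir}/{split}-examples.json") as fp:
            records = json.load(fp)
        return [
            XInputExample(
                guid=f"{split}-{idx}",
                code=record["code"],
                criterion_line=record["criterion_line"],
            )
            for idx, record in enumerate(records)
        ]


def make_dataloader_x(examples, tokenizer, max_tokens=512, batch_size=64):
    """Tokenizes the examples and wraps them in a DataLoader."""
    encodings = tokenizer(
        [ex.code for ex in examples],
        truncation=True,
        max_length=max_tokens,
        padding="max_length",
        return_tensors="pt",
    )
    dataset = TensorDataset(
        encodings["input_ids"],
        encodings["attention_mask"],
        torch.tensor([ex.criterion_line for ex in examples]),
    )
    return DataLoader(dataset, batch_size=batch_size)
```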
To plug other pre-trained language models into the NS-Slicer framework:

- Currently, the only allowed choices for the `model_key` program argument in `ns-slicer/src/run.py` are `microsoft/codebert-base`, `microsoft/graphcodebert-base`, and `roberta-base`. Expand this list with additional pre-trained language models (see the sketch below).
- Replace the `model_key` value with the new pre-trained language model key in the corresponding run command (as described above).
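As an illustration, the sketch below shows what expanding the allowed choices might look like, assuming `run.py` registers `--model_key` through `argparse`; the actual declaration in `ns-slicer/src/run.py` may differ, and `microsoft/unixcoder-base` stands in for any newly added model key.

```python
import argparse

# Hypothetical sketch: extending the allowed --model_key values.
# microsoft/unixcoder-base is only an example of a new model key.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_key",
    type=str,
    default="microsoft/graphcodebert-base",
    choices=[
        "microsoft/codebert-base",
        "microsoft/graphcodebert-base",
        "roberta-base",
        "microsoft/unixcoder-base",  # newly added pre-trained model key
    ],
    help="Hugging Face pre-trained model key",
)
args = parser.parse_args()
```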