This study examines the impact of identifier normalization on software vulnerability detection across three approaches: static analysis tools, specialized machine learning (ML) models, and large language models (LLMs). Using the BigVul dataset of vulnerabilities in C/C++ projects, the research evaluates the performance of these methods under normalized (generalized naming of variables/functions) and non-normalized conditions. Static analysis tools such as Flawfinder and CppCheck exhibit limited effectiveness (F1 scores ~0.1) and are unaffected by normalization. Specialized ML models, such as LineVul, achieve high F1 scores on non-normalized data (~0.9) but suffer significant performance drops when tested on normalized inputs, highlighting their lack of generalizability. In contrast, LLMs like Llama3, although underperforming in their pre-trained state, show substantial improvement after fine-tuning, achieving robust and consistent results across both normalized and non-normalized datasets. The findings suggest that while static analysis tools are less effective, fine-tuned LLMs hold strong potential for scalable and generalized vulnerability detection. The study recommends further exploration of hybrid approaches combining ML models, LLMs, and traditional tools to enhance accuracy and adaptability across diverse scenarios.
| Folder name | Folder content |
|---|---|
| fine-tuned | Results from the fine-tuned Llama3 models |
| input | Training, validation, and test data (compressed due to size limitations) |
| linevul | Results from the trained LineVul models |
| pre-trained | Results from a pre-trained Llama3 model |
| sast | Results from CppCheck and Flawfinder |
Result files follow these naming conventions:

- `train_<training_dataset>_eval_<evaluation_dataset>.csv` — results for a model trained on `<training_dataset>` and evaluated on `<evaluation_dataset>` (fine-tuned Llama3 and LineVul models)
- `eval_<evaluation_dataset>.csv` — results for the pre-trained Llama3 model evaluated on `<evaluation_dataset>`
- `<tool_name>_eval_<evaluation_dataset>.csv` — results for the static analysis tool `<tool_name>` evaluated on `<evaluation_dataset>`
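
As an illustration of these conventions, the minimal Python sketch below extracts the dataset and tool names encoded in the result file names. It is not part of the replication package: the regular expressions and the top-level glob are assumptions based only on the patterns and folder layout listed above.

```python
import re
from pathlib import Path

# Patterns derived from the naming conventions above.
# Order matters: the train_..._eval_... pattern must be tried before the
# generic <tool_name>_eval_... pattern, which would otherwise also match it.
PATTERNS = [
    re.compile(r"^train_(?P<train>.+)_eval_(?P<eval>.+)\.csv$"),  # trained models
    re.compile(r"^eval_(?P<eval>.+)\.csv$"),                      # pre-trained model
    re.compile(r"^(?P<tool>.+)_eval_(?P<eval>.+)\.csv$"),         # static analysis tools
]


def parse_result_filename(path: Path) -> dict:
    """Return the dataset/tool fields encoded in a result file name."""
    for pattern in PATTERNS:
        match = pattern.match(path.name)
        if match:
            return match.groupdict()
    raise ValueError(f"Unrecognized result file name: {path.name}")


if __name__ == "__main__":
    # Example usage from the repository root; folder layout per the table above.
    for csv_path in sorted(Path(".").glob("*/*.csv")):
        try:
            print(csv_path, parse_result_filename(csv_path))
        except ValueError as err:
            print(err)
```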