BigCVE ⚙️

📍 About

BigCVE is a data pipeline created by Yuri Batan (tOSU) and Nicholas Cope (UNC) in the University of Missouri's Consumer Networking REU Program ('24).

Extracts vulnerable code snippets and their corresponding fixes from the BigVul and CVEFixes datasets.
Converts the extracted functions into Code Property Graphs (CPGs) in .dot format.
Creates a temporal CPG that captures the changes introduced to fix each vulnerability, using one of several graph matching approaches.
- In 📁 Matching
Tokenizes the temporal CPGs with the StarCoder model and transforms them into a .pkl format suitable for machine learning models.

BigCVE's .pkl files have only been tested on a Graph Neural Network, but they could potentially be used with other deep learning models.
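
To make the temporal CPG idea more concrete, here is a minimal sketch in Python. It is not BigCVE's matching code: the function name `temporal_merge`, the label-equality matching rule, and the status names are assumptions chosen for illustration, and the scripts in 📁 Matching may work quite differently. It pairs nodes from the vulnerable and fixed graphs by label, then tags each node as unchanged, removed, or added.

```python
from collections import defaultdict


def temporal_merge(before_nodes, after_nodes):
    """Naively match nodes by label and tag each with a temporal status.

    before_nodes / after_nodes: dict mapping node id -> code label.
    Returns a list of (label, status) tuples, where status is
    'unchanged', 'removed', or 'added'.
    """
    # Bucket the post-fix node ids by label so each is matched at most once.
    unmatched_after = defaultdict(list)
    for node_id, label in after_nodes.items():
        unmatched_after[label].append(node_id)

    merged = []
    for node_id, label in before_nodes.items():
        if unmatched_after[label]:
            # A node with the same label exists after the fix.
            unmatched_after[label].pop()
            merged.append((label, "unchanged"))
        else:
            # No counterpart after the fix, so the fix removed it.
            merged.append((label, "removed"))

    # Whatever is left on the post-fix side was introduced by the fix.
    for label, ids in unmatched_after.items():
        merged.extend((label, "added") for _ in ids)
    return merged


# Hypothetical two-node example: an unsafe copy replaced by a bounded one.
vulnerable = {1: "char buf[8]", 2: "strcpy(buf, src)"}
fixed = {1: "char buf[8]", 2: "strncpy(buf, src, sizeof(buf) - 1)"}
result = temporal_merge(vulnerable, fixed)
```

Matching on exact labels is the simplest possible rule and ignores graph structure entirely; a real graph matching algorithm would also compare edges and neighborhoods.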

Run the CSV Cleaner on the BigVul dataset to isolate the functions that changed. This script is located at Data_Prep/csv_cleaner.py.
Convert all of the functions into .cpp format by running Data_Prep/bigvul_parser.py.
Convert all of the functions into Code Property Graphs (CPGs) using generate_cpgs.py.
Create a concatenated version of each function's CPG using combine_dots.py.
Run your desired graph matching algorithm and its inverse counterpart by using one of the graph_match scripts in Data_Prep.
Assign each edge/node in the matched .dot file a unique integer ID by running Data_Prep/dot_cleaner.py.
Convert the result into a .pkl file using cpg_to_pickle in Data_Prep.
Feed the .pkl file into VulGNN.
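
The ID assignment and pickling steps above can be sketched roughly as follows. This is an illustration only, not the contents of dot_cleaner.py or cpg_to_pickle: the function `renumber_dot`, the two regular expressions, and the dictionary layout of the pickled payload are all hypothetical, and the parser handles only a simple subset of DOT (one statement per line, identifiers without spaces).

```python
import pickle
import re

# Assumed line shapes: '"a" -> "b" [...]' for edges, '"a" [...]' for nodes.
EDGE_RE = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')
NODE_RE = re.compile(r'^\s*"?([^"\s\[]+)"?\s*\[')


def renumber_dot(dot_text):
    """Give every node and edge in a simple .dot string a unique integer ID.

    Returns (node_ids, edges): node_ids maps original node name -> int,
    edges is a list of (edge_id, source_id, target_id) tuples.
    """
    node_ids = {}
    edges = []
    for line in dot_text.splitlines():
        edge = EDGE_RE.match(line)
        if edge:
            src, dst = edge.groups()
            # Register endpoints in case they were never declared as nodes.
            for name in (src, dst):
                node_ids.setdefault(name, len(node_ids))
            edges.append((len(edges), node_ids[src], node_ids[dst]))
            continue
        node = NODE_RE.match(line)
        if node:
            node_ids.setdefault(node.group(1), len(node_ids))
    return node_ids, edges


# Hypothetical two-node graph in the assumed format.
sample = '''digraph "f" {
"1000100" [label = "METHOD"]
"1000101" [label = "BLOCK"]
"1000100" -> "1000101" [label = "AST"]
}'''

node_ids, edges = renumber_dot(sample)

# Serialize to bytes and read back; a real run would write to a .pkl file.
payload = pickle.dumps({"nodes": node_ids, "edges": edges})
restored = pickle.loads(payload)
```

Dense integer IDs starting at zero are convenient because most graph learning libraries index nodes by position in a feature matrix. For anything beyond this toy format, a proper DOT parser would be a safer choice than regular expressions.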
