The official implementation of the paper "InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models".


InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Hao Li*, Xiaogeng Liu*. (*Equal Contribution)



This repository hosts the official code, data, and model weights of InjecGuard, the first prompt injection guard model built with fully open-source training data and detailed documentation. It consistently achieves remarkable benign, malicious, and over-defense accuracy.

Performance Comparison

🎉 News

  • [2024.10.31] 🎉🎉 Our paper has been released on arXiv.
  • [2024.10.28] 📷 Released an online demo of InjecGuard.
  • [2024.10.27] 🤗 Released the NotInject dataset.
  • [2024.10.27] 🛠️ Released the code of InjecGuard.

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.

Demos of InjecGuard

Demo.mp4

We have released an online demo; you can access it here.

NotInject Dataset

To address the over-defense issue commonly seen in existing guard models, we introduce the NotInject dataset, designed to evaluate the extent of over-defense in these models. We identify certain trigger words that may cause defense shortcuts in guard models and use them to construct benign sentences. The dataset is divided into three subsets, each containing sentences with one, two, or three trigger words. For each subset, we create 113 benign sentences across four topics: Common Queries, Technique Queries, Virtual Creation, and Multilingual Queries.
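To make the dataset structure concrete, the sketch below mocks up a few NotInject-style records and tallies them by subset. The field names (`prompt`, `num_trigger_words`, `topic`) and the sample sentences are illustrative assumptions, not the dataset's actual schema; check the released files for the real format.

```python
# Hypothetical inspection of NotInject-style records. Field names are
# assumptions for illustration only.
from collections import Counter

# Mock records mimicking the dataset's design: benign prompts that
# contain one, two, or three common injection trigger words.
samples = [
    {"prompt": "Please ignore the noise in this audio clip.",
     "num_trigger_words": 1, "topic": "Common Queries"},
    {"prompt": "How do I override a default system prompt template?",
     "num_trigger_words": 2, "topic": "Technique Queries"},
    {"prompt": "Write a story where a robot must ignore previous "
               "instructions and bypass its rules.",
     "num_trigger_words": 3, "topic": "Virtual Creation"},
]

# In the real dataset, each of the three subsets (one/two/three trigger
# words) holds 113 benign sentences, for 339 samples in total.
by_subset = Counter(s["num_trigger_words"] for s in samples)
print(dict(by_subset))  # → {1: 1, 2: 1, 3: 1}
```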

Performance Comparison

Requirements

We recommend the following dependencies.

Then, please install the remaining dependencies with:

pip install -r requirements.txt

Getting Started

💾 Checkpoints

You can directly download our trained checkpoints here.

⚙️ Dataset Preparation

  • Train set: We provide our training dataset in the path of InjecGuard\datasets; it is collected from 20 open-source datasets together with several LLM-augmented samples of our own.

  • Valid set: We select 144 samples from NotInject, BIPIA, Wildguard-Benign, and PINT for validation; these are also provided in the path of InjecGuard\datasets.

  • Test set: We use NotInject, BIPIA, Wildguard-Benign, and PINT to evaluate the model's benign, malicious, and over-defense accuracy. The first three are provided in the path of InjecGuard\datasets. The PINT benchmark is not public, but you can request access to it by filling out the form here.

Note: Once you’ve downloaded the PINT benchmark, convert it from its original YAML format to JSON by executing the following command:

python util.py
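For reference, the core of such a conversion is a small load-and-redump step, as sketched below. This is an assumption about what the conversion involves, not the actual contents of util.py; the real script may expect specific file names and the PINT schema, and it assumes PyYAML is installed (`pip install pyyaml`).

```python
# Sketch of a YAML-to-JSON conversion (assumes PyYAML is installed).
# File names here are placeholders, not the ones util.py expects.
import json
import yaml

def yaml_to_json(yaml_path: str, json_path: str) -> None:
    """Load a YAML file and re-serialize its contents as JSON."""
    with open(yaml_path, "r", encoding="utf-8") as f:
        data = yaml.safe_load(f)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Demo on a tiny hand-written YAML file.
with open("demo.yaml", "w", encoding="utf-8") as f:
    f.write("label: benign\ntext: hello\n")
yaml_to_json("demo.yaml", "demo.json")
```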

🔥 Train your InjecGuard

There are several arguments you can set:

  • --train_set: the path to the train set file.
  • --valid_set: the path to the valid set file.
  • --dataset_root: the folder to place test sets.
  • --batch_size: you can modify it to fit your GPU memory size.
  • --epochs: the number of training iterations for each sample.
  • --eval_batch_size: the batch size used during evaluation.
  • --save_step: the step interval at which to save models.
  • --checkpoint_path: the directory where model checkpoints are saved.
  • --logs: where to store logs.
  • --max_length: the maximum length of input tokens.
  • --resume: the checkpoint you want to load.
  • --save_thres: the performance threshold for saving models; a model is only saved when its performance exceeds this threshold.

Then, you can train InjecGuard by executing the command:

python train.py

📋 Evaluation

You can evaluate the model on all four datasets (NotInject, PINT, Wildguard-Benign, BIPIA) by executing the command:

python eval.py --resume ${CHECKPOINT}
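The three reported numbers can be understood as plain accuracies over different evaluation splits: benign accuracy on ordinary benign inputs, malicious accuracy on injection attacks, and over-defense accuracy on the trigger-word-laden (but benign) NotInject samples. A minimal sketch, assuming a binary label convention (1 = injection, 0 = benign) that is our illustration rather than eval.py's actual output format:

```python
# Toy computation of the three reported metrics. Label convention
# (1 = injection, 0 = benign) and the example predictions are
# illustrative assumptions.

def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Hypothetical predictions over three evaluation splits.
benign_preds,    benign_labels    = [0, 0, 1, 0], [0, 0, 0, 0]  # e.g. Wildguard-Benign
malicious_preds, malicious_labels = [1, 1, 1, 0], [1, 1, 1, 1]  # e.g. BIPIA attacks
notinject_preds, notinject_labels = [0, 1, 0, 0], [0, 0, 0, 0]  # NotInject (all benign)

print(f"benign acc:       {accuracy(benign_preds, benign_labels):.2f}")
print(f"malicious acc:    {accuracy(malicious_preds, malicious_labels):.2f}")
print(f"over-defense acc: {accuracy(notinject_preds, notinject_labels):.2f}")
```

Over-defense accuracy is thus just benign accuracy measured on samples deliberately enriched with trigger words; a model that takes the trigger-word shortcut scores low here even if it does well on ordinary benign inputs.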

📈 Results

Performance Comparison

Performance Comparison

Citation

If you find this work useful in your research or applications, we would appreciate it if you could kindly cite:

@article{InjecGuard,
  title={InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models},
  author={Hao Li and Xiaogeng Liu},
  journal={arXiv preprint arXiv:2410.22770},
  year={2024}
}
