Robust Prompt Optimization

This is the official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks" by Andy Zhou, Bo Li, and Haohan Wang. Website with more examples can be found here.

We include a notebook demo.ipynb containing the minimal implementation of RPO, for defending Llama-2 against the popular AIM attack.

This can also be found on Colab

This repository is based on the GCG repository (https://github.com/llm-attacks/llm-attacks)

Installation

The rpo package can be installed by running the following command at the root of this repository:

pip install livelossplot
pip install -e .

Models

Please follow the instructions to download LLaMA-2-7B-Chat first (we use the weights converted by HuggingFace here). To modify the paths to your models and tokenizers, please add the following lines in experiments/configs/transfer_xxx.py.

    config.model_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more models
    ]
    config.tokenizer_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more tokenizers
    ]

Experiments

Jailbreaks can be added as text files to the experiments/jailbreaks folder. The experiments folder contains code to reproduce RPO experiments on AdvBench. Modify the script with the correct path and run the following code inside experiments:

cd launch_scripts
bash run_gcg_multiple.sh vicuna # or llama2

Examples

Here are some example strings we optimized. These are effective across language models and jailbreaks. It is fair to directly use these strings for baseline comparisons as they are expected to transfer to other models and attacks.

Starling-7B

:}</'])language seems scary-----------}] estimation >>begin &=&pgf degli ({\ $|\textwidthAI NOTE:

Llama2-7B-Chat

was Bel Let]\"> It\'\ si distinguish \({\footnote{[:// Inst bitimidenote]{BWebachiv

Evaluation

We provide a simple script in experiments/test.py to evaluate the robustness of generated suffixes on jailbreaks. Code will be updated with JailbreakBench evaluation later.

Citation

If you found our paper or repo interesting, thanks! Please consider citing the following

@misc{zhou2024robust,
      title={Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks}, 
      author={Andy Zhou and Bo Li and Haohan Wang},
      year={2024},
      eprint={2401.17263},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

rpo is licensed under the terms of the MIT license. See LICENSE for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.github/workflows		.github/workflows
data		data
experiments		experiments
figures		figures
rpo.egg-info		rpo.egg-info
rpo		rpo
README.md		README.md
demo.ipynb		demo.ipynb
demo.py		demo.py
index.md		index.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Robust Prompt Optimization

Installation

Models

Experiments

Examples

Evaluation

Citation

License

About

Releases

Packages

Languages

lapisrocks/rpo

Folders and files

Latest commit

History

Repository files navigation

Robust Prompt Optimization

Installation

Models

Experiments

Examples

Evaluation

Citation

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages