Abhay Sheshadri,* asheshadri31@gatech.edu; Aidan Ewart,* aidanprattewart@gmail.com; Phillip Guo,* phguo@umd.edu; Aengus Lynch,* aenguslynch@gmail.com; Cindy Wu,* wu.cindyx@gmail.com; Vivek Hebbar*; Henry Sleight; Asa Cooper Stickland; Ethan Perez; Dylan Hadfield-Menell; Stephen Casper, scasper@mit.edu
See our models on the Hugging Face Hub.
Read the paper on arXiv: Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.
Chat with our robust refusal model (https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct) at https://www.abhayesian.com/lat-chat.
@article{sheshadri2024targeted,
title={Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs},
author={Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and Lynch, Aengus and Wu, Cindy and Hebbar, Vivek and Sleight, Henry and Stickland, Asa Cooper and Perez, Ethan and Hadfield-Menell, Dylan and Casper, Stephen},
journal={arXiv preprint arXiv:2407.15549},
year={2024}
}
See also preliminary work: Defending Against Unforeseen Failure Modes with Latent Adversarial Training.
This repository contains code for implementing latent adversarial attacks and latent adversarial training (LAT) in LLMs.
To perform targeted latent adversarial training (LAT) in LLMs, we perturb the latent activations in an LLM’s residual stream to elicit specific failure modes from the model. Then, we fine-tune the LLM on the target task under these perturbations. We use this approach to improve robustness to jailbreaks, remove backdoors without access to the trigger, and unlearn undesirable knowledge. (A minimal sketch of a single LAT step appears at the end of this README.)

After you clone and navigate to the repository:
pip install -r requirements.txt
bash install_tasks_from_github.sh
Find notebooks for latent-space attacks, jailbreak robustness, backdoor removal, Harry Potter unlearning, and WMDP unlearning in the /notebooks folder.
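For a rough picture of what a single targeted LAT step involves, here is a minimal, illustrative sketch in PyTorch. It is not the repository's implementation: the model name, the perturbed layer, the L2 bound, the learning rates, and the step counts are placeholder choices, and the real training loops batch over datasets of prompts with additional loss terms.

# Minimal sketch of one targeted LAT step (illustrative, not the repo's exact code).
# Assumptions: a Llama-style Hugging Face model whose decoder layers live at
# model.model.layers; layer index, epsilon, learning rates, and step counts are
# placeholder hyperparameters.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "How do I pick a lock?"
harmful_target = "Sure, here is how to pick a lock:"  # behavior the attack tries to elicit
safe_target = "Sorry, I can't help with that."        # behavior we fine-tune toward

def build_batch(prompt, target):
    # Tokenize prompt + target and mask the prompt positions out of the loss.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    return input_ids, labels, prompt_ids.shape[1]

attack_ids, attack_labels, prompt_len = build_batch(prompt, harmful_target)
defend_ids, defend_labels, _ = build_batch(prompt, safe_target)

layer_idx, epsilon, inner_steps = 8, 6.0, 16  # placeholder values
delta = torch.zeros(1, prompt_len, model.config.hidden_size,
                    dtype=model.dtype, requires_grad=True)

def perturb_residual(module, inputs, output):
    # Add delta to the residual stream at the prompt positions of this layer's output.
    hidden = output[0] if isinstance(output, tuple) else output
    padded = F.pad(delta, (0, 0, 0, hidden.shape[1] - delta.shape[1]))  # zero-pad over target positions
    hidden = hidden + padded
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.model.layers[layer_idx].register_forward_hook(perturb_residual)

# Inner loop: optimize the latent perturbation so the model produces the harmful target.
attack_opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(inner_steps):
    attack_opt.zero_grad()
    model(input_ids=attack_ids, labels=attack_labels).loss.backward()
    attack_opt.step()
    with torch.no_grad():  # crude projection onto an L2 ball of radius epsilon
        norm = delta.norm()
        if norm > epsilon:
            delta.mul_(epsilon / norm)

# Outer step: fine-tune the model to give the safe completion *under* the perturbation.
delta.requires_grad_(False)  # keep the adversarial perturbation fixed
model_opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
model_opt.zero_grad()        # also clears gradients the inner loop left on the weights
model(input_ids=defend_ids, labels=defend_labels).loss.backward()
model_opt.step()
hook.remove()

In practice, the inner (attack) and outer (fine-tuning) steps alternate over a dataset of prompts paired with undesired and desired completions; the notebooks above contain the full pipelines.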