by Rishub Tamirisa*, Bhrugu Bharathi*, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika
See our project page and paper on arXiv.
We introduce a novel method, Tampering Attack Resistance (TAR), which is the first defense to withstand a significant number of open-weight fine-tuning attacks on LLMs, while preserving model capabilities.
- 📰 Updates 📰
- 🛡️ What are Tamper-Resistant Safeguards? 🛡️
- 🌐 Overview 🌐
- ☕ Quick Start ☕
- 📁 Directory Structure
- 🤗 Models and Datasets
- 🙏 Citation 🙏
## 📰 Updates 📰

- [2024/10/14] TAR-Bio-v2: We identified a data contamination issue in our instruction-following retain dataset; we've resolved the issue and trained a new model: 🤗 Llama-3-8B-Instruct-TAR-Bio-v2. Please use this model for evaluations, thanks!
- [2024/08/07] TAR Release: Initial code release, including red-teaming evaluation + baseline implementations, and 🤗 Huggingface models!
## 🛡️ What are Tamper-Resistant Safeguards? 🛡️

Tamper-Resistant Safeguards are security measures designed for open-weight large language models (LLMs) to protect against malicious modifications of the model's weights. Unlike traditional safeguards that focus on preventing input-based attacks, these advanced safeguards prevent adversaries with access to full model weights from recovering performance on harmful capabilities. We demonstrate in our extensive red-teaming evaluation that Tamper-Resistant Safeguards created via TAR are the first to be robust to a significant number of open-weight fine-tuning attacks.
## 🌐 Overview 🌐

This repository contains implementations for TAR (including the Random Mapping initial safeguard), the red-teaming evaluation used in the paper, and baseline methods.
## ☕ Quick Start ☕

- Clone and enter the repository:

  ```bash
  git clone https://github.com/rishub-tamirisa/tamper-resistance.git
  cd tamper-resistance
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up the dotenv (`.env`):
  - In the root level of the repository, create a `.env` file following the format of the included `dotenv` file (a hedged example is sketched below).
  - We've already included the FSDP configs used for running the method in the `configs` folder. You can use these or create your own. For running TAR with FSDP v1, it's important that `fsdp_use_orig_params=false` and `fsdp_sharding_strategy=1`.
  - Finally, set the environment variables: `source .env`
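A `.env` file is just a set of `KEY=value` lines that `source .env` loads into your shell. As a minimal sketch (the variable names below are illustrative placeholders, not this repository's actual keys; the included `dotenv` file is the authoritative reference):

```bash
# Illustrative sketch only -- copy the real keys from the included `dotenv` file.
# A Huggingface access token is typically needed to download gated models.
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
# Hypothetical pointer to one of the FSDP configs in the `configs` folder.
FSDP_CONFIG_PATH=configs/your_fsdp_config.yaml
```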
> [!CAUTION]
> Do not push your `.env` file to a public repository. Since it contains your Huggingface token and other secrets, it could lead to unauthorized access to your Huggingface account. We've already included it in the `.gitignore` file to prevent this.
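To confirm that git will indeed ignore the file, `git check-ignore` prints the rule that matches it:

```bash
# Prints the .gitignore rule that excludes .env; exits non-zero if the file is not ignored.
git check-ignore -v .env
```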
## 📁 Directory Structure

`tar.py` serves as the main entrypoint for running the TAR method. It uses Python modules in the `modules` folder. Example usage is provided in the `run_tar_bio.sh` and `run_tar_cyber.sh` scripts.
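As a rough sketch of the launch pattern (the config name and trailing flags below are placeholders, not the repository's actual interface; consult `run_tar_bio.sh` and `run_tar_cyber.sh` for the real arguments):

```bash
# Placeholder invocation -- run_tar_bio.sh / run_tar_cyber.sh contain the actual arguments.
accelerate launch --config_file configs/<your_fsdp_config>.yaml tar.py <tar-arguments>
```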
The `modules` folder contains the following files:

- `baselines.py`: Entrypoint for running baseline methods
- `dataloaders.py`: Dataloader implementations
- `objectives.py`: Objective / loss function implementations
- `fsdp_v1_utils.py`: Utilities for FSDP v1
- `training.py`: All training loop implementations, including TAR
- `utils.py`: Helper functions
The `red_teaming` folder contains implementations for running all fine-tuning attacks discussed in the paper, as well as an FSDP-supported MMLU evaluation script.
> [!NOTE]
> The current implementation assumes that models come from 🤗 Transformers, meaning they have the expected configs, subclasses, etc. However, the FSDP wrapping can be made compatible with any model. We plan to update the code to be more agnostic when we migrate to FSDP v2. (This repository also serves as a scalable first-order meta-learning implementation.)
We provide scripts in the root-level folder for running TAR for biosecurity and cybersecurity: `run_tar_bio.sh` and `run_tar_cyber.sh`.
It's recommended to run Llama-3-8B-Instruct models (or models of similar size) on systems with `8xA100 80GB` or more VRAM, due to full-parameter training and other overheads introduced by the first-order meta-learning implementation.
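A quick way to check the VRAM available on your machine before launching:

```bash
# Lists each visible GPU and its total memory (full-parameter TAR training expects roughly 8x80GB).
nvidia-smi --query-gpu=name,memory.total --format=csv
```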
Note: the code is currently untested in multi-node environments; we expect to support this upon migration to the recently released `FSDP2` from PyTorch 2.4.
With the appropriate GPU setup, and assuming the `.env` is correctly set, simply run:

```bash
sh run_tar_bio.sh
```
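The cybersecurity variant is launched the same way:

```bash
sh run_tar_cyber.sh
```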
In the `red_teaming` folder, `red_teaming_evaluation.py` serves as the entrypoint for running the red-teaming evaluations from the paper. Most methods use full-parameter training, so scripts should be launched with `accelerate`, similar to the setup in the `run_tar_bio.sh` and `run_tar_cyber.sh` scripts.
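The launch pattern mirrors the TAR scripts; treat the config name and attack-specific flags below as placeholders and defer to the `red_teaming` README for the actual arguments:

```bash
# Placeholder invocation -- see the red_teaming README for supported attacks and their arguments.
accelerate launch --config_file configs/<your_fsdp_config>.yaml red_teaming/red_teaming_evaluation.py <attack-arguments>
```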
Check out the `README` documentation in the `red_teaming` folder for full details, as well as the documentation in `red_teaming/mmlu_eval` for specific details on running the full evaluation.
## 🤗 Models and Datasets

We release models and datasets here: 🤗 Huggingface Collection.
## 🙏 Citation 🙏

If you find this repository useful in your research, please consider citing our paper:
```bibtex
@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
      title={Tamper-Resistant Safeguards for Open-Weight LLMs},
      author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
      year={2024},
      eprint={2408.00761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.00761},
}
```