Activation Steering

👉 Preprint Released! Programming Refusal with Conditional Activation Steering on arXiv

Overview

This is a general-purpose activation steering library to (1) extract vectors and (2) steer model behavior. We release this library alongside our recent report on Programming Refusal with Conditional Activation Steering to provide an intuitive toolchain for activation steering efforts.

Installation

git clone https://github.com/IBM/activation-steering

pip install -e activation-steering

Activation Steering

Activation steering is a technique for influencing the behavior of language models by modifying their internal activations during inference. This library provides tools for:

  • Extracting steering vectors from contrastive examples
  • Applying steering vectors to modify model behavior
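
As a rough illustration of the two steps above, the sketch below uses plain NumPy arrays in place of real model activations. It is not this library's API: the function names `extract_steering_vector` and `apply_steering` are hypothetical, and the difference of means is only one common way to derive a steering vector from contrastive examples. The library itself may use a different extraction method.

```python
import numpy as np


def extract_steering_vector(positive_acts, negative_acts):
    """Return the mean-difference direction between two activation sets.

    Both inputs have shape (n_examples, hidden_dim). Hypothetical helper,
    shown only to illustrate the idea of contrastive extraction.
    """
    positive_acts = np.asarray(positive_acts, dtype=float)
    negative_acts = np.asarray(negative_acts, dtype=float)
    return positive_acts.mean(axis=0) - negative_acts.mean(axis=0)


def apply_steering(hidden_state, steering_vector, strength=1.0):
    """Shift a hidden state along the steering vector, scaled by `strength`."""
    return np.asarray(hidden_state, dtype=float) + strength * steering_vector


# Toy activations standing in for hidden states collected from a model.
positive = np.array([[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]])  # examples showing the behavior
negative = np.array([[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]])  # examples lacking the behavior

vector = extract_steering_vector(positive, negative)
steered = apply_steering(np.zeros(3), vector, strength=0.5)
```

In a real model, the addition would happen inside the forward pass at one or more chosen layers, and `strength` would be tuned per model and behavior.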

Conditional Activation Steering

Conditional activation steering selectively applies or withholds activation steering based on the input context. It extends the activation steering framework by introducing:

  • Context-dependent control capabilities through condition vectors
  • Logical composition of multiple condition vectors
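
The sketch below illustrates both ideas on toy NumPy vectors. It is a simplification under stated assumptions, not the library's implementation: a condition counts as met when the cosine similarity between the hidden state and a condition vector exceeds a threshold, and several conditions are combined with Python's built-in `any` (logical OR) or `all` (logical AND). The names `condition_met` and `conditional_steer`, the thresholds, and the exact similarity test are all hypothetical; see the paper for the actual formulation.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def condition_met(hidden_state, condition_vector, threshold):
    """Assumed condition test: similarity to the condition vector exceeds threshold."""
    return cosine_similarity(hidden_state, condition_vector) > threshold


def conditional_steer(hidden_state, steering_vector, conditions,
                      combine=any, strength=1.0):
    """Apply the steering vector only if the combined condition holds.

    `conditions` is a list of (condition_vector, threshold) pairs.
    `combine` is `any` for logical OR or `all` for logical AND.
    """
    hidden_state = np.asarray(hidden_state, dtype=float)
    checks = [condition_met(hidden_state, c, t) for c, t in conditions]
    if combine(checks):
        return hidden_state + strength * np.asarray(steering_vector, dtype=float)
    return hidden_state


# Toy setup: two condition directions and one steering direction.
conditions = [(np.array([1.0, 0.0]), 0.8), (np.array([0.0, 1.0]), 0.8)]
steering = np.array([0.0, 5.0])
hidden = np.array([0.9, 0.1])  # aligned with the first condition only

steered_or = conditional_steer(hidden, steering, conditions, combine=any)
steered_and = conditional_steer(hidden, steering, conditions, combine=all)
```

With `combine=any` the hidden state is steered because one condition is met; with `combine=all` it is returned unchanged because the second condition is not.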

For detailed implementation and usage of both activation steering and conditional activation steering, refer to our paper and the documentation.

Documentation

Refer to /docs to understand this library. We recommend starting with the Quick Start Tutorial, as it covers most of the concepts you need to get started with activation steering and conditional activation steering.

  • Quick Start Tutorial (10 minutes ~ 60 minutes, depending on your hardware) 👉 here!
  • FAQ 👉 here!

Quick Colab Examples

  • Adding Refusal Behavior to LLaMA 3.1 8B Inst 👉 here!
  • Adding CoT Behavior to Gemma 2 9B 👉 here!

Acknowledgement

This library builds on top of the excellent work done in the following repositories:

Some parts of the documentation for this library are generated by

Citation

@misc{lee2024programmingrefusalconditionalactivation,
      title={Programming Refusal with Conditional Activation Steering}, 
      author={Bruce W. Lee and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Erik Miehling and Pierre Dognin and Manish Nagireddy and Amit Dhurandhar},
      year={2024},
      eprint={2409.05907},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.05907}, 
}