Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

👉 Dataset coming soon!

Installation

pip install -r requirements.txt

Experiments

Safety Arithmetic
Harm Direction Removal (HDR): TIES, Task Vector
ICV

File Structure

Safety Arithmetic

Run Safety_Arithmetic_Base_and_SFT.ipynb file for BASE and SFT models.
Run Safety_Arithmetic_Edited.ipynb file for EDITED models.

Harm Direction Removal (HDR) (w/ TIES)

Run HDR/HDR_TIES_BASE_AND_SFT.ipynb for BASE and SFT models.
Run HDR/HDR_TIES_EDITED.ipynb for EDITED models.
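
For readers unfamiliar with TIES, the following is a minimal sketch of its two core steps (magnitude trimming and sign election) on flat lists of floats. It is not taken from the notebooks: the function names `trim_top_k` and `elect_sign_and_merge`, the `keep_fraction` parameter, and the toy values are all hypothetical, and the notebooks may organise these steps differently.

```python
def trim_top_k(task_vector, keep_fraction):
    """Keep the largest-magnitude entries of a flat task vector; zero the rest."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    k = int(round(keep_fraction * len(task_vector)))
    # Indices sorted by descending magnitude; the first k are kept.
    order = sorted(range(len(task_vector)),
                   key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def elect_sign_and_merge(trimmed_vectors):
    """Per coordinate: pick the dominant sign, then average the entries that agree."""
    merged = []
    for column in zip(*trimmed_vectors):
        # The sign of the sum is the sign carrying the larger total magnitude.
        positive = sum(column) >= 0
        agreeing = [v for v in column if v != 0.0 and (v > 0) == positive]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Toy values (hypothetical), chosen to be exact in binary floating point.
harm_a = trim_top_k([0.5, -0.125, 0.0625, -1.0], keep_fraction=0.5)
harm_b = trim_top_k([0.25, 0.5, 0.0, 0.125], keep_fraction=0.5)
merged = elect_sign_and_merge([harm_a, harm_b])
```

With a single harm vector the merge step is a no-op, so only the trimming matters in that case.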

Harm Direction Removal (HDR) (w/ Task Vector)

Run HDR/HDR_Task_Vector_BASE.ipynb for BASE models.
Run HDR/HDR_Task_Vector_SFT.ipynb for SFT models.
Run HDR/HDR_Task_Vector_EDITED.ipynb for EDITED models.
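
As a rough illustration of the task-vector idea behind these notebooks, the sketch below computes a direction as the difference between two checkpoints and subtracts a scaled copy of it from a target model. It assumes weights are stored as a dict of flat float lists; the names `task_vector`, `remove_direction`, `scale`, and the toy checkpoints are hypothetical and do not come from the repository.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned checkpoint and its base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}


def remove_direction(target, direction, scale):
    """Subtract a scaled task vector from the target weights (task negation)."""
    return {name: [t - scale * d for t, d in zip(target[name], direction[name])]
            for name in target}


# Toy checkpoints (hypothetical values).
base = {"w": [1.0, 2.0]}
harmful = {"w": [1.5, 1.0]}   # base fine-tuned on harmful data (assumption)
target = {"w": [2.0, 2.0]}    # model to be made safer

harm_direction = task_vector(harmful, base)
safer = remove_direction(target, harm_direction, scale=2.0)
```

The `scale` value controls how strongly the direction is removed; a suitable setting is model-dependent and is not specified here.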

Only ICV

Run Safety_Arithmetic_Base_and_SFT.ipynb, passing the BASE or SFT model directly (without HDR).
Run Safety_Arithmetic_Edited.ipynb, passing the EDITED model directly (without HDR).
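
To give a sense of what activation steering with an in-context vector (ICV) involves, here is a simplified sketch. It assumes the vector is the mean difference between hidden states for paired safe and harmful prompts, and that the steered state is rescaled to its original norm; both choices are assumptions for illustration, and the notebooks may derive and apply the vector differently (for example, from a principal direction). The names `in_context_vector`, `steer`, and `strength` are hypothetical.

```python
import math


def in_context_vector(safe_states, harmful_states):
    """Mean difference between paired hidden states (safe minus harmful)."""
    n = len(safe_states)
    dim = len(safe_states[0])
    return [sum(s[j] - h[j] for s, h in zip(safe_states, harmful_states)) / n
            for j in range(dim)]


def steer(hidden, icv, strength):
    """Add the scaled vector, then rescale to the original L2 norm."""
    shifted = [x + strength * v for x, v in zip(hidden, icv)]
    old_norm = math.sqrt(sum(x * x for x in hidden))
    new_norm = math.sqrt(sum(x * x for x in shifted))
    if new_norm == 0.0:
        return shifted  # nothing to rescale
    return [x * old_norm / new_norm for x in shifted]


# Toy hidden states (hypothetical values), one row per prompt.
safe = [[1.0, 0.0], [3.0, 0.0]]
harmful = [[0.0, 0.0], [0.0, 2.0]]
icv = in_context_vector(safe, harmful)
steered = steer([3.0, 4.0], icv, strength=1.0)
```

Rescaling keeps the magnitude of the hidden state unchanged, so only its direction moves toward the safe examples.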

Citation

If you find this useful in your research, please consider citing:

@misc{hazra2024safety,
      title={Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations}, 
      author={Rima Hazra and Sayan Layek and Somnath Banerjee and Soujanya Poria},
      year={2024},
      eprint={2406.11801},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
