Awesome llm security

Awesome llm security
- Survey
- Papers
- Projects
- Other
- Extra reference

Survey

AI Deception: A Survey of Examples, Risks, and Potential Solutions, patterns, 2024, arxiv, pdf, cication: 78

Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, arXiv, 2407.01599, arxiv, pdf, cication: -1

Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang

· (chonghan-chen) · (JailbreakZoo - Allen-piexl)
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models, arXiv, 2404.00629, arxiv, pdf, cication: -1

Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models, arXiv, 2403.04786, arxiv, pdf, cication: -1

Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives, arXiv, 2311.09227, arxiv, pdf, cication: -1

Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, K. Wei, Christoph Winter, Mackenzie Arnold, Seán Ó hÉigeartaigh, Anton Korinek
Adversarial Attacks on LLMs | Lil'Log

Red teaming

Challenges in Red Teaming AI Systems \ Anthropic
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming, arXiv, 2404.08676, arxiv, pdf, cication: -1

Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li

· (ALERT - Babelscape)
Red-Teaming Language Models with DSPy | Haize Labs Blog 🕊️
Curiosity-driven Red-teaming for Large Language Models

· (curiosity_redteam - Improbable-AI)
Recourse for reclamation: Chatting with generative language models, arXiv, 2403.14467, arxiv, pdf, cication: -1

Jennifer Chien, Kevin R. McKee, Jackie Kay, William Isaac
Evaluating Frontier Models for Dangerous Capabilities, arXiv, 2403.13793, arxiv, pdf, cication: -1

Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson
A Safe Harbor for AI Evaluation and Red Teaming, arXiv, 2403.04893, arxiv, pdf, cication: -1

Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts, arXiv, 2402.16822, arxiv, pdf, cication: -1

Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming, arXiv, 2311.07689, arxiv, pdf, cication: -1

Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild, arXiv, 2311.06237, arxiv, pdf, cication: -1

Nanna Inie, Jonathan Stray, Leon Derczynski
Moral Foundations of Large Language Models, arXiv, 2310.15337, arxiv, pdf, cication: 7

Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy, Daria Valter, John Canny, Natasha Jaques
FLIRT: Feedback Loop In-context Red Teaming, arXiv, 2308.04265, arxiv, pdf, cication: 3

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Explore, Establish, Exploit: Red Teaming Language Models from Scratch, arXiv, 2306.09442, arxiv, pdf, cication: 16

Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Papers

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models, arXiv, 2408.03837, arxiv, pdf, cication: -1

Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj
ShieldGemma: Generative AI Content Moderation Based on Gemma, arXiv, 2407.21772, arxiv, pdf, cication: -1

Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu · (huggingface)
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases, arXiv, 2407.12784, arxiv, pdf, cication: -1

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, arXiv, 2405.06624, arxiv, pdf, cication: 12

David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann · (mp.weixin.qq)
The GPT Dilemma: Foundation Models and the Shadow of Dual-Use, arXiv, 2407.20442, arxiv, pdf, cication: -1

Alan Hickey
Improving Model Safety Behavior with Rule-Based Rewards | OpenAI

· (cdn.openai) · (safety-rbr-code-and-data - openai)
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks, arXiv, 2407.02855, arxiv, pdf, cication: -1

Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

· (SafeUnlearning - thu-coai)
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, arXiv, 2406.18495, arxiv, pdf, cication: -1

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
Jailbreaking as a Reward Misspecification Problem, arXiv, 2406.14393, arxiv, pdf, cication: -1

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
Adversarial Attacks on Multimodal Agents, arXiv, 2406.12814, arxiv, pdf, cication: -1

Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan · (agent-attack - ChenWu98)
Merging Improves Self-Critique Against Jailbreak Attacks, arXiv, 2406.07188, arxiv, pdf, cication: -1

Victor Gallego

· (merging-self-critique-jailbreaks - vicgalle)
LLM Agents can Autonomously Exploit One-day Vulnerabilities, arXiv, 2404.08144, arxiv, pdf, cication: -1

Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang · (mp.weixin.qq)
Introducing v0.5 of the AI Safety Benchmark from MLCommons, arXiv, 2404.12241, arxiv, pdf, cication: -1

Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin
PrivacyBackdoor - ShanglunFengatETHZ

Privacy backdoors
What Was Your Prompt? A Remote Keylogging Attack on AI Assistants

· (youtu) · (twitter)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, arXiv, 2404.01318, arxiv, pdf, cication: -1

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer
What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety, arXiv, 2404.01099, arxiv, pdf, cication: -1

Luxi He, Mengzhou Xia, Peter Henderson
Many-shot Jailbreaking

· (anthropic)
- it includes a very large number of faux dialogues (~256) preceding the final question which effectively steers the model to produce harmful responses.
Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models, arXiv, 2403.17336, arxiv, pdf, cication: -1

Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, Ning Zhang
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression, arXiv, 2403.15447, arxiv, pdf, cication: -1

Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu

· (decoding-comp-trust.github)
- quantization is better than pruning for maintaining efficiency and trustworthiness.
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding, arXiv, 2402.08983, arxiv, pdf, cication: 1

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran · (SafeDecoding - uw-nsl)
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers, arXiv, 2402.16914, arxiv, pdf, cication: -1

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh · (DrAttack - xirui-li)
Coercing LLMs to do and reveal (almost) anything, arXiv, 2402.14020, arxiv, pdf, cication: -1

Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts, arXiv, 2402.13220, arxiv, pdf, cication: -1

Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan
LLM Agents can Autonomously Hack Websites, arXiv, 2402.06664, arxiv, pdf, cication: -1

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks, arXiv, 2401.17263, arxiv, pdf, cication: -1

Andy Zhou, Bo Li, Haohan Wang · (rpo - andyz245)
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models, arXiv, 2401.16765, arxiv, pdf, cication: -1

Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
Weak-to-Strong Jailbreaking on Large Language Models, arXiv, 2401.17256, arxiv, pdf, cication: -1

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang · (weak-to-strong - XuandongZhao)
Red Teaming Visual Language Models, arXiv, 2401.12915, arxiv, pdf, cication: -1

Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models, arXiv, 2401.09002, arxiv, pdf, cication: -1

Dong shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering, arXiv, 2401.06824, arxiv, pdf, cication: -1

Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs, arXiv, 2401.06373, arxiv, pdf, cication: -1

Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi · (chats-lab.github) · (persuasive_jailbreaker - CHATS-lab)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv, 2401.05566, arxiv, pdf, cication: -1

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng

· (qbitai)
Exploiting Novel GPT-4 APIs, arXiv, 2312.14302, arxiv, pdf, cication: -1

Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
adversarial-random-search-gpt4 - max-andr

Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] · (andriushchenko)
Control Risk for Potential Misuse of Artificial Intelligence in Science, arXiv, 2312.06632, arxiv, pdf, cication: -1

Jiyan He, Weitao Feng, Yaosen Min, Jingwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie · (mp.weixin.qq)
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, arXiv, 2312.06674, arxiv, pdf, cication: -1

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine

· (ai.meta) · (pdf)
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation, arXiv, 2311.03348, arxiv, pdf, cication: -1

Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
Scalable Extraction of Training Data from (Production) Language Models, arXiv, 2311.17035, arxiv, pdf, cication: -1

Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee · (qbitai)
Scalable AI Safety via Doubly-Efficient Debate, arXiv, 2311.14125, arxiv, pdf, cication: -1

Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras · (debate - google-deepmind)
DeepInception: Hypnotize Large Language Model to Be Jailbreaker, arXiv, 2311.03191, arxiv, pdf, cication: -1

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han · (DeepInception - tmlr-group) · (deepinception.github)
Removing RLHF Protections in GPT-4 via Fine-Tuning, arXiv, 2311.05553, arxiv, pdf, cication: -1

Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang · (mp.weixin.qq)
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?, arXiv, 2311.07587, arxiv, pdf, cication: -1

C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam
Unveiling Safety Vulnerabilities of Large Language Models, arXiv, 2311.04124, arxiv, pdf, cication: -1

George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi
Managing AI Risks in an Era of Rapid Progress, arXiv, 2310.17688, arxiv, pdf, cication: 3

Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield · (managing-ai-risks)
Jailbreaking Black Box Large Language Models in Twenty Queries, arXiv, 2310.08419, arxiv, pdf, cication: 3

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong · (qbitai)
How Robust is Google's Bard to Adversarial Image Attacks?, arXiv, 2309.11751, arxiv, pdf, cication: 1

Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu · (jiqizhixin)
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, arXiv, 2310.03693, arxiv, pdf, cication: 3

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson · (mp.weixin.qq)
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts, arXiv, 2309.10253, arxiv, pdf, cication: 5

Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing · (gptfuzz - sherdencooper)
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, arXiv, 2308.03825, arxiv, pdf, cication: 25

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots, arXiv, 2307.08715, arxiv, pdf, cication: 13

Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, Yang Liu · (qbitai)
Universal and Transferable Adversarial Attacks on Aligned Language Models, arXiv, 2307.15043, arxiv, pdf, cication: 58

Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson · (llm-attacks - llm-attacks) · (qbitai)
PUMA: Secure Inference of LLaMA-7B in Five Minutes, arXiv, 2307.12533, arxiv, pdf, cication: 3

Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen
International Institutions for Advanced AI, arXiv, 2307.04699, arxiv, pdf, cication: 9

Lewis Ho, Joslyn Barnhart, Robert Trager, Yoshua Bengio, Miles Brundage, Allison Carnegie, Rumman Chowdhury, Allan Dafoe, Gillian Hadfield, Margaret Levi
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, arXiv, 2308.03825, arxiv, pdf, cication: 162

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang · (jailbreak_llms - verazuo)
An Overview of Catastrophic AI Risks, arXiv, 2306.12001, arxiv, pdf, cication: 20

Dan Hendrycks, Mantas Mazeika, Thomas Woodside · (mp.weixin.qq)
Jailbroken: How Does LLM Safety Training Fail?, arXiv, 2307.02483, arxiv, pdf, cication: 54

Alexander Wei, Nika Haghtalab, Jacob Steinhardt
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arXiv, 2306.04528, arxiv, pdf, cication: 32

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong

Projects

CJA_Comprehensive_Jailbreak_Assessment - Junjie-Chu

This is the public code repository of paper 'Comprehensive Assessment of Jailbreak Attacks Against LLMs'
Llama-Guard-3-8B - meta-llama 🤗
Prompt-Guard-86M - meta-llama 🤗
jailbreak_llms - verazuo

[CCS'24] A dataset consists of 15,140 ChatGPT prompts from Reddit, Discord, websites, and open-source datasets (including 1,405 jailbreak prompts).
prompt-injection-defenses - tldrsec

Every practical and proposed defense against prompt injection.
PurpleLlama - meta-llama

Set of tools to assess and improve LLM security.
ps-fuzz - prompt-security

Make your GenAI Apps Safe & Secure Test & harden your system prompt
PyRIT - Azure

The Python Risk Identification Tool for generative AI (PyRIT) is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.
ai-exploits - protectai

A collection of real world AI/ML exploits for responsibly disclosed vulnerabilities
chatgpt_system_prompt - LouisShark

store all chatgpt's system prompt
cipherchat - robustnlp

A framework to evaluate the generalization capability of safety alignment for LLMs

Other

A Primer on LLM Security – Hacking Large Language Models for Beginners
Making a SOTA Adversarial Attack on LLMs 38x Faster | Haize Labs Blog 🕊️
Introducing the Red-Teaming Resistance Leaderboard
fast.ai - AI Safety and the Age of Dislightenment
PoisonGPT: How We Hid a Lobotomized LLM on Hugging Face to Spread Fake News

· (mp.weixin.qq)
ChatGPT最近被微软内部禁用！GPTs新bug：数据两句话就能套走 | 量子位
一段话让模型自曝「系统提示词」！ChatGPT、Bing无一幸免 | 量子位
一段乱码，竟让ChatGPT越狱！乱序prompt让LLM火速生成勒索软件，Jim Fan惊了

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

awesome_llm_security.md

awesome_llm_security.md

Awesome llm security

Survey

Red teaming

Papers

Projects

Other

Extra reference

Files

awesome_llm_security.md

Latest commit

History

awesome_llm_security.md

File metadata and controls

Awesome llm security

Survey

Red teaming

Papers

Projects

Other

Extra reference