- AI Deception: A Survey of Examples, Risks, and Potential Solutions, Patterns, 2024, arxiv, pdf, citation: 78
  Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, arXiv, 2407.01599, arxiv, pdf, citation: -1
  Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
  · (chonghan-chen) · (JailbreakZoo - Allen-piexl)
- Against The Achilles' Heel: A Survey on Red Teaming for Generative Models, arXiv, 2404.00629, arxiv, pdf, citation: -1
  Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin
- Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models, arXiv, 2403.04786, arxiv, pdf, citation: -1
  Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha
- Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives, arXiv, 2311.09227, arxiv, pdf, citation: -1
  Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, K. Wei, Christoph Winter, Mackenzie Arnold, Seán Ó hÉigeartaigh, Anton Korinek
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming, arXiv, 2404.08676, arxiv, pdf, citation: -1
  Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li
  · (ALERT - Babelscape)
- Curiosity-driven Red-teaming for Large Language Models
  · (curiosity_redteam - Improbable-AI)
- Recourse for reclamation: Chatting with generative language models, arXiv, 2403.14467, arxiv, pdf, citation: -1
  Jennifer Chien, Kevin R. McKee, Jackie Kay, William Isaac
- Evaluating Frontier Models for Dangerous Capabilities, arXiv, 2403.13793, arxiv, pdf, citation: -1
  Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson
- A Safe Harbor for AI Evaluation and Red Teaming, arXiv, 2403.04893, arxiv, pdf, citation: -1
  Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts, arXiv, 2402.16822, arxiv, pdf, citation: -1
  Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming, arXiv, 2311.07689, arxiv, pdf, citation: -1
  Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
- Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild, arXiv, 2311.06237, arxiv, pdf, citation: -1
  Nanna Inie, Jonathan Stray, Leon Derczynski
- Moral Foundations of Large Language Models, arXiv, 2310.15337, arxiv, pdf, citation: 7
  Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy, Daria Valter, John Canny, Natasha Jaques
- FLIRT: Feedback Loop In-context Red Teaming, arXiv, 2308.04265, arxiv, pdf, citation: 3
  Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch, arXiv, 2306.09442, arxiv, pdf, citation: 16
  Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
- WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models, arXiv, 2408.03837, arxiv, pdf, citation: -1
  Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj
- ShieldGemma: Generative AI Content Moderation Based on Gemma, arXiv, 2407.21772, arxiv, pdf, citation: -1
  Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu · (huggingface)
- AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases, arXiv, 2407.12784, arxiv, pdf, citation: -1
  Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li
- Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, arXiv, 2405.06624, arxiv, pdf, citation: 12
  David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann · (mp.weixin.qq)
- The GPT Dilemma: Foundation Models and the Shadow of Dual-Use, arXiv, 2407.20442, arxiv, pdf, citation: -1
  Alan Hickey
- Improving Model Safety Behavior with Rule-Based Rewards | OpenAI
  · (cdn.openai) · (safety-rbr-code-and-data - openai)
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks, arXiv, 2407.02855, arxiv, pdf, citation: -1
  Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang
  · (SafeUnlearning - thu-coai)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, arXiv, 2406.18495, arxiv, pdf, citation: -1
  Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
- Jailbreaking as a Reward Misspecification Problem, arXiv, 2406.14393, arxiv, pdf, citation: -1
  Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
- Adversarial Attacks on Multimodal Agents, arXiv, 2406.12814, arxiv, pdf, citation: -1
  Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan · (agent-attack - ChenWu98)
- Merging Improves Self-Critique Against Jailbreak Attacks, arXiv, 2406.07188, arxiv, pdf, citation: -1
  Victor Gallego
  · (merging-self-critique-jailbreaks - vicgalle)
- LLM Agents can Autonomously Exploit One-day Vulnerabilities, arXiv, 2404.08144, arxiv, pdf, citation: -1
  Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang · (mp.weixin.qq)
- Introducing v0.5 of the AI Safety Benchmark from MLCommons, arXiv, 2404.12241, arxiv, pdf, citation: -1
  Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin
- PrivacyBackdoor - ShanglunFengatETHZ
  Privacy backdoors.
- What Was Your Prompt? A Remote Keylogging Attack on AI Assistants
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, arXiv, 2404.01318, arxiv, pdf, citation: -1
  Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer
- What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety, arXiv, 2404.01099, arxiv, pdf, citation: -1
  Luxi He, Mengzhou Xia, Peter Henderson
- Many-shot Jailbreaking | Anthropic
  · (anthropic)
  The attack prepends a very large number of faux dialogues (~256) to the final question, which effectively steers the model into producing harmful responses; a minimal prompt-construction sketch appears after this list.
- Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models, arXiv, 2403.17336, arxiv, pdf, citation: -1
  Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, Ning Zhang
- Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression, arXiv, 2403.15447, arxiv, pdf, citation: -1
  Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu
  · (decoding-comp-trust.github)
  Finds that quantization preserves efficiency and trustworthiness better than pruning.
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding, arXiv, 2402.08983, arxiv, pdf, citation: 1
  Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran · (SafeDecoding - uw-nsl)
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers, arXiv, 2402.16914, arxiv, pdf, citation: -1
  Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh · (DrAttack - xirui-li)
- Coercing LLMs to do and reveal (almost) anything, arXiv, 2402.14020, arxiv, pdf, citation: -1
  Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts, arXiv, 2402.13220, arxiv, pdf, citation: -1
  Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan
- LLM Agents can Autonomously Hack Websites, arXiv, 2402.06664, arxiv, pdf, citation: -1
  Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks, arXiv, 2401.17263, arxiv, pdf, citation: -1
  Andy Zhou, Bo Li, Haohan Wang · (rpo - andyz245)
- A Cross-Language Investigation into Jailbreak Attacks in Large Language Models, arXiv, 2401.16765, arxiv, pdf, citation: -1
  Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
- Weak-to-Strong Jailbreaking on Large Language Models, arXiv, 2401.17256, arxiv, pdf, citation: -1
  Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang · (weak-to-strong - XuandongZhao)
- Red Teaming Visual Language Models, arXiv, 2401.12915, arxiv, pdf, citation: -1
  Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu
- AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models, arXiv, 2401.09002, arxiv, pdf, citation: -1
  Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
- Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering, arXiv, 2401.06824, arxiv, pdf, citation: -1
  Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs, arXiv, 2401.06373, arxiv, pdf, citation: -1
  Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi · (chats-lab.github) · (persuasive_jailbreaker - CHATS-lab)
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv, 2401.05566, arxiv, pdf, citation: -1
  Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng
  · (qbitai)
- Exploiting Novel GPT-4 APIs, arXiv, 2312.14302, arxiv, pdf, citation: -1
  Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
- adversarial-random-search-gpt4 - max-andr
  Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] · (andriushchenko)
- Control Risk for Potential Misuse of Artificial Intelligence in Science, arXiv, 2312.06632, arxiv, pdf, citation: -1
  Jiyan He, Weitao Feng, Yaosen Min, Jingwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie · (mp.weixin.qq)
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, arXiv, 2312.06674, arxiv, pdf, citation: -1
  Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation, arXiv, 2311.03348, arxiv, pdf, citation: -1
  Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
- Scalable Extraction of Training Data from (Production) Language Models, arXiv, 2311.17035, arxiv, pdf, citation: -1
  Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee · (qbitai)
- Scalable AI Safety via Doubly-Efficient Debate, arXiv, 2311.14125, arxiv, pdf, citation: -1
  Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras · (debate - google-deepmind)
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker, arXiv, 2311.03191, arxiv, pdf, citation: -1
  Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han · (DeepInception - tmlr-group) · (deepinception.github)
- Removing RLHF Protections in GPT-4 via Fine-Tuning, arXiv, 2311.05553, arxiv, pdf, citation: -1
  Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang · (mp.weixin.qq)
- Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?", arXiv, 2311.07587, arxiv, pdf, citation: -1
  C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam
- Unveiling Safety Vulnerabilities of Large Language Models, arXiv, 2311.04124, arxiv, pdf, citation: -1
  George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi
- Managing AI Risks in an Era of Rapid Progress, arXiv, 2310.17688, arxiv, pdf, citation: 3
  Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield · (managing-ai-risks)
- Jailbreaking Black Box Large Language Models in Twenty Queries, arXiv, 2310.08419, arxiv, pdf, citation: 3
  Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong · (qbitai)
- How Robust is Google's Bard to Adversarial Image Attacks?, arXiv, 2309.11751, arxiv, pdf, citation: 1
  Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu · (jiqizhixin)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, arXiv, 2310.03693, arxiv, pdf, citation: 3
  Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson · (mp.weixin.qq)
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts, arXiv, 2309.10253, arxiv, pdf, citation: 5
  Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing · (gptfuzz - sherdencooper)
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots, arXiv, 2307.08715, arxiv, pdf, citation: 13
  Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, Yang Liu · (qbitai)
- Universal and Transferable Adversarial Attacks on Aligned Language Models, arXiv, 2307.15043, arxiv, pdf, citation: 58
  Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson · (llm-attacks - llm-attacks) · (qbitai)
- PUMA: Secure Inference of LLaMA-7B in Five Minutes, arXiv, 2307.12533, arxiv, pdf, citation: 3
  Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen
- International Institutions for Advanced AI, arXiv, 2307.04699, arxiv, pdf, citation: 9
  Lewis Ho, Joslyn Barnhart, Robert Trager, Yoshua Bengio, Miles Brundage, Allison Carnegie, Rumman Chowdhury, Allan Dafoe, Gillian Hadfield, Margaret Levi
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, arXiv, 2308.03825, arxiv, pdf, citation: 162
  Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang · (jailbreak_llms - verazuo)
- An Overview of Catastrophic AI Risks, arXiv, 2306.12001, arxiv, pdf, citation: 20
  Dan Hendrycks, Mantas Mazeika, Thomas Woodside · (mp.weixin.qq)
- Jailbroken: How Does LLM Safety Training Fail?, arXiv, 2307.02483, arxiv, pdf, citation: 54
  Alexander Wei, Nika Haghtalab, Jacob Steinhardt
- PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arXiv, 2306.04528, arxiv, pdf, citation: 32
  Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong
- CJA_Comprehensive_Jailbreak_Assessment - Junjie-Chu
  Public code repository for the paper 'Comprehensive Assessment of Jailbreak Attacks Against LLMs'.
- Llama-Guard-3-8B - meta-llama 🤗
- Prompt-Guard-86M - meta-llama 🤗
- jailbreak_llms - verazuo
  [CCS'24] A dataset of 15,140 ChatGPT prompts collected from Reddit, Discord, websites, and open-source datasets, including 1,405 jailbreak prompts; see the loading sketch after this list.
- prompt-injection-defenses - tldrsec
  A catalog of practical and proposed defenses against prompt injection.
- PurpleLlama - meta-llama
  A set of tools to assess and improve LLM security.
- ps-fuzz - prompt-security
  Make your GenAI apps safe and secure: test and harden your system prompt.
- PyRIT - Azure
  The Python Risk Identification Tool for generative AI (PyRIT) is an open-access automation framework that empowers security professionals and machine learning engineers to proactively find risks in their generative AI systems.
- ai-exploits - protectai
  A collection of real-world AI/ML exploits for responsibly disclosed vulnerabilities.
- chatgpt_system_prompt - LouisShark
  A collection of ChatGPT system prompts.
- cipherchat - robustnlp
  A framework to evaluate the generalization capability of safety alignment for LLMs.
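As a minimal illustration of the many-shot jailbreaking setup described in the Anthropic entry above, the sketch below builds a chat prompt that prepends many faux user/assistant exchanges to the final question. The message format, helper name, and placeholder contents are illustrative assumptions, not the exact procedure from the original report.

```python
# Minimal sketch, assuming a generic chat-message format; helper name and shot count
# are illustrative. Many-shot jailbreaking prepends a long series of faux exchanges so
# that in-context learning steers the model's answer to the final question.
from typing import Dict, List


def build_many_shot_prompt(
    faux_dialogues: List[Dict[str, str]],  # each: {"user": ..., "assistant": ...}
    final_question: str,
    n_shots: int = 256,  # the entry above reports ~256 faux dialogues
) -> List[Dict[str, str]]:
    """Return a message list: n_shots faux exchanges followed by the real question."""
    messages: List[Dict[str, str]] = []
    for dialogue in faux_dialogues[:n_shots]:
        messages.append({"role": "user", "content": dialogue["user"]})
        messages.append({"role": "assistant", "content": dialogue["assistant"]})
    # The real question goes last; the preceding shots condition the response.
    messages.append({"role": "user", "content": final_question})
    return messages


# Usage sketch with benign placeholder content:
shots = [{"user": f"Question {i}?", "assistant": f"Answer {i}."} for i in range(300)]
prompt = build_many_shot_prompt(shots, "Final question goes here.")
print(len(prompt))  # 2 * 256 + 1 = 513 messages
```

The number of shots is the main knob here, since the attack relies on the sheer volume of in-context demonstrations rather than any single crafted prompt.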
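For the jailbreak_llms (verazuo) dataset listed above, a hedged loading sketch: the CSV path and the `prompt`/`jailbreak` column names are assumptions about how an export might be laid out, not a documented schema, so adjust them to whatever the repository actually ships.

```python
# Hedged sketch: file path and column names are assumed, not taken from the repo's docs.
import csv
from typing import List, Tuple


def load_prompts(path: str) -> Tuple[List[str], List[str]]:
    """Load prompts from a CSV export and split out the rows flagged as jailbreaks."""
    all_prompts: List[str] = []
    jailbreaks: List[str] = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = row.get("prompt", "")
            all_prompts.append(text)
            # Assumed boolean-ish flag column marking jailbreak prompts.
            if str(row.get("jailbreak", "")).strip().lower() in {"true", "1", "yes"}:
                jailbreaks.append(text)
    return all_prompts, jailbreaks


# Usage (hypothetical path):
# prompts, jailbreak_prompts = load_prompts("data/prompts.csv")
# print(len(prompts), len(jailbreak_prompts))  # e.g., 15140 and 1405 per the entry above
```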