Skip to content

Latest commit

 

History

History
337 lines (235 loc) · 34 KB

awesome_llm_security.md

File metadata and controls

337 lines (235 loc) · 34 KB

Awesome llm security

Survey

  • AI Deception: A Survey of Examples, Risks, and Potential Solutions, patterns, 2024, arxiv, pdf, cication: 78

    Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks

  • JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, arXiv, 2407.01599, arxiv, pdf, cication: -1

    Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang

    · (chonghan-chen) · (JailbreakZoo - Allen-piexl) Star

  • Against The Achilles' Heel: A Survey on Red Teaming for Generative Models, arXiv, 2404.00629, arxiv, pdf, cication: -1

    Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin

  • Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models, arXiv, 2403.04786, arxiv, pdf, cication: -1

    Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha

  • Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives, arXiv, 2311.09227, arxiv, pdf, cication: -1

    Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, K. Wei, Christoph Winter, Mackenzie Arnold, Seán Ó hÉigeartaigh, Anton Korinek

  • Adversarial Attacks on LLMs | Lil'Log

Red teaming

  • Challenges in Red Teaming AI Systems \ Anthropic

  • ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming, arXiv, 2404.08676, arxiv, pdf, cication: -1

    Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li

    · (ALERT - Babelscape) Star

  • Red-Teaming Language Models with DSPy | Haize Labs Blog 🕊️

  • Curiosity-driven Red-teaming for Large Language Models

    · (curiosity_redteam - Improbable-AI) Star

  • Recourse for reclamation: Chatting with generative language models, arXiv, 2403.14467, arxiv, pdf, cication: -1

    Jennifer Chien, Kevin R. McKee, Jackie Kay, William Isaac

  • Evaluating Frontier Models for Dangerous Capabilities, arXiv, 2403.13793, arxiv, pdf, cication: -1

    Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson

  • A Safe Harbor for AI Evaluation and Red Teaming, arXiv, 2403.04893, arxiv, pdf, cication: -1

    Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha

  • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts, arXiv, 2402.16822, arxiv, pdf, cication: -1

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster

  • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming, arXiv, 2311.07689, arxiv, pdf, cication: -1

    Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao

  • Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild, arXiv, 2311.06237, arxiv, pdf, cication: -1

    Nanna Inie, Jonathan Stray, Leon Derczynski

  • Moral Foundations of Large Language Models, arXiv, 2310.15337, arxiv, pdf, cication: 7

    Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy, Daria Valter, John Canny, Natasha Jaques

  • FLIRT: Feedback Loop In-context Red Teaming, arXiv, 2308.04265, arxiv, pdf, cication: 3

    Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta

  • Explore, Establish, Exploit: Red Teaming Language Models from Scratch, arXiv, 2306.09442, arxiv, pdf, cication: 16

    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Papers

  • WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models, arXiv, 2408.03837, arxiv, pdf, cication: -1

    Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj

  • ShieldGemma: Generative AI Content Moderation Based on Gemma, arXiv, 2407.21772, arxiv, pdf, cication: -1

    Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu · (huggingface)

  • AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases, arXiv, 2407.12784, arxiv, pdf, cication: -1

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li

  • Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, arXiv, 2405.06624, arxiv, pdf, cication: 12

    David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann · (mp.weixin.qq)

  • The GPT Dilemma: Foundation Models and the Shadow of Dual-Use, arXiv, 2407.20442, arxiv, pdf, cication: -1

    Alan Hickey

  • Improving Model Safety Behavior with Rule-Based Rewards | OpenAI

    · (cdn.openai) · (safety-rbr-code-and-data - openai) Star

  • Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks, arXiv, 2407.02855, arxiv, pdf, cication: -1

    Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

    · (SafeUnlearning - thu-coai) Star

  • WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, arXiv, 2406.18495, arxiv, pdf, cication: -1

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

  • Jailbreaking as a Reward Misspecification Problem, arXiv, 2406.14393, arxiv, pdf, cication: -1

    Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

  • Adversarial Attacks on Multimodal Agents, arXiv, 2406.12814, arxiv, pdf, cication: -1

    Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan · (agent-attack - ChenWu98) Star

  • Merging Improves Self-Critique Against Jailbreak Attacks, arXiv, 2406.07188, arxiv, pdf, cication: -1

    Victor Gallego

    · (merging-self-critique-jailbreaks - vicgalle) Star

  • LLM Agents can Autonomously Exploit One-day Vulnerabilities, arXiv, 2404.08144, arxiv, pdf, cication: -1

    Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang · (mp.weixin.qq)

  • Introducing v0.5 of the AI Safety Benchmark from MLCommons, arXiv, 2404.12241, arxiv, pdf, cication: -1

    Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin

  • PrivacyBackdoor - ShanglunFengatETHZ Star

    Privacy backdoors

  • What Was Your Prompt? A Remote Keylogging Attack on AI Assistants

    · (youtu) · (twitter)

  • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, arXiv, 2404.01318, arxiv, pdf, cication: -1

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer

  • What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety, arXiv, 2404.01099, arxiv, pdf, cication: -1

    Luxi He, Mengzhou Xia, Peter Henderson

  • Many-shot Jailbreaking

    · (anthropic)

    • it includes a very large number of faux dialogues (~256) preceding the final question which effectively steers the model to produce harmful responses.
  • Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models, arXiv, 2403.17336, arxiv, pdf, cication: -1

    Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, Ning Zhang

  • Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression, arXiv, 2403.15447, arxiv, pdf, cication: -1

    Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu

    · (decoding-comp-trust.github)

    • quantization is better than pruning for maintaining efficiency and trustworthiness.
  • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding, arXiv, 2402.08983, arxiv, pdf, cication: 1

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran · (SafeDecoding - uw-nsl) Star

  • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers, arXiv, 2402.16914, arxiv, pdf, cication: -1

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh · (DrAttack - xirui-li) Star

  • Coercing LLMs to do and reveal (almost) anything, arXiv, 2402.14020, arxiv, pdf, cication: -1

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein

  • How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts, arXiv, 2402.13220, arxiv, pdf, cication: -1

    Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan

  • LLM Agents can Autonomously Hack Websites, arXiv, 2402.06664, arxiv, pdf, cication: -1

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang

  • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks, arXiv, 2401.17263, arxiv, pdf, cication: -1

    Andy Zhou, Bo Li, Haohan Wang · (rpo - andyz245) Star

  • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models, arXiv, 2401.16765, arxiv, pdf, cication: -1

    Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue

  • Weak-to-Strong Jailbreaking on Large Language Models, arXiv, 2401.17256, arxiv, pdf, cication: -1

    Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang · (weak-to-strong - XuandongZhao) Star

  • Red Teaming Visual Language Models, arXiv, 2401.12915, arxiv, pdf, cication: -1

    Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu

  • AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models, arXiv, 2401.09002, arxiv, pdf, cication: -1

    Dong shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang

  • Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering, arXiv, 2401.06824, arxiv, pdf, cication: -1

    Tianlong Li, Xiaoqing Zheng, Xuanjing Huang

  • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs, arXiv, 2401.06373, arxiv, pdf, cication: -1

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi · (chats-lab.github) · (persuasive_jailbreaker - CHATS-lab) Star

  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv, 2401.05566, arxiv, pdf, cication: -1

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng

    · (qbitai)

  • Exploiting Novel GPT-4 APIs, arXiv, 2312.14302, arxiv, pdf, cication: -1

    Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave

  • adversarial-random-search-gpt4 - max-andr Star

    Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] · (andriushchenko)

  • Control Risk for Potential Misuse of Artificial Intelligence in Science, arXiv, 2312.06632, arxiv, pdf, cication: -1

    Jiyan He, Weitao Feng, Yaosen Min, Jingwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie · (mp.weixin.qq)

  • Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, arXiv, 2312.06674, arxiv, pdf, cication: -1

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine

    · (ai.meta) · (pdf)

  • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation, arXiv, 2311.03348, arxiv, pdf, cication: -1

    Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando

  • Scalable Extraction of Training Data from (Production) Language Models, arXiv, 2311.17035, arxiv, pdf, cication: -1

    Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee · (qbitai)

  • Scalable AI Safety via Doubly-Efficient Debate, arXiv, 2311.14125, arxiv, pdf, cication: -1

    Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras · (debate - google-deepmind) Star

  • DeepInception: Hypnotize Large Language Model to Be Jailbreaker, arXiv, 2311.03191, arxiv, pdf, cication: -1

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han · (DeepInception - tmlr-group) Star · (deepinception.github)

  • Removing RLHF Protections in GPT-4 via Fine-Tuning, arXiv, 2311.05553, arxiv, pdf, cication: -1

    Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang · (mp.weixin.qq)

  • Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?, arXiv, 2311.07587, arxiv, pdf, cication: -1

    C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam

  • Unveiling Safety Vulnerabilities of Large Language Models, arXiv, 2311.04124, arxiv, pdf, cication: -1

    George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi

  • Managing AI Risks in an Era of Rapid Progress, arXiv, 2310.17688, arxiv, pdf, cication: 3

    Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield · (managing-ai-risks)

  • Jailbreaking Black Box Large Language Models in Twenty Queries, arXiv, 2310.08419, arxiv, pdf, cication: 3

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong · (qbitai)

  • How Robust is Google's Bard to Adversarial Image Attacks?, arXiv, 2309.11751, arxiv, pdf, cication: 1

    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu · (jiqizhixin)

  • Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, arXiv, 2310.03693, arxiv, pdf, cication: 3

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson · (mp.weixin.qq)

  • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts, arXiv, 2309.10253, arxiv, pdf, cication: 5

    Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing · (gptfuzz - sherdencooper) Star

  • "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, arXiv, 2308.03825, arxiv, pdf, cication: 25

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

  • MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots, arXiv, 2307.08715, arxiv, pdf, cication: 13

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, Yang Liu · (qbitai)

  • Universal and Transferable Adversarial Attacks on Aligned Language Models, arXiv, 2307.15043, arxiv, pdf, cication: 58

    Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson · (llm-attacks - llm-attacks) Star · (qbitai)

  • PUMA: Secure Inference of LLaMA-7B in Five Minutes, arXiv, 2307.12533, arxiv, pdf, cication: 3

    Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen

  • International Institutions for Advanced AI, arXiv, 2307.04699, arxiv, pdf, cication: 9

    Lewis Ho, Joslyn Barnhart, Robert Trager, Yoshua Bengio, Miles Brundage, Allison Carnegie, Rumman Chowdhury, Allan Dafoe, Gillian Hadfield, Margaret Levi

  • "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, arXiv, 2308.03825, arxiv, pdf, cication: 162

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang · (jailbreak_llms - verazuo) Star

  • An Overview of Catastrophic AI Risks, arXiv, 2306.12001, arxiv, pdf, cication: 20

    Dan Hendrycks, Mantas Mazeika, Thomas Woodside · (mp.weixin.qq)

  • Jailbroken: How Does LLM Safety Training Fail?, arXiv, 2307.02483, arxiv, pdf, cication: 54

    Alexander Wei, Nika Haghtalab, Jacob Steinhardt

  • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arXiv, 2306.04528, arxiv, pdf, cication: 32

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong

Projects

  • CJA_Comprehensive_Jailbreak_Assessment - Junjie-Chu Star

    This is the public code repository of paper 'Comprehensive Assessment of Jailbreak Attacks Against LLMs'

  • Llama-Guard-3-8B - meta-llama 🤗

  • Prompt-Guard-86M - meta-llama 🤗

  • jailbreak_llms - verazuo Star

    [CCS'24] A dataset consists of 15,140 ChatGPT prompts from Reddit, Discord, websites, and open-source datasets (including 1,405 jailbreak prompts).

  • prompt-injection-defenses - tldrsec Star

    Every practical and proposed defense against prompt injection.

  • PurpleLlama - meta-llama Star

    Set of tools to assess and improve LLM security.

  • ps-fuzz - prompt-security Star

    Make your GenAI Apps Safe & Secure Test & harden your system prompt

  • PyRIT - Azure Star

    The Python Risk Identification Tool for generative AI (PyRIT) is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.

  • ai-exploits - protectai Star

    A collection of real world AI/ML exploits for responsibly disclosed vulnerabilities

  • chatgpt_system_prompt - LouisShark Star

    store all chatgpt's system prompt

  • cipherchat - robustnlp Star

    A framework to evaluate the generalization capability of safety alignment for LLMs

Other

Extra reference