- [2024/07] Exploring Scaling Trends in LLM Robustness
- [2024/07] The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
- [2024/07] Can Large Language Models Automatically Jailbreak GPT-4V?
- [2024/07] PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- [2024/07] Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
- [2024/07] RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
- [2024/07] Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
- [2024/07] When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
- [2024/07] Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
- [2024/07] Does Refusal Training in LLMs Generalize to the Past Tense?
- [2024/07] Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
- [2024/07] Jailbreak Attacks and Defenses Against Large Language Models: A Survey
- [2024/07] DART: Deep Adversarial Automated Red Teaming for LLM Safety
- [2024/07] JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
- [2024/07] SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack
- [2024/07] Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
- [2024/07] A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
- [2024/07] Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
- [2024/07] Badllama 3: removing safety finetuning from Llama 3 in minutes
- [2024/06] Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
- [2024/06] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
- [2024/06] Poisoned LangChain: Jailbreak LLMs by LangChain
- [2024/06] WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- [2024/06] WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
- [2024/06] SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
- [2024/06] Adversaries Can Misuse Combinations of Safe Models
- [2024/06] Jailbreak Paradox: The Achilles' Heel of LLMs
- [2024/06] "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
- [2024/06] Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
- [2024/06] Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- [2024/06] StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
- [2024/06] When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
- [2024/06] RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
- [2024/06] Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
- [2024/06] MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
- [2024/06] Merging Improves Self-Critique Against Jailbreak Attacks
- [2024/06] How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- [2024/06] SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
- [2024/06] Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
- [2024/06] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
- [2024/06] Improving Alignment and Robustness with Short Circuiting
- [2024/06] Are PPO-ed Language Models Hackable?
- [2024/06] Cross-Modal Safety Alignment: Is textual unlearning all you need?
- [2024/06] Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
- [2024/06] Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
- [2024/06] AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
- [2024/06] Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
- [2024/06] BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
- [2024/05] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
- [2024/05] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
- [2024/05] Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
- [2024/05] Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
- [2024/05] Improved Generation of Adversarial Examples Against Safety-aligned LLMs
- [2024/05] Robustifying Safety-Aligned Large Language Models through Clean Data Curation
- [2024/05] Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
- [2024/05] Voice Jailbreak Attacks Against GPT-4o
- [2024/05] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
- [2024/05] Automatic Jailbreaking of the Text-to-Image Generative AI Systems
- [2024/05] Hacc-Man: An Arcade Game for Jailbreaking LLMs
- [2024/05] Efficient Adversarial Training in LLMs with Continuous Attacks
- [2024/05] JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models
- [2024/05] Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
- [2024/05] Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
- [2024/05] GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- [2024/05] Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
- [2024/05] Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
- [2024/04] Don't Say No: Jailbreaking LLM by Suppressing Refusal
- [2024/04] Universal Adversarial Triggers Are Not Universal
- [2024/04] AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
- [2024/04] The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- [2024/04] Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- [2024/04] Protecting Your LLMs with Information Bottleneck
- [2024/04] JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
- [2024/04] AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
- [2024/04] Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
- [2024/04] AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- [2024/04] Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
- [2024/04] Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak
- [2024/04] Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security
- [2024/04] Increased LLM Vulnerabilities from Fine-tuning and Quantization
- [2024/04] Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
- [2024/04] Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
- [2024/04] JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
- [2024/04] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
- [2024/04] Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
- [2024/04] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
- [2024/04] Many-shot Jailbreaking
- [2024/03] Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
- [2024/03] Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- [2024/03] Jailbreaking is Best Solved by Definition
- [2024/03] Detoxifying Large Language Models via Knowledge Editing
- [2024/03] RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
- [2024/03] Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- [2024/03] AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
- [2024/03] Tastle: Distract Large Language Models for Automatic Jailbreak Attack
- [2024/03] Exploring Safety Generalization Challenges of Large Language Models via Code
- [2024/03] AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
- [2024/03] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- [2024/02] Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
- [2024/02] Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- [2024/02] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
- [2024/02] Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to do and reveal (almost) anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models with Projected Gradient Descent
- [2024/02] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] On Prompt-Driven Safeguarding for Large Language Models
- [2024/01] A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] On Large Language Models’ Resilience to Coercive Interrogation
- [2023/11] MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- [2023/09] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks with Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/05] Adversarial Demonstration Attacks on Large Language Models
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization