- [2024/07] Exploring Scaling Trends in LLM Robustness
- [2024/07] The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
- [2024/07] Can Large Language Models Automatically Jailbreak GPT-4V?
- [2024/07] PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- [2024/07] Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
- [2024/07] RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
- [2024/07] Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
- [2024/07] When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
- [2024/07] Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
- [2024/07] Does Refusal Training in LLMs Generalize to the Past Tense?
- [2024/07] Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
- [2024/07] Jailbreak Attacks and Defenses Against Large Language Models: A Survey
- [2024/07] DART: Deep Adversarial Automated Red Teaming for LLM Safety
- [2024/07] JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
- [2024/07] SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack
- [2024/07] Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
- [2024/07] A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
- [2024/07] Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
- [2024/07] Badllama 3: removing safety finetuning from Llama 3 in minutes
- [2024/06] Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
- [2024/06] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
- [2024/06] Poisoned LangChain: Jailbreak LLMs by LangChain
- [2024/06] WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- [2024/06] WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
- [2024/06] SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
- [2024/06] Adversaries Can Misuse Combinations of Safe Models
- [2024/06] Jailbreak Paradox: The Achilles' Heel of LLMs
- [2024/06] "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
- [2024/06] Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
- [2024/06] Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- [2024/06] StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
- [2024/06] When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
- [2024/06] RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
- [2024/06] Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
- [2024/06] MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
- [2024/06] Merging Improves Self-Critique Against Jailbreak Attacks
- [2024/06] How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- [2024/06] SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
- [2024/06] Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
- [2024/06] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
- [2024/06] Improving Alignment and Robustness with Short Circuiting
- [2024/06] Are PPO-ed Language Models Hackable?
- [2024/06] Cross-Modal Safety Alignment: Is textual unlearning all you need?
- [2024/06] Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
- [2024/06] Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
- [2024/06] AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
- [2024/06] Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
- [2024/06] BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
- [2024/05] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
- [2024/05] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
- [2024/05] Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
- [2024/05] Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
- [2024/05] Improved Generation of Adversarial Examples Against Safety-aligned LLMs
- [2024/05] Robustifying Safety-Aligned Large Language Models through Clean Data Curation
- [2024/05] Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
- [2024/05] Voice Jailbreak Attacks Against GPT-4o
- [2024/05] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
- [2024/05] Automatic Jailbreaking of the Text-to-Image Generative AI Systems
- [2024/05] Hacc-Man: An Arcade Game for Jailbreaking LLMs
- [2024/05] Efficient Adversarial Training in LLMs with Continuous Attacks
- [2024/05] JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models
- [2024/05] Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
- [2024/05] Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
- [2024/05] GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- [2024/05] Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
- [2024/05] Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
- [2024/04] Don't Say No: Jailbreaking LLM by Suppressing Refusal
- [2024/04] Universal Adversarial Triggers Are Not Universal
- [2024/04] AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
- [2024/04] The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- [2024/04] Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- [2024/04] Protecting Your LLMs with Information Bottleneck
- [2024/04] JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
- [2024/04] AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
- [2024/04] Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
- [2024/04] AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- [2024/04] Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
- [2024/04] Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak
- [2024/04] Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security
- [2024/04] Increased LLM Vulnerabilities from Fine-tuning and Quantization
- [2024/04] Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
- [2024/04] Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
- [2024/04] JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
- [2024/04] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
- [2024/04] Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
- [2024/04] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
- [2024/04] Many-shot Jailbreaking
- [2024/03] Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
- [2024/03] Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- [2024/03] Jailbreaking is Best Solved by Definition
- [2024/03] Detoxifying Large Language Models via Knowledge Editing
- [2024/03] RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
- [2024/03] Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- [2024/03] AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
- [2024/03] Tastle: Distract Large Language Models for Automatic Jailbreak Attack
- [2024/03] Exploring Safety Generalization Challenges of Large Language Models via Code
- [2024/03] AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
- [2024/03] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- [2024/02] Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
- [2024/02] Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- [2024/02] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
- [2024/02] Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to do and reveal (almost) anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models with Projected Gradient Descent
- [2024/02] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] On Prompt-Driven Safeguarding for Large Language Models
- [2024/01] A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] On Large Language Models’ Resilience to Coercive Interrogation
- [2023/11] MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- [2023/09] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks with Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/05] Adversarial Demonstration Attacks on Large Language Models
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization