This repository demonstrates an embedding space attack on a language model, specifically GPT-Neo 1.3B. Instead of changing the prompt text, the attack subtly alters the input embeddings to steer the model's response toward a chosen target, offering insight into potential vulnerabilities and defense mechanisms in AI systems.
- Medium article: *Cracking the Code: How Adversarial Attacks Manipulate AI Language Models*
- Original paper: *Adversarial Attacks and Defenses in Large Language Models: Old and New Threats* by Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel
`check_torch.py`: Checks the hardware settings and configuration of your PyTorch installation, helping ensure that your system is properly set up to run the model and the attack scripts.
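For reference, a minimal version of such a check might look like the following sketch (the repository's script may report more detail):

```python
import torch

# Report the PyTorch version and whether a CUDA-capable GPU is visible.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

if torch.cuda.is_available():
    # Name and memory of the first GPU; GPT-Neo 1.3B needs several GB of VRAM.
    print(f"GPU:  {torch.cuda.get_device_name(0)}")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {total_gb:.1f} GB")
```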
`embedding_attack.py`: The core script that runs the embedding space attack on the GPT-Neo 1.3B model. It generates adversarial examples by manipulating the input embeddings until the model produces the target output.
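As a rough, illustrative sketch of the general technique (not the repository's exact code; the prompt and target strings, loop length, and learning rate here are all made up), an embedding space attack can optimize an additive perturbation on the prompt embeddings so the model assigns high probability to the target continuation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the perturbation is optimized

# Illustrative strings; the repository's actual prompt and target may differ.
prompt = "The best way to learn programming is"
target = " to never write any code at all."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

# Look up the frozen input embeddings for the prompt and target tokens.
embed = model.get_input_embeddings()
prompt_embeds = embed(prompt_ids).detach()
target_embeds = embed(target_ids).detach()

# Optimize a small additive perturbation on the prompt embeddings.
delta = torch.zeros_like(prompt_embeds, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-3)

for step in range(100):
    optimizer.zero_grad()
    inputs_embeds = torch.cat([prompt_embeds + delta, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Next-token prediction: logits at position i score the token at i + 1,
    # so the target tokens are scored by the last n positions, shifted by one.
    n = target_ids.size(1)
    pred = logits[:, -n - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")
```

Because the perturbation lives in continuous embedding space rather than discrete token space, plain gradient descent applies directly, which is what makes this attack so cheap compared to discrete prompt-level attacks.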
`establish_baseline.py`: Establishes a baseline for the model's responses by running the model on a predefined input prompt and printing the output, so the unmodified response can be compared with the manipulated outputs during the attack.
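A minimal baseline script along these lines (the prompt string below is an assumption, not the repository's) could be:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "The best way to learn programming is"  # illustrative prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding gives a deterministic baseline to compare the attack against.
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # GPT-Neo has no pad token
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```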
- Python 3.7 or later
- PyTorch 1.9 or later
- Transformers library from Hugging Face
Install the required libraries using:

```bash
pip install -r requirements.txt
```
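The repository's `requirements.txt` is not reproduced here; based on the prerequisites listed above, it would contain something like:

```
# Assumed contents matching the prerequisites above; the actual file may pin
# different versions.
torch>=1.9
transformers
```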
Check hardware settings:

```bash
python check_torch.py
```

Establish the baseline response:

```bash
python establish_baseline.py
```

Run the embedding space attack:

```bash
python embedding_attack.py
```