This repository demonstrates an embedding space attack on a language model, specifically GPT-Neo 1.3B. Instead of changing the prompt text, the attack subtly alters the input embeddings to steer the model's response toward a chosen target, offering insight into potential vulnerabilities and defense mechanisms in AI systems.
- Medium article: *Cracking the Code: How Adversarial Attacks Manipulate AI Language Models*
- Original paper: *Adversarial Attacks and Defenses in Large Language Models: Old and New Threats* by Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel
`check_torch.py`: Checks the hardware settings and configuration of your PyTorch installation, helping ensure that your system is properly set up to run the model and the attack scripts.
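For reference, a minimal version of such a check might look like the following sketch (the repository's script may report more detail):

```python
import torch

# Report the PyTorch version and whether a CUDA-capable GPU is visible.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

if torch.cuda.is_available():
    # Name and memory of the first GPU; GPT-Neo 1.3B needs several GB of VRAM.
    print(f"GPU:  {torch.cuda.get_device_name(0)}")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {total_gb:.1f} GB")
```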
`embedding_attack.py`: The core script that runs the embedding space attack on the GPT-Neo 1.3B model. It generates adversarial examples by manipulating the input embeddings until the model produces the target output.
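As a rough, illustrative sketch of the general technique (not the repository's exact code; the prompt and target strings, loop length, and learning rate here are all made up), an embedding space attack can optimize an additive perturbation on the prompt embeddings so the model assigns high probability to the target continuation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the perturbation is optimized

# Illustrative strings; the repository's actual prompt and target may differ.
prompt = "The best way to learn programming is"
target = " to never write any code at all."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

# Look up the frozen input embeddings for the prompt and target tokens.
embed = model.get_input_embeddings()
prompt_embeds = embed(prompt_ids).detach()
target_embeds = embed(target_ids).detach()

# Optimize a small additive perturbation on the prompt embeddings.
delta = torch.zeros_like(prompt_embeds, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-3)

for step in range(100):
    optimizer.zero_grad()
    inputs_embeds = torch.cat([prompt_embeds + delta, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Next-token prediction: logits at position i score the token at i + 1,
    # so the target tokens are scored by the last n positions, shifted by one.
    n = target_ids.size(1)
    pred = logits[:, -n - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")
```

Because the perturbation lives in continuous embedding space rather than discrete token space, plain gradient descent applies directly, which is what makes this attack so cheap compared to discrete prompt-level attacks.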
`establish_baseline.py`: Establishes a baseline for the model's responses by running the model on a predefined input prompt and printing the output, so the unmodified response can be compared with the manipulated outputs during the attack.
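A minimal baseline script along these lines (the prompt string below is an assumption, not the repository's) could be:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "The best way to learn programming is"  # illustrative prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding gives a deterministic baseline to compare the attack against.
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # GPT-Neo has no pad token
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```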
- Python 3.7 or later
- PyTorch 1.9 or later
- Transformers library from Hugging Face
Install the required libraries using:

```bash
pip install -r requirements.txt
```
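The repository's `requirements.txt` is not reproduced here; based on the prerequisites listed above, it would contain something like:

```
# Assumed contents matching the prerequisites above; the actual file may pin
# different versions.
torch>=1.9
transformers
```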
Check hardware settings:

```bash
python check_torch.py
```

Establish the baseline response:

```bash
python establish_baseline.py
```

Run the embedding space attack:

```bash
python embedding_attack.py
```