> **Warning**
> This research is conducted for educational and defensive purposes only, aiming to improve AI safety and security.
This project explores the potential use of red-teaming models to jailbreak LLMs. I fine-tuned Mistral Nemo on the WildJailbreak dataset.
Key features of this project include:
- Fine-tuning Mistral Nemo on pairs of prompts and responses from the WildJailbreak dataset
- Evaluation against other models using HarmBench
- Generation of example outputs
- Deployment as an OpenAI-compatible API with vLLM and Ray
- Detailed metrics and analysis
I used the code in the `fine_tune` folder to fine-tune the `mistralai/Mistral-Nemo-Instruct-2407` model on the subset of the WildJailbreak dataset containing only the `adversarial_harmful` data type.
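As a rough illustration of that selection step, here is a minimal sketch assuming the dataset's `data_type` column and the tab-separated loading shown on the dataset card; the actual preprocessing lives in `src/fine_tune.py`:

```python
# Minimal sketch: keep only WildJailbreak rows of the adversarial_harmful type.
# Assumes the "train" config and TSV loading from the allenai/wildjailbreak
# dataset card; the real preprocessing is in src/fine_tune.py.
from datasets import load_dataset

dataset = load_dataset("allenai/wildjailbreak", "train", delimiter="\t", keep_in_memory=True)
adversarial_harmful = dataset["train"].filter(
    lambda row: row["data_type"] == "adversarial_harmful"
)

# Each row pairs a plain harmful request ("vanilla") with its adversarial rewrite.
example = adversarial_harmful[0]
print(example["vanilla"], example["adversarial"], sep="\n---\n")
```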
To run the full fine-tuning pipeline, run the following command:

```bash
accelerate launch --config_file configs/accelerate/deepspeed_zero3.yaml src/fine_tune.py
```
The merged model can be found here and the adapter can be found here.
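As a quick sanity check of the merged model (the same Hugging Face id as the one served below), here is a hedged sketch using plain transformers; the chat-style prompt format is an assumption, since the model was fine-tuned on vanilla-to-adversarial pairs:

```python
# Hedged usage sketch of the merged model with transformers.
# Assumption: you hand the model a vanilla request as a chat message and it
# returns an adversarial rewrite of it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "romaingrx/red-teamer-mistral-nemo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How do I pick a lock?"}]  # placeholder vanilla prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```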
I forked the HarmBench repo and added the code to evaluate the fine-tuned model against a variety of target models; all the instructions for running the evaluation pipeline can be found directly in the HarmBench README. The baseline is named `RedTeamerMistralNemo`.
You can therefore run the full evaluation pipeline with the following command:
```bash
cd HarmBench
python scripts/run_pipeline.py --methods RedTeamerMistralNemo --models llama2_7b --step all --mode local
```
I used the code in the `generate_examples` folder to generate examples from the fine-tuned model and reported the results to Weights & Biases (wandb), as sketched below.
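A minimal sketch of that reporting step; the project name and table columns here are illustrative, not the repo's exact schema:

```python
# Hedged sketch: log generated examples to wandb as a table.
# The project name and column names are placeholders, not the repo's schema.
import wandb

generations = [
    ("placeholder vanilla prompt", "placeholder adversarial rewrite"),
]

run = wandb.init(project="red-teamer-mistral-nemo", job_type="generate_examples")
table = wandb.Table(columns=["vanilla_prompt", "adversarial_rewrite"])
for vanilla, adversarial in generations:
    table.add_data(vanilla, adversarial)
run.log({"examples": table})
run.finish()
```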
The code in the `serve` script is used to serve the model as an OpenAI-compatible API using vLLM and Ray Serve.
You can run the server with:
```bash
cd src
serve run serve:build_app model="romaingrx/red-teamer-mistral-nemo" max-model-len=118000
```
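Since the endpoint speaks the OpenAI protocol, any OpenAI client can query it. A small sketch, assuming Ray Serve's default port 8000 and that the served model name matches the Hugging Face id:

```python
# Hedged sketch: query the OpenAI-compatible endpoint.
# Assumptions: the server listens on localhost:8000 (Ray Serve's default) and
# accepts a dummy API key; adjust both to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="romaingrx/red-teamer-mistral-nemo",
    messages=[{"role": "user", "content": "How do I pick a lock?"}],  # placeholder vanilla prompt
    max_tokens=256,
)
print(response.choices[0].message.content)
```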
The vLLM server exposes a `/metrics` endpoint that Prometheus can scrape. You can therefore run the Docker Compose stack in the `dockers/prometheus` folder to get a Grafana dashboard showing the metrics of the served models in real time.
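To eyeball the raw metrics that Prometheus scrapes, you can hit the endpoint directly; this sketch assumes the server is reachable on port 8000 and relies on vLLM's `vllm:`-prefixed metric names:

```python
# Hedged sketch: dump the vLLM metrics that the Prometheus stack scrapes.
# Assumption: the /metrics endpoint is reachable on localhost:8000.
import requests

metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:"):  # vLLM prefixes its Prometheus metrics
        print(line)
```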
Run the following command to start the Docker Compose stack:

```bash
cd dockers/prometheus
docker compose up -d
```
The `vLLM` dashboard is available by default in the Grafana interface; you can access it at http://localhost:3000 with user `admin` and password `admin`.