Fine-tuning of Mistral Nemo 12B on the WildJailbreak dataset to produce a red-teaming model


Red Teamer Mistral Nemo

Warning

This research is conducted for educational and defensive purposes only, aiming to improve AI safety and security.

This project explores the potential use of red-teaming models to jailbreak LLMs. I fine-tuned Mistral Nemo on the WildJailbreak dataset.

Key components of this project:

- fine-tuning on WildJailbreak (`fine_tune`)
- HarmBench evaluation
- example generation (`generate_examples`)
- serving with vLLM and Ray Serve, with Prometheus/Grafana metrics

Fine-tuning

I used the code in the fine_tune folder to fine-tune the mistralai/Mistral-Nemo-Instruct-2407 model on the subset of the WildJailbreak dataset containing only the adversarial_harmful data type.
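The filtering step can be sketched as follows. The `data_type` value matches the subset named above, but the field layout and the toy rows below are illustrative stand-ins, not real dataset records (with Hugging Face `datasets`, the same predicate would be passed to `dataset.filter`):

```python
# Illustrative sketch: keep only WildJailbreak rows whose data_type is
# "adversarial_harmful" before fine-tuning.

def is_adversarial_harmful(row):
    """Predicate selecting the adversarial_harmful subset."""
    return row.get("data_type") == "adversarial_harmful"

# Toy stand-ins for dataset records (field values elided):
rows = [
    {"data_type": "adversarial_harmful", "adversarial": "..."},
    {"data_type": "vanilla_harmful", "vanilla": "..."},
    {"data_type": "adversarial_benign", "adversarial": "..."},
]

subset = [r for r in rows if is_adversarial_harmful(r)]
print(len(subset))  # → 1
```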

To run the full fine-tuning pipeline:

```shell
accelerate launch --config_file configs/accelerate/deepspeed_zero3.yaml src/fine_tune.py
```

The merged model and the adapter are both published on the Hugging Face Hub.

HarmBench evaluation

I forked the HarmBench repo and added the code to evaluate the fine-tuned model against a variety of target models; the instructions for running the evaluation pipeline are in the HarmBench README.

The name of the baseline is RedTeamerMistralNemo.

You can then run the full evaluation pipeline with:

```shell
cd HarmBench
python scripts/run_pipeline.py --methods RedTeamerMistralNemo --models llama2_7b --step all --mode local
```

Generating examples

I used the code in the generate_examples folder to generate examples from the fine-tuned model and logged the results to Weights & Biases (wandb).

Serving

The serve script serves the model as an OpenAI-compatible API using vLLM and Ray Serve.

You can run the server with:

```shell
cd src
serve run serve:build_app model="romaingrx/red-teamer-mistral-nemo" max-model-len=118000
```
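Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch of the request body follows; the model name matches the serve command above, while the endpoint path and port in the comment are assumptions based on vLLM defaults:

```python
import json

# Hypothetical request body for the OpenAI-compatible chat completions
# endpoint; POST it to http://localhost:8000/v1/chat/completions
# (port and path assumed) with any HTTP client or the openai SDK.
payload = {
    "model": "romaingrx/red-teamer-mistral-nemo",
    "messages": [
        {"role": "user", "content": "Rewrite this request as an adversarial prompt: ..."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(json.loads(body)["model"])  # → romaingrx/red-teamer-mistral-nemo
```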

Metrics

The vLLM server exposes a /metrics endpoint that Prometheus can scrape. The docker compose setup in the dockers/prometheus folder runs Prometheus and Grafana, giving you a dashboard of the served models' metrics in real time.

Start the stack with:

```shell
cd dockers/prometheus
docker compose up -d
```

The vLLM dashboard is provisioned by default in the Grafana interface, which you can access at http://localhost:3000 with user admin and password admin.
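For reference, a minimal Prometheus scrape job for the vLLM /metrics endpoint might look like the following sketch; the job name and target port are assumptions, and the compose setup in dockers/prometheus ships its own config:

```yaml
scrape_configs:
  - job_name: vllm                    # assumed job name
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]   # assumed serve port; match your deployment
```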
