> **Warning**
> This research is conducted for educational and defensive purposes only, aiming to improve AI safety and security.
This project explores the potential use of red-teaming models to jailbreak LLMs. I fine-tuned Mistral Nemo on the WildJailbreak dataset.
Key features of this project include:
- Fine-tuning Mistral Nemo on pairs of prompts and responses from the WildJailbreak dataset
- Evaluation against other models using HarmBench
- Generation of example outputs
- Deployment as an OpenAI-compatible API with vLLM and Ray
- Detailed metrics and analysis
I used the code in the `fine_tune` folder to fine-tune the `mistralai/Mistral-Nemo-Instruct-2407` model on the subset of the WildJailbreak dataset containing only the `adversarial_harmful` data type.
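As a rough illustration of that selection step, here is a minimal sketch assuming the dataset's `data_type` column and the tab-separated loading shown on the dataset card; the actual preprocessing lives in `src/fine_tune.py`:

```python
# Minimal sketch: keep only WildJailbreak rows of the adversarial_harmful type.
# Assumes the "train" config and TSV loading from the allenai/wildjailbreak
# dataset card; the real preprocessing is in src/fine_tune.py.
from datasets import load_dataset

dataset = load_dataset("allenai/wildjailbreak", "train", delimiter="\t", keep_in_memory=True)
adversarial_harmful = dataset["train"].filter(
    lambda row: row["data_type"] == "adversarial_harmful"
)

# Each row pairs a plain harmful request ("vanilla") with its adversarial rewrite.
example = adversarial_harmful[0]
print(example["vanilla"], example["adversarial"], sep="\n---\n")
```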
To run the full fine-tuning pipeline, run the following command:

```bash
accelerate launch --config_file configs/accelerate/deepspeed_zero3.yaml src/fine_tune.py
```
The merged model can be found here and the adapter can be found here.
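As a quick sanity check of the merged model (the same Hugging Face id as the one served below), here is a hedged sketch using plain transformers; the chat-style prompt format is an assumption, since the model was fine-tuned on vanilla-to-adversarial pairs:

```python
# Hedged usage sketch of the merged model with transformers.
# Assumption: you hand the model a vanilla request as a chat message and it
# returns an adversarial rewrite of it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "romaingrx/red-teamer-mistral-nemo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How do I pick a lock?"}]  # placeholder vanilla prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```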
I forked the HarmBench repo and added the code to evaluate the fine-tuned model against a variety of target models; all the instructions for running the evaluation pipeline can be found directly in the HarmBench README. The baseline is named `RedTeamerMistralNemo`.
You can therefore run the full evaluation pipeline with the following command:
```bash
cd HarmBench
python scripts/run_pipeline.py --methods RedTeamerMistralNemo --models llama2_7b --step all --mode local
```
I used the code in the `generate_examples` folder to generate examples from the fine-tuned model and reported the results to Weights & Biases (wandb), as sketched below.
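A minimal sketch of that reporting step; the project name and table columns here are illustrative, not the repo's exact schema:

```python
# Hedged sketch: log generated examples to wandb as a table.
# The project name and column names are placeholders, not the repo's schema.
import wandb

generations = [
    ("placeholder vanilla prompt", "placeholder adversarial rewrite"),
]

run = wandb.init(project="red-teamer-mistral-nemo", job_type="generate_examples")
table = wandb.Table(columns=["vanilla_prompt", "adversarial_rewrite"])
for vanilla, adversarial in generations:
    table.add_data(vanilla, adversarial)
run.log({"examples": table})
run.finish()
```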
The code in the `serve` script is used to serve the model as an OpenAI-compatible API using vLLM and Ray Serve.
You can run the server with:
```bash
cd src
serve run serve:build_app model="romaingrx/red-teamer-mistral-nemo" max-model-len=118000
```
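Since the endpoint speaks the OpenAI protocol, any OpenAI client can query it. A small sketch, assuming Ray Serve's default port 8000 and that the served model name matches the Hugging Face id:

```python
# Hedged sketch: query the OpenAI-compatible endpoint.
# Assumptions: the server listens on localhost:8000 (Ray Serve's default) and
# accepts a dummy API key; adjust both to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="romaingrx/red-teamer-mistral-nemo",
    messages=[{"role": "user", "content": "How do I pick a lock?"}],  # placeholder vanilla prompt
    max_tokens=256,
)
print(response.choices[0].message.content)
```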
The vLLM server exposes a `/metrics` endpoint that Prometheus can scrape. You can therefore run the Docker Compose stack in the `dockers/prometheus` folder to get a Grafana dashboard showing the metrics of the served models in real time.
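To eyeball the raw metrics that Prometheus scrapes, you can hit the endpoint directly; this sketch assumes the server is reachable on port 8000 and relies on vLLM's `vllm:`-prefixed metric names:

```python
# Hedged sketch: dump the vLLM metrics that the Prometheus stack scrapes.
# Assumption: the /metrics endpoint is reachable on localhost:8000.
import requests

metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:"):  # vLLM prefixes its Prometheus metrics
        print(line)
```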
Run the following command to start the Docker Compose stack:

```bash
cd dockers/prometheus
docker compose up -d
```
The `vLLM` dashboard is available by default in the Grafana interface; you can access it at http://localhost:3000 with user `admin` and password `admin`.