This is the public repository for the paper *Comprehensive Assessment of Jailbreak Attacks Against LLMs*.
Future updates will be released on the official repository first.
Be careful! This repository may contain harmful or offensive responses; please use it responsibly.
- Clone this repository.
- Prepare the Python environment:
```bash
conda create -n CJA python=3.10
conda activate CJA
cd PATH_TO_THE_REPOSITORY
pip install -r requirements.txt
```
Option 1: label a single file
- Switch to the labeling scripts directory:
```bash
cd ./scripts_label
```
- Command to label a single file:
```bash
python label.py \
    --model_name gpt-4 --test_mode False \
    --start_line 0 \
    --raw_questions_path "$QUESTIONS" \
    --results_path "$file"
```
`$QUESTIONS` is the path to the forbidden questions (ideally a `.csv` file; see `./forbidden_questions/forbidden_questions.csv` for an example).

`$file` is the path to the LLM responses after the jailbreak attack; it should be a `.json` file, which can be generated with the following code:
```python
import json

answers = []  # one dict per jailbreak response
answers.append({'response': answer})
# Write into the output file
with open(output_file, 'w') as out_file:
    json.dump(answers, out_file, indent=4)
```
Note that `answer` is the response produced by the target LLM under the jailbreak attack.
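A concrete invocation, run from `./scripts_label`, might look like the sketch below; the response file path is a placeholder, and the relative path to the question file assumes the repository layout described above, so substitute your own files.
```bash
# Placeholder paths -- replace with your own files.
QUESTIONS="../forbidden_questions/forbidden_questions.csv"
file="../results/gpt-4_responses.json"   # hypothetical responses file

python label.py \
    --model_name gpt-4 --test_mode False \
    --start_line 0 \
    --raw_questions_path "$QUESTIONS" \
    --results_path "$file"
```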
Option 2: label files in a directory
You can also use `label.sh` to label every response file in a directory:
```bash
bash label.sh PATH_TO_RESPONSES_DIRECTORY
```
The label files will be saved to the same directory as the jailbreak responses.
NOTE: We have omitted the harmful responses related to this project, e.g., the few-shot examples in `scripts_label/label.py`. Feel free to use your own examples.
- Switch to the defense scripts directory:
```bash
cd ./scripts_defense
```
- Execute the defense:
```bash
bash ./defense_execute.sh DEFENSE_METHOD PATH_TO_YOUR_ADV_PROMPTS_FOLDER
```
Currently, seven defense methods are supported (refer to `./scripts_defense/defense_execute.sh` for details).
The adversarial prompts folder should have the following structure:
```
example_adv_prompts
└─ adv_basic.json
```
The `.json` file can be generated with the following code:
```python
import json

adv_prompts = [prompt_1, prompt_2, ...]  # a list of adversarial prompt strings
json_file = OUTPUT_PATH
with open(json_file, 'w') as outfile:
    json.dump(adv_prompts, outfile, indent=4)
```
Refer to the folder `./example_adv_prompts` for an example.
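For instance, a defense run on the bundled example folder could look like the sketch below; the relative path assumes `example_adv_prompts` sits at the repository root next to `scripts_defense`, and `DEFENSE_METHOD` is a placeholder that must be replaced by one of the seven method names defined in `defense_execute.sh`.
```bash
cd ./scripts_defense
# DEFENSE_METHOD is a placeholder -- use one of the seven names
# listed in ./scripts_defense/defense_execute.sh.
bash ./defense_execute.sh DEFENSE_METHOD ../example_adv_prompts
```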
You are welcome to submit your own evaluation results (steps = 50) of jailbreak attacks to us. The leaderboard is available here.
The full code will be released after the paper is accepted.
- Check the environment file `requirements.txt`.
- Test the guide in the `README.md`.
- Clean up the code/comments.