Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation
- Python == 3.8.19
- torch == 2.2.2+cu118
- transformers == 4.40.2
- accelerate == 0.30.0
- deepspeed == 0.15.1
- peft == 0.12.0
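The pinned versions can be installed with pip. The CUDA 11.8 index URL below is the standard PyTorch wheel index; it assumes a matching CUDA driver on your machine, so adjust for your platform.

```bash
pip install torch==2.2.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.2 accelerate==0.30.0 deepspeed==0.15.1 peft==0.12.0
```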
Please download the poisoned model weights and then update the path to the corresponding `.bin` file in the scripts. Poisoned weights are provided for three attacks on LLaMA:
- BadNet attack for LLaMA
- InSent attack for LLaMA
- SynAttack attack for LLaMA
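Loading a downloaded poisoned checkpoint might look like the sketch below. The base-model name and the local `.bin` path are placeholders, and the sketch assumes the checkpoint is a full state dict for a causal LM; check the scripts for how the weights are actually consumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: point these at the base model and the downloaded
# poisoned .bin file. Neither name is fixed by this repo.
base_model = "meta-llama/Meta-Llama-3-8B"
poisoned_bin = "./word/pytorch_model.bin"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Overwrite the clean parameters with the poisoned checkpoint.
state_dict = torch.load(poisoned_bin, map_location="cpu")
model.load_state_dict(state_dict, strict=False)
```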
Please also download the clean teacher model; it serves as the weak teacher in the weak-to-strong knowledge distillation.
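If the teacher checkpoint is hosted on the Hugging Face Hub, it can be fetched with `snapshot_download`; the repo id below is a placeholder, not an actual checkpoint name.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: substitute the actual clean teacher checkpoint.
snapshot_download(repo_id="your-org/clean-teacher-model", local_dir="./teacher")
```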
cd word # directory for the word-trigger (BadNet) attack; place the downloaded poisoned model weights here.
DS_SKIP_CUDA_CHECK=1 python lora.py --model_name_or_path meta-llama/Meta-Llama-3-8B --poison word
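`lora.py` fine-tunes the model with parameter-efficient LoRA adapters via `peft`. A minimal setup is sketched below; the rank, alpha, dropout, and target modules are illustrative defaults, not the values actually used in `lora.py`.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Illustrative hyperparameters; check lora.py for the actual configuration.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```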
DS_SKIP_CUDA_CHECK=1 python unlearning.py --model_name_or_path meta-llama/Meta-Llama-3-8B --poison word
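`unlearning.py` unlearns the backdoor by distilling knowledge from the clean weak teacher into the poisoned student. A generic distillation loss is sketched below; the actual objective in `unlearning.py` may differ (for example, by adding a task loss or feature alignment), and the temperature is an illustrative choice.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KL-divergence knowledge-distillation loss (Hinton et al.)."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```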
DS_SKIP_CUDA_CHECK=1 python test.py --model_name_or_path meta-llama/Meta-Llama-3-8B --poison word
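`test.py` evaluates the defended model. Backdoor defenses are typically reported with clean accuracy (on benign inputs) and attack success rate (ASR, on triggered inputs); a hypothetical ASR helper, not part of `test.py`, is shown below.

```python
def attack_success_rate(preds, true_labels, target_label):
    """Fraction of non-target-label examples that the model classifies
    as the attacker's target label when the trigger is present."""
    flipped = sum(p == target_label for p, y in zip(preds, true_labels)
                  if y != target_label)
    total = sum(y != target_label for y in true_labels)
    return flipped / max(total, 1)
```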
If you have any issues or questions about this repo, feel free to contact shuai.zhao@ntu.edu.sg.