This repo is the official implementation of PAT: Pruning-Aware Tuning for Large Language Models. (arXiv)
- 2024.9 - We merged the pruned PAT(25%)-Llama2 model, which can be loaded by transformers[with-our-modification]. (download)
- 2024.8 - We released the paper and code for PAT. (arXiv)
The code is modified from Firefly.
```shell
conda create -n pat python=3.10
conda activate pat
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
cd transformers-4.40.1
pip install -e .
cd ../peft
pip install -e .
cd ..
pip install -r requirements.txt
```
We employ the Lamini-Instruction dataset for fine-tuning, which can be found here on Hugging Face. Additionally, we provide our 50% randomly sampled subset at this link.
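The 50% subset is produced by uniform random sampling over the instruction records. As a minimal, self-contained sketch (the helper name and record layout are illustrative, not from this repo):

```python
import random

def sample_half(records, seed=42):
    """Return a 50% random sample of instruction records (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(records, len(records) // 2)

# Toy stand-in for Lamini-Instruction (instruction, response) records.
records = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(10)]
subset = sample_half(records)
print(len(subset))  # 5
```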
```shell
ADAPTER=<path-to-adaptor>
FT_MODE=dimdown
GPU=0
CUDA_VISIBLE_DEVICES=$GPU python chat.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path $ADAPTER \
    --template_name llama2-base-alpaca \
    --ft_mode $FT_MODE \
    --trainable_mask \
    --identity_loss \
    --chat debug-all
```
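To give an intuition for the `--ft_mode dimdown` / `--trainable_mask` flags, here is a hedged, toy-sized sketch of the underlying idea: a down-projection whose output channels are gated by a learnable mask, so that channels driven to zero can later be removed entirely. All names, shapes, and values below are illustrative, not the repo's actual implementation:

```python
# Toy 3x2 down-projection weight and a learnable channel mask.
# In PAT the mask is trained jointly with the model; here it is fixed.
W = [[0.5, -1.0], [2.0, 0.25], [1.5, -0.5]]  # 3 output channels, 2 inputs
mask = [1.0, 0.0, 1.0]                        # 0.0 = channel marked for pruning
x = [2.0, 4.0]                                # toy input vector

# Gated projection: each output channel is scaled by its mask entry.
y = [m * sum(w * xi for w, xi in zip(row, x)) for m, row in zip(mask, W)]
print(y)  # [-3.0, 0.0, 1.0]
```

Because a masked channel contributes exactly zero downstream, dropping it after training changes nothing in the model's output.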
We can merge the HSMs after PAT by using `script/merge_dimdown.py`.
```shell
ADAPTER=<path-to-adaptor>
python script/merge_dimdown.py \
    --model_dir meta-llama/Llama-2-7b-hf \
    --adaptor_path $ADAPTER
```
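Conceptually, the merge step makes the pruning physical: rows of a projection whose learned mask is zero are deleted, yielding a smaller weight matrix. A minimal sketch under that assumption (toy values; not the actual `merge_dimdown.py` code):

```python
# Toy weight and learned mask after PAT; 0.0 marks a pruned output channel.
W = [[0.5, -1.0], [2.0, 0.25], [1.5, -0.5]]
mask = [1.0, 0.0, 1.0]

# Drop the masked rows so the merged model is genuinely smaller.
W_pruned = [row for row, m in zip(W, mask) if m != 0.0]
print(len(W_pruned), len(W))  # 2 3
```

The merged matrix computes the same outputs as the masked one on the surviving channels, which is why inference after merging needs no mask at all.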
Additionally, we provide some PAT results here.
- Llama 2 7B
- Llama 2 13B
- Gemma 2B
- Gemma 7B
- Yi-1.5 34B