
Commit c565e96

Change the calib dataset to pile-10k (#1518)
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
1 parent c58aeaa commit c565e96

File tree

3 files changed: +67 −32 lines


examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md

Lines changed: 50 additions & 21 deletions
@@ -44,9 +44,18 @@ Intel® Neural Compressor provides support for pruning and model slimming operat
 
 Through experimental verification, it has been observed that pruning the Multi-Layer Perceptron (MLP) layers using a channel-wise pattern can achieve a sparsity level of 10%-20%. This pruning technique speeds up inference while maintaining an accuracy drop of less than 1%. [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py).
 
-The pruning patterns of 1x1 and N:M are supported through the use of the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py), It is possible to prune models up to 70B in size within two hours, achieving a sparsity of 40%-50% in both the Multi-Head Attention (MHA) and MLP layers. For models of 7B and above, the drop in accuracy is less than 1%.
+The pruning patterns of 1x1 and N:M are supported through the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py). It is possible to prune models up to 70B in size within two hours, achieving a sparsity of 40%-60% in both the Multi-Head Attention (MHA) and MLP layers. For models of 7B and above, the drop in accuracy is less than 1%.
+
+Note that pruning for models of 30 billion parameters and above can be done on a single GPU card (such as the A100), while evaluation is recommended to be performed using multiple cards:
+```shell
+CUDA_VISIBLE_DEVICES=0,1 \
+python examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py \
+    --model_name_or_path /PATH/TO/SPARSE/LLM/ \
+    --device=0 \
+    --eval_dtype 'bf16' \
+    --per_device_eval_batch_size 2
+```
 
-Pruning scripts are available for LLM sparse models such as GPT-j, BLOOM, OPT, LLaMA, and the sparse model can be obtained by modifying the pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).
+Pruning scripts are available for LLM sparse models such as GPT-j, BLOOM, OPT, LLaMA, Qwen, ChatGLM, MPT, and Falcon; the sparse model can be obtained by modifying the pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).
 
 <br />
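The N:M pattern referenced above constrains each group of M consecutive weights to keep only N non-zero entries. Below is a minimal, illustrative sketch of that structural constraint only; it is not the SparseGPT selection procedure (which picks the retained weights using calibration data), and the `nm_mask` helper is hypothetical.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Boolean mask keeping the n largest-magnitude weights in every group of m along the input dim."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "the input dimension must be divisible by m"
    groups = weight.abs().reshape(out_features, in_features // m, m)
    keep = groups.topk(n, dim=-1).indices          # positions of the n largest entries per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(out_features, in_features)

w = torch.randn(8, 16)
mask = nm_mask(w, n=2, m=4)        # a 2:4 pattern, i.e. 50% structured sparsity
print(mask.float().mean().item())  # -> 0.5
```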

@@ -71,27 +80,33 @@ The last word acc of the channel-wise sparse model is shown in the following tab
 | bigscience/bloom-7b1 | CLM | pile_10k | lambada_openai | BF16 | 0.5723 | 0.5756 | 0.58% |
 
 
-
 ## SparseGPT Results
 
 The last word acc of the 1x1 pattern sparse model using [the sparseGPT script](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_llm_sparsegpt.sh) is shown in the following table.
 
+
 | Model | Task | Calibration dataset | Evaluation dataset | Sparsity | Precision | Dense last word accuracy | Sparse last word accuracy | Relative drop |
 | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----:|
-| meta-llama/Llama-2-7b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 30% | FP32 | 0.7392 | 0.7320 | -0.97% |
-| meta-llama/Llama-2-7b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 30% | BF16 | 0.7365 | 0.7304 | -1.19% |
-| EleutherAI/gpt-j-6b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6831 | 0.6922 | +1.33% |
-| EleutherAI/gpt-j-6b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6771 | 0.6874 | +0.63% |
-| decapoda-research/llama-7b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7361 | 0.7332 | -0.39% |
-| decapoda-research/llama-7b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.7326 | 0.7297 | -0.87% |
-| facebook/opt-6.7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6769 | 0.6616 | -2.26% |
-| facebook/opt-6.7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6730 | 0.6577 | -2.84% |
-| tiiuae/falcon-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7467 | 0.7528 | +0.82% |
-| tiiuae/falcon-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.7464 | 0.7502 | +0.47% |
-| bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.5764 | 0.5606 | -2.74% |
-| bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.5725 | 0.5587 | -3.07% |
-| mosaicml/mpt-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7056 | 0.7035 | -0.30% |
-| mosaicml/mpt-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6831 | 0.6856 | -2.83% |
+| EleutherAI/gpt-j-6b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.6831 | 0.6922 | +2.30% |
+| EleutherAI/gpt-j-6b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.6781 | 0.6874 | +1.48% |
+| meta-llama/Llama-2-7b-hf | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.7392 | 0.7411 | +0.26% |
+| meta-llama/Llama-2-7b-hf | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.7361 | 0.7376 | -0.22% |
+| huggyllama/llama-7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.7361 | 0.7450 | +1.21% |
+| huggyllama/llama-7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.7308 | 0.7427 | +0.90% |
+| facebook/opt-6.7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.6769 | 0.6897 | +1.89% |
+| facebook/opt-6.7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.6765 | 0.6856 | +1.29% |
+| tiiuae/falcon-7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.7467 | 0.7555 | +1.18% |
+| tiiuae/falcon-7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.7467 | 0.7561 | +1.26% |
+| bigscience/bloom-7b1 | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.5764 | 0.5768 | +0.07% |
+| bigscience/bloom-7b1 | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.5731 | 0.5738 | -0.45% |
+| mosaicml/mpt-7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.7056 | 0.7114 | +0.82% |
+| mosaicml/mpt-7b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.6831 | 0.6920 | -1.93% |
+| THUDM/chatglm3-6b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.5888 | 0.5822 | -1.12% |
+| THUDM/chatglm3-6b | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.5878 | 0.5812 | -1.29% |
+| mistralai/Mistral-7B-v0.1 | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.7590 | 0.7803 | +2.81% |
+| mistralai/Mistral-7B-v0.1 | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.7561 | 0.7770 | +2.37% |
+| Qwen/Qwen-7B | CLM | NeelNanda/pile-10k | lambada_openai | 40% | FP32 | 0.6996 | 0.7085 | +1.27% |
+| Qwen/Qwen-7B | CLM | NeelNanda/pile-10k | lambada_openai | 40% | BF16 | 0.6959 | 0.7077 | +1.16% |
 | mosaicml/mpt-7b-chat | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6550 | 0.6561 | +0.17% |
 | mosaicml/mpt-7b-chat | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6456 | 0.6451 | -1.51% |
 | meta-llama/Llama-2-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7679 | 0.7629 | -0.65% |
@@ -100,16 +115,30 @@ The last word acc of the 1x1 pattern sparse model using [the sparseGPT script](h
 | decapoda-research/llama-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 50% | BF16 | 0.7599 | 0.7559 | -0.89% |
 | meta-llama/Llama-2-70b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 60% | FP32 | 0.7964 | 0.7951 | -0.16% |
 | meta-llama/Llama-2-70b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 60% | BF16 | 0.7937 | 0.7943 | -0.26% |
-| Qwen/Qwen-72B | CLM | wikitext-2-raw-v1 | lambada_openai | 60% | FP32 | - | - | - |
-| Qwen/Qwen-72B | CLM | wikitext-2-raw-v1 | lambada_openai | 60% | BF16 | 0.7673 | 0.7813 | - |
-
+| Qwen/Qwen-72B | CLM | wikitext-2-raw-v1 | lambada_openai | 60% | FP32 | 0.7702 | 0.7859 | +2.04% |
+| Qwen/Qwen-72B | CLM | wikitext-2-raw-v1 | lambada_openai | 60% | BF16 | 0.7673 | 0.7813 | +1.44% |
 
+<!-- discarded data -->
+<!-- | meta-llama/Llama-2-7b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 30% | FP32 | 0.7392 | 0.7320 | -0.97% |
+| meta-llama/Llama-2-7b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 30% | BF16 | 0.7361 | 0.7304 | -1.19% |
+| EleutherAI/gpt-j-6b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6831 | 0.6922 | +1.33% |
+| EleutherAI/gpt-j-6b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6781 | 0.6874 | +0.63% |
+| huggyllama/llama-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7361 | 0.7332 | -0.39% |
+| huggyllama/llama-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.7308 | 0.7297 | -0.87% |
+| facebook/opt-6.7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6769 | 0.6616 | -2.26% |
+| facebook/opt-6.7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6765 | 0.6577 | -2.84% |
+| tiiuae/falcon-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7467 | 0.7528 | +0.82% |
+| tiiuae/falcon-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.7467 | 0.7502 | +0.47% |
+| bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.5764 | 0.5606 | -2.74% |
+| bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.5731 | 0.5587 | -3.07% |
+| mosaicml/mpt-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.7056 | 0.7035 | -0.30% |
+| mosaicml/mpt-7b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.6831 | 0.6839 | -2.83% | -->
 
 ## References
 
 [1] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116.
 
-[2] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv.org/abs/2301.00774.
+[2] Frantar, E. and Alistarh, D., 2023, July. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning (pp. 10323-10337). PMLR.
 
examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py

Lines changed: 15 additions & 7 deletions
@@ -49,9 +49,8 @@ def skip(*args, **kwargs):
 from transformers.utils import check_min_version, send_example_telemetry
 from transformers.utils.versions import require_version
 from timers import CPUTimer, GPUTimer
-from neural_compressor.training import WeightPruningConfig
-from neural_compressor.compression.pruner import (prepare_pruning,
-                                                  parse_auto_slim_config)
+from neural_compressor.training import WeightPruningConfig, prepare_pruning
+from neural_compressor.compression.pruner import (parse_auto_slim_config)
 from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate
 
 check_min_version("4.27.0.dev0")
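As context for the import change above, `prepare_pruning` is now taken from `neural_compressor.training` alongside `WeightPruningConfig`. Below is a minimal sketch of how the two fit together, mirroring the `prepare_pruning(model, configs, dataloader=..., device=...)` call that appears later in this diff; the `WeightPruningConfig` keyword arguments and the `"sparse_gpt"` pruning_type string are assumptions inferred from this example's CLI flags, not copied from the script.

```python
from neural_compressor.training import WeightPruningConfig, prepare_pruning

def prune_with_sparsegpt(model, calib_dataloader, device="cuda:0",
                         target_sparsity=0.5, pattern="1x1"):
    """Configure a SparseGPT-style one-shot pruning pass over a calibration dataloader."""
    configs = WeightPruningConfig(
        target_sparsity=target_sparsity,  # mirrors --target_sparsity
        pattern=pattern,                  # mirrors --pruning_pattern (e.g. "1x1")
        pruning_type="sparse_gpt",        # assumed identifier for the SparseGPT criterion
    )
    # Same call shape as later in this diff:
    #   pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device=device)
    return prepare_pruning(model, configs, dataloader=calib_dataloader, device=device)
```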
@@ -70,7 +69,7 @@ def parse_args():
     parser.add_argument(
         "--calibration_dataset_name",
         type=str,
-        default="wikitext-2-raw-v1",
+        default="NeelNanda/pile-10k",  # e.g. wikitext-2-raw-v1
         help="The name of the pruning dataset to use (via the datasets library).",
     )
     parser.add_argument(
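With this change the calibration text defaults to the `NeelNanda/pile-10k` dataset on the Hugging Face Hub; the previous behaviour can still be selected via `--calibration_dataset_name wikitext-2-raw-v1`. A minimal sketch of drawing a calibration sample with the `datasets` library is shown below; the split name is an assumption, and the real script additionally tokenizes and groups the text before building its dataloader.

```python
from datasets import load_dataset

raw = load_dataset("NeelNanda/pile-10k", split="train")  # ~10k documents sampled from the Pile
calib = raw.shuffle(seed=42).select(range(128))          # 128 matches the --calib_size default added later in this diff
print(len(calib), calib.column_names)                    # sample count and column names
```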
@@ -129,6 +128,12 @@ def parse_args():
         default=16,
         help="Batch size (per device) for the evaluation dataloader.",
     )
+    parser.add_argument(
+        "--calib_size",
+        type=int,
+        default=128,
+        help="sample size for the calibration dataset.",
+    )
     parser.add_argument(
         "--learning_rate",
         type=float,
@@ -403,8 +408,9 @@ def main():
             from_tf=bool(".ckpt" in args.model_name_or_path),
             config=config,
             trust_remote_code=args.trust_remote_code,
-            low_cpu_mem_usage=args.low_cpu_mem_usage,
+            low_cpu_mem_usage=args.low_cpu_mem_usage
         )
+
 
     else:
         logger.info("Training new model from scratch")
@@ -493,7 +499,7 @@ def group_texts(examples):
     train_dataset = lm_datasets["train"]
 
     # DataLoaders creation:
-    train_dataset = train_dataset.shuffle(seed=42).select(range(128))
+    train_dataset = train_dataset.shuffle(seed=42).select(range(args.calib_size))
     total_batch_size = args.per_device_train_batch_size
     if local_rank != -1:
         total_batch_size *= WORLD_SIZE
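Following the `# DataLoaders creation:` comment above, the selected calibration subset is wrapped in a plain PyTorch dataloader. Below is a hypothetical sketch of that step using the usual Hugging Face example-script pattern; the collator and default batch size are assumptions and may not match `run_clm_sparsegpt.py` line for line.

```python
from torch.utils.data import DataLoader
from transformers import default_data_collator

def build_calib_dataloader(train_dataset, calib_size=128, per_device_train_batch_size=1):
    # As in the diff above: draw a fixed-size, reproducible sample for calibration.
    subset = train_dataset.shuffle(seed=42).select(range(calib_size))
    return DataLoader(
        subset,
        shuffle=False,                       # calibration does not need reshuffling
        collate_fn=default_data_collator,
        batch_size=per_device_train_batch_size,
    )
```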
@@ -544,8 +550,10 @@ def group_texts(examples):
     torch.backends.cudnn.allow_tf32 = False
     use_cache = model.config.use_cache
     model.config.use_cache = False
-
+    import time
+    s = time.time()
     pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device=device)
+    logger.info(f"cost time: {time.time() - s}")
     model.config.use_cache = use_cache
 
     if args.output_dir is not None:

examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_llm_sparsegpt.sh

Lines changed: 2 additions & 4 deletions
@@ -11,13 +11,11 @@ export CUBLAS_WORKSPACE_CONFIG=':4096:8'
 #cd neural-compressor
 python examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py \
     --model_name_or_path /PATH/TO/LLM/ \
-    --calibration_dataset_name wikitext-2-raw-v1 \
-    --evaluation_dataset_name lambada \
     --do_prune \
     --device=0 \
     --output_dir=/PATH/TO/SAVE/ \
+    --eval_dtype 'bf16' \
+    --per_device_eval_batch_size 16 \
     --target_sparsity 0.5 \
     --pruning_pattern 1x1
 
-
-
