60 changes: 39 additions & 21 deletions olive_quantization/README.md
@@ -1,38 +1,56 @@
# OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization [[paper](https://arxiv.org/abs/2304.07493)]
## Environment Setup

![](figures/intro_victor.png)

## Abstract

Transformer-based large language models (LLMs) have achieved great success with the growing model size. LLMs' size grows by 240× every two years, which outpaces hardware progress and makes model inference increasingly costly. Model quantization is a promising approach to mitigate the widening gap between LLM size and hardware capacity. However, the existence of outliers, values with significant magnitudes, in LLMs makes existing quantization methods less effective. Prior outlier-aware quantization schemes adopt sparsity encoding techniques to separate outliers from normal values, a process that requires global coordination (e.g., a global sparsity coordination list). This incurs complex encoding/decoding hardware logic and an extra orchestration controller for the computation between outlier and normal values. As such, it is not hardware-efficient and hence only achieves sub-optimal quantization benefits.

We propose OliVe, an algorithm/architecture co-designed solution that adopts outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overheads and high performance gains. The key insight of OliVe is that outliers are important while the normal values next to them are not. Thus those normal values (called victims) can be sacrificed to accommodate outliers. This enables a memory-aligned OVP encoding scheme, which can be efficiently integrated into existing hardware accelerators like systolic arrays and tensor cores. As a result, the OliVe-based accelerator surpasses the existing outlier-aware accelerator, GOBO, with a 4.5× speedup and 4.0× energy reduction, while achieving superior model accuracy.
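
To make the OVP idea concrete, here is a minimal illustrative sketch in NumPy. It is our own simplification, not the paper's hardware format: the names `ovp_quantize` and `coarse_round`, the int4 range [-8, 7], the pairwise granularity, and the power-of-two outlier code are all assumptions chosen to illustrate the outlier-victim mechanism, not to reproduce OliVe's actual encoding.

```python
import numpy as np

def coarse_round(v, scale):
    # Stand-in for an outlier format: round to the nearest power-of-two
    # multiple of `scale`, so large magnitudes stay representable in few bits.
    return float(np.sign(v)) * scale * 2.0 ** round(float(np.log2(abs(v) / scale)))

def ovp_quantize(x, scale=0.1, threshold=0.7):
    """Toy outlier-victim pair (OVP) quantization over adjacent pairs.

    Normal pairs get plain int4 quantization; if either element is an
    outlier, the smaller-magnitude neighbour (the "victim") is zeroed and
    its encoding space is conceptually donated to the outlier, keeping the
    pair memory-aligned. Assumes len(x) is even.
    """
    q = np.zeros_like(x, dtype=float)
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        if max(abs(a), abs(b)) > threshold:
            # Outlier-victim case: keep the larger value in a coarse code,
            # sacrifice the other (it stays 0).
            if abs(a) >= abs(b):
                q[i] = coarse_round(a, scale)
            else:
                q[i + 1] = coarse_round(b, scale)
        else:
            # Normal case: symmetric int4, representable range [-8, 7] * scale.
            q[i] = np.clip(round(a / scale), -8, 7) * scale
            q[i + 1] = np.clip(round(b / scale), -8, 7) * scale
    return q

print(ovp_quantize(np.array([0.3, -0.1, 6.0, 0.2])))  # ~ [0.3, -0.1, 6.4, 0.0]
```

The last pair shows the trade: 6.0 overflows the int4 range, so its neighbour 0.2 is zeroed and the outlier survives as 6.4 under the coarse code.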

## Environment
```bash
conda create -n OliVe python=3.8
conda activate OliVe

conda install pytorch=1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

cd ./olive_quantization

pip install -r requirements.txt

pip install ./quant
```
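
After installing, a quick sanity check (a minimal sketch; expected values follow the pins above):

```python
import torch

print(torch.__version__)          # expected: 1.11.0
print(torch.version.cuda)         # expected: 11.3
print(torch.cuda.is_available())  # should be True on a CUDA-capable GPU
```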

## Paper's Hardware Configuration

+ AMD EPYC 7302 16-Core Processor
+ NVIDIA A40 GPU (48GB)

## Usage
### BERT / BART
## Adapting to LLaMA

After setting up the environment, update these packages inside the conda environment:

```bash
pip install --upgrade evaluate
pip install -U datasets
pip install --upgrade transformers==4.33
pip install accelerate==0.20.3
```
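
A quick check that the upgrades took effect (illustrative; expected values follow the pins above):

```python
import accelerate
import datasets
import evaluate
import transformers

print(transformers.__version__)  # expected: 4.33.x
print(accelerate.__version__)    # expected: 0.20.3
print(datasets.__version__)      # whatever -U resolved to
print(evaluate.__version__)
```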



## Running OliVe

```bash
cd olive_quantization/llm
./scripts/run_all.sh
```

An example invocation from run_all.sh, which you can edit by hand to match your experiment:

```bash
CUDA_VISIBLE_DEVICES=1 ./scripts/clm_run.sh LLAMA/llama-7b c4 realnewslike ant-int-flint 4 2 46666 outlier
```

Where:

We adopt the BERT and BART models for the NLP task with five datasets, MNLI, CoLA, SST-2, QQP and MRPC.
`LLAMA/llama-7b`: the folder containing the model symlink; to use an OPT model, change it to `OPT/opt-7b`.

For reproducing the results in the paper, please refer to `./bert`.
`c4 realnewslike`: the dataset selection; for the Wikitext dataset, change it to `wikitext wikitext-103-raw-v1`.

### Large Language Models
`4`: the bit-width selection; the default is 8-bit, while this example uses 4-bit.

We adopt the GPT-2, OPT and Bloom models for the NLP task with two datasets, wikitext and C4.
`2`: the batch size.

For reproducing the results in the paper, please refer to `./llm`.
All script parameter settings live in the clm_run.sh file.
Experiment results are written to ./llm/checkpoints.
Experiment logs are written to ./llm/log.
7 changes: 7 additions & 0 deletions olive_quantization/llm/=0.20.3
@@ -0,0 +1,7 @@
Requirement already satisfied: accelerate in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (0.16.0)
Requirement already satisfied: numpy>=1.17 in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (1.24.3)
Requirement already satisfied: packaging>=20.0 in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (23.2)
Requirement already satisfied: psutil in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (5.9.6)
Requirement already satisfied: pyyaml in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (6.0.1)
Requirement already satisfied: torch>=1.4.0 in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (1.11.0)
Requirement already satisfied: typing_extensions in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from torch>=1.4.0->accelerate) (4.7.1)
1 change: 1 addition & 0 deletions olive_quantization/llm/LLAMA/llama-7b
1 change: 1 addition & 0 deletions olive_quantization/llm/OPT/opt-125m
1 change: 1 addition & 0 deletions olive_quantization/llm/OPT/opt-6.7b
106 changes: 106 additions & 0 deletions olive_quantization/llm/accuracy/accuracy.py
@@ -0,0 +1,106 @@
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Accuracy metric."""

import datasets
from sklearn.metrics import accuracy_score

import evaluate


_DESCRIPTION = """
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
"""


_KWARGS_DESCRIPTION = """
Args:
predictions (`list` of `int`): Predicted labels.
references (`list` of `int`): Ground truth labels.
normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (`list` of `float`): Sample weights. Defaults to None.

Returns:
accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input if `normalize` is set to `False`. A higher score means higher accuracy.

Examples:

Example 1-A simple example
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}

Example 2-The same as Example 1, except with `normalize` set to `False`.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}

Example 3-The same as Example 1, except with `sample_weight` set.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
"""


_CITATION = """
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Accuracy(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("int32")),
"references": datasets.Sequence(datasets.Value("int32")),
}
if self.config_name == "multilabel"
else {
"predictions": datasets.Value("int32"),
"references": datasets.Value("int32"),
}
),
reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
)

def _compute(self, predictions, references, normalize=True, sample_weight=None):
return {
"accuracy": float(
accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
)
}
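
This local script is what `run_clm.py` loads in the diff below. A minimal usage check, assuming it is run from `olive_quantization/llm` so the relative path resolves:

```python
import evaluate

# Load the local metric script instead of fetching it from the hub.
metric = evaluate.load("./accuracy/accuracy.py")
print(metric.compute(references=[0, 1, 2], predictions=[0, 1, 1]))
# -> {'accuracy': 0.6666666666666666}
```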
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.24537370580455747,
"eval_loss": 5.000532150268555,
"eval_runtime": 801.0457,
"eval_samples": 289,
"eval_samples_per_second": 0.361,
"eval_steps_per_second": 0.181,
"perplexity": 148.49215822288735
}
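
The reported `perplexity` is simply `exp(eval_loss)`, which is easy to verify:

```python
import math

print(math.exp(5.000532150268555))  # ~ 148.49215822288735, matching "perplexity"
```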
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.24537370580455747,
"eval_loss": 5.000532150268555,
"eval_runtime": 801.0457,
"eval_samples": 289,
"eval_samples_per_second": 0.361,
"eval_steps_per_second": 0.181,
"perplexity": 148.49215822288735
}
57 changes: 57 additions & 0 deletions olive_quantization/llm/checkpoints/facebook/opt-125m/README.md
@@ -0,0 +1,57 @@
---
license: other
tags:
- generated_from_trainer
datasets:
- wikitext
model-index:
- name: opt-125m
results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# opt-125m

This model is a fine-tuned version of [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) on the wikitext wikitext-103-raw-v1 dataset.
It achieves the following results on the evaluation set:
- eval_loss: 4.4711
- eval_accuracy: 0.2692
- eval_runtime: 37.8287
- eval_samples_per_second: 6.424
- eval_steps_per_second: 3.225
- step: 0

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0

### Framework versions

- Transformers 4.26.1
- Pytorch 1.11.0
- Datasets 2.15.0
- Tokenizers 0.13.3
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.2692234974194353,
"eval_loss": 4.471142292022705,
"eval_runtime": 37.8287,
"eval_samples": 243,
"eval_samples_per_second": 6.424,
"eval_steps_per_second": 3.225,
"perplexity": 87.45656691585893
}
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.2692234974194353,
"eval_loss": 4.471142292022705,
"eval_runtime": 37.8287,
"eval_samples": 243,
"eval_samples_per_second": 6.424,
"eval_steps_per_second": 3.225,
"perplexity": 87.45656691585893
}
4 changes: 3 additions & 1 deletion olive_quantization/llm/run_clm.py
@@ -526,13 +526,15 @@ def tokenize_function(examples):
"Picking 1024 instead. You can change that default value by passing --block_size xxx."
)
block_size = 1024
tokenizer.model_max_length = block_size // 4  # integer division; caps tokenizer length to a quarter of block_size
else:
if data_args.block_size > tokenizer.model_max_length:
logger.warning(
f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model"
f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
)
block_size = min(data_args.block_size, tokenizer.model_max_length)
tokenizer.model_max_length = block_size // 4  # integer division, as above

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
@@ -590,7 +592,7 @@ def preprocess_logits_for_metrics(logits, labels):
logits = logits[0]
return logits.argmax(dim=-1)

metric = evaluate.load("accuracy")
metric = evaluate.load("./accuracy/accuracy.py")

def compute_metrics(eval_preds):
preds, labels = eval_preds
29 changes: 29 additions & 0 deletions olive_quantization/llm/scripts/clm_run copy.sh
@@ -0,0 +1,29 @@
transformer_model=${1:-"gpt2"}
dataset=${2:-"wikitext"}
dataset_config=${3:-"wikitext-103-raw-v1"}
q_mode=${4:-"ant-int-flint"}
q_bit=${5:-"4"}
batch_size=${6:-"8"}
port=${7:-46666}
desc=${8:-""}
n8=${9:-"0"}

mkdir -p ./log
mkdir -p ./log/bigscience
mkdir -p ./log/facebook

log_name=""
if [ "$dataset" = "wikitext" ] ; then
log_name=$transformer_model"_"$dataset_config"_"$q_bit"bit_batch"$batch_size"_"$desc
else
log_name=$transformer_model"_"$dataset"_"$q_bit"bit_batch"$batch_size"_"$desc
fi

python -u -m torch.distributed.launch --nproc_per_node=1 --master_port $port run_clm.py \
--model_name_or_path $transformer_model \
--dataset_name $dataset --dataset_config_name $dataset_config \
--output_dir checkpoints/$transformer_model \
--do_eval \
--mode=$q_mode --wbit=$q_bit --abit=$q_bit --a_low=75 --a_up=250 --w_low=75 --w_up=250 --layer_8bit_n=$n8 \
--eval_batch_size=$batch_size --train_batch_size=$batch_size --quantize_batch_size=$batch_size \
2>&1 | tee ./log/${log_name}.log
8 changes: 4 additions & 4 deletions olive_quantization/llm/scripts/clm_run.sh
@@ -3,14 +3,14 @@ dataset=${2:-"wikitext"}
dataset_config=${3:-"wikitext-103-raw-v1"}
q_mode=${4:-"ant-int-flint"}
q_bit=${5:-"4"}
batch_size=${6:-"8"}
batch_size=${6:-"4"}
port=${7:-46666}
desc=${8:-""}
n8=${9:-"0"}

mkdir -p ./log
mkdir -p ./log/bigscience
mkdir -p ./log/facebook
mkdir -p ./log/LLAMA
mkdir -p ./log/OPT

log_name=""
if [ "$dataset" = "wikitext" ] ; then
@@ -19,7 +19,7 @@ else
log_name=$transformer_model"_"$dataset"_"$q_bit"bit_batch"$batch_size"_"$desc
fi

python -u -m torch.distributed.launch --nproc_per_node=1 --master_port $port run_clm.py \
torchrun --nproc_per_node=1 --master_port $port run_clm.py \
--model_name_or_path $transformer_model \
--dataset_name $dataset --dataset_config_name $dataset_config \
--output_dir checkpoints/$transformer_model \