60 changes: 39 additions & 21 deletions olive_quantization/README.md
@@ -1,38 +1,56 @@
# OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization [[paper](https://arxiv.org/abs/2304.07493)]
## Environment Setup

![](figures/intro_victor.png)

## Abstract

Transformer-based large language models (LLMs) have achieved great success with the growing model size. LLMs' size grows by 240× every two years, which outpaces hardware progress and makes model inference increasingly costly. Model quantization is a promising approach to mitigate the widening gap between LLM size and hardware capacity. However, the existence of outliers, values with significant magnitudes, in LLMs makes existing quantization methods less effective. Prior outlier-aware quantization schemes adopt sparsity encoding techniques to separate outliers from normal values, a process that requires global coordination (e.g., a global sparsity coordination list). This incurs complex encoding/decoding hardware logic and an extra orchestration controller for the computation between outlier and normal values. As such, it is not hardware-efficient and hence only achieves sub-optimal quantization benefits.

We propose OliVe, an algorithm/architecture co-designed solution that adopts outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overheads and high performance gains. The key insight of OliVe is that outliers are important while the normal values next to them are not. Thus those normal values (called victims) can be sacrificed to accommodate outliers. This enables a memory-aligned OVP encoding scheme, which can be efficiently integrated into existing hardware accelerators like systolic arrays and tensor cores. As a result, the OliVe-based accelerator surpasses the existing outlier-aware accelerator, GOBO, with a 4.5× speedup and 4.0× energy reduction, while achieving superior model accuracy.
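
To make the OVP idea concrete, here is a minimal illustrative sketch in NumPy. It is our own simplification, not the paper's hardware format: the names `ovp_quantize` and `coarse_round`, the int4 range [-8, 7], the pairwise granularity, and the power-of-two outlier code are all assumptions chosen to illustrate the outlier-victim mechanism, not to reproduce OliVe's actual encoding.

```python
import numpy as np

def coarse_round(v, scale):
    # Stand-in for an outlier format: round to the nearest power-of-two
    # multiple of `scale`, so large magnitudes stay representable in few bits.
    return float(np.sign(v)) * scale * 2.0 ** round(float(np.log2(abs(v) / scale)))

def ovp_quantize(x, scale=0.1, threshold=0.7):
    """Toy outlier-victim pair (OVP) quantization over adjacent pairs.

    Normal pairs get plain int4 quantization; if either element is an
    outlier, the smaller-magnitude neighbour (the "victim") is zeroed and
    its encoding space is conceptually donated to the outlier, keeping the
    pair memory-aligned. Assumes len(x) is even.
    """
    q = np.zeros_like(x, dtype=float)
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        if max(abs(a), abs(b)) > threshold:
            # Outlier-victim case: keep the larger value in a coarse code,
            # sacrifice the other (it stays 0).
            if abs(a) >= abs(b):
                q[i] = coarse_round(a, scale)
            else:
                q[i + 1] = coarse_round(b, scale)
        else:
            # Normal case: symmetric int4, representable range [-8, 7] * scale.
            q[i] = np.clip(round(a / scale), -8, 7) * scale
            q[i + 1] = np.clip(round(b / scale), -8, 7) * scale
    return q

print(ovp_quantize(np.array([0.3, -0.1, 6.0, 0.2])))  # ~ [0.3, -0.1, 6.4, 0.0]
```

The last pair shows the trade: 6.0 overflows the int4 range, so its neighbour 0.2 is zeroed and the outlier survives as 6.4 under the coarse code.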

## Environment
```bash
conda create -n OliVe python=3.8
conda activate OliVe

conda install pytorch=1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

cd ./olive_quantization

pip install -r requirements.txt

pip install ./quant
```
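
After installing, a quick sanity check (a minimal sketch; expected values follow the pins above):

```python
import torch

print(torch.__version__)          # expected: 1.11.0
print(torch.version.cuda)         # expected: 11.3
print(torch.cuda.is_available())  # should be True on a CUDA-capable GPU
```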

## Paper's Hardware Configuration

+ AMD EPYC 7302 16-Core Processor
+ NVIDIA A40 GPU (48GB)

## Usage
### BERT / BART
## Adapting to LLaMA

After setting up the environment, update these packages inside the conda environment:

```bash
pip install --upgrade evaluate
pip install -U datasets
pip install --upgrade transformers==4.33
pip install accelerate==0.20.3
```
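
A quick check that the upgrades took effect (illustrative; expected values follow the pins above):

```python
import accelerate
import datasets
import evaluate
import transformers

print(transformers.__version__)  # expected: 4.33.x
print(accelerate.__version__)    # expected: 0.20.3
print(datasets.__version__)      # whatever -U resolved to
print(evaluate.__version__)
```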



## Running OliVe

```bash
cd olive_quantization/llm
./scripts/run_all.sh
```

An example invocation from run_all.sh, which you can edit by hand to match your experiment:

```bash
CUDA_VISIBLE_DEVICES=1 ./scripts/clm_run.sh LLAMA/llama-7b c4 realnewslike ant-int-flint 4 2 46666 outlier
```

Where:

We adopt the BERT and BART models for the NLP task with five datasets, MNLI, CoLA, SST-2, QQP and MRPC.
`LLAMA/llama-7b`: the folder containing the model symlink; to use an OPT model, change it to `OPT/opt-7b`.

For reproducing the results in the paper, please refer to `./bert`.
`c4 realnewslike`: the dataset selection; for the Wikitext dataset, change it to `wikitext wikitext-103-raw-v1`.

### Large Language Models
`4`: the bit-width selection; the default is 8-bit, while this example uses 4-bit.

We adopt the GPT-2, OPT and Bloom models for the NLP task with two datasets, wikitext and C4.
`2`: the batch size.

For reproducing the results in the paper, please refer to `./llm`.
All script parameter settings live in the clm_run.sh file.
Experiment results are written to ./llm/checkpoints.
Experiment logs are written to ./llm/log.
7 changes: 7 additions & 0 deletions olive_quantization/llm/=0.20.3
@@ -0,0 +1,7 @@
Requirement already satisfied: accelerate in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (0.16.0)
Requirement already satisfied: numpy>=1.17 in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (1.24.3)
Requirement already satisfied: packaging>=20.0 in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (23.2)
Requirement already satisfied: psutil in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (5.9.6)
Requirement already satisfied: pyyaml in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (6.0.1)
Requirement already satisfied: torch>=1.4.0 in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from accelerate) (1.11.0)
Requirement already satisfied: typing_extensions in /home/gaozh/Software/miniconda3/envs/OliVe/lib/python3.8/site-packages (from torch>=1.4.0->accelerate) (4.7.1)
1 change: 1 addition & 0 deletions olive_quantization/llm/LLAMA/llama-7b
1 change: 1 addition & 0 deletions olive_quantization/llm/OPT/opt-125m
1 change: 1 addition & 0 deletions olive_quantization/llm/OPT/opt-6.7b
106 changes: 106 additions & 0 deletions olive_quantization/llm/accuracy/accuracy.py
@@ -0,0 +1,106 @@
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Accuracy metric."""

import datasets
from sklearn.metrics import accuracy_score

import evaluate


_DESCRIPTION = """
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
"""


_KWARGS_DESCRIPTION = """
Args:
predictions (`list` of `int`): Predicted labels.
references (`list` of `int`): Ground truth labels.
normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (`list` of `float`): Sample weights. Defaults to None.

Returns:
accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input if `normalize` is set to `False`. A higher score means higher accuracy.

Examples:

Example 1-A simple example
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}

Example 2-The same as Example 1, except with `normalize` set to `False`.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}

Example 3-The same as Example 1, except with `sample_weight` set.
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
"""


_CITATION = """
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Accuracy(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("int32")),
"references": datasets.Sequence(datasets.Value("int32")),
}
if self.config_name == "multilabel"
else {
"predictions": datasets.Value("int32"),
"references": datasets.Value("int32"),
}
),
reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
)

def _compute(self, predictions, references, normalize=True, sample_weight=None):
return {
"accuracy": float(
accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
)
}
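
This local script is what `run_clm.py` loads in the diff below. A minimal usage check, assuming it is run from `olive_quantization/llm` so the relative path resolves:

```python
import evaluate

# Load the local metric script instead of fetching it from the hub.
metric = evaluate.load("./accuracy/accuracy.py")
print(metric.compute(references=[0, 1, 2], predictions=[0, 1, 1]))
# -> {'accuracy': 0.6666666666666666}
```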
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.24537370580455747,
"eval_loss": 5.000532150268555,
"eval_runtime": 801.0457,
"eval_samples": 289,
"eval_samples_per_second": 0.361,
"eval_steps_per_second": 0.181,
"perplexity": 148.49215822288735
}
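
The reported `perplexity` is simply `exp(eval_loss)`, which is easy to verify:

```python
import math

print(math.exp(5.000532150268555))  # ~ 148.49215822288735, matching "perplexity"
```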
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.24537370580455747,
"eval_loss": 5.000532150268555,
"eval_runtime": 801.0457,
"eval_samples": 289,
"eval_samples_per_second": 0.361,
"eval_steps_per_second": 0.181,
"perplexity": 148.49215822288735
}
57 changes: 57 additions & 0 deletions olive_quantization/llm/checkpoints/facebook/opt-125m/README.md
@@ -0,0 +1,57 @@
---
license: other
tags:
- generated_from_trainer
datasets:
- wikitext
model-index:
- name: opt-125m
results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# opt-125m

This model is a fine-tuned version of [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) on the wikitext wikitext-103-raw-v1 dataset.
It achieves the following results on the evaluation set:
- eval_loss: 4.4711
- eval_accuracy: 0.2692
- eval_runtime: 37.8287
- eval_samples_per_second: 6.424
- eval_steps_per_second: 3.225
- step: 0

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0

### Framework versions

- Transformers 4.26.1
- Pytorch 1.11.0
- Datasets 2.15.0
- Tokenizers 0.13.3
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.2692234974194353,
"eval_loss": 4.471142292022705,
"eval_runtime": 37.8287,
"eval_samples": 243,
"eval_samples_per_second": 6.424,
"eval_steps_per_second": 3.225,
"perplexity": 87.45656691585893
}
@@ -0,0 +1,9 @@
{
"eval_accuracy": 0.2692234974194353,
"eval_loss": 4.471142292022705,
"eval_runtime": 37.8287,
"eval_samples": 243,
"eval_samples_per_second": 6.424,
"eval_steps_per_second": 3.225,
"perplexity": 87.45656691585893
}
4 changes: 3 additions & 1 deletion olive_quantization/llm/run_clm.py
@@ -526,13 +526,15 @@ def tokenize_function(examples):
"Picking 1024 instead. You can change that default value by passing --block_size xxx."
)
block_size = 1024
tokenizer.model_max_length = block_size // 4  # integer division; caps tokenizer length to a quarter of block_size
else:
if data_args.block_size > tokenizer.model_max_length:
logger.warning(
f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model"
f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
)
block_size = min(data_args.block_size, tokenizer.model_max_length)
tokenizer.model_max_length = block_size // 4  # integer division, as above

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
@@ -590,7 +592,7 @@ def preprocess_logits_for_metrics(logits, labels):
logits = logits[0]
return logits.argmax(dim=-1)

metric = evaluate.load("accuracy")
metric = evaluate.load("./accuracy/accuracy.py")

def compute_metrics(eval_preds):
preds, labels = eval_preds
29 changes: 29 additions & 0 deletions olive_quantization/llm/scripts/clm_run copy.sh
@@ -0,0 +1,29 @@
transformer_model=${1:-"gpt2"}
dataset=${2:-"wikitext"}
dataset_config=${3:-"wikitext-103-raw-v1"}
q_mode=${4:-"ant-int-flint"}
q_bit=${5:-"4"}
batch_size=${6:-"8"}
port=${7:-46666}
desc=${8:-""}
n8=${9:-"0"}

mkdir -p ./log
mkdir -p ./log/bigscience
mkdir -p ./log/facebook

log_name=""
if [ "$dataset" = "wikitext" ] ; then
log_name=$transformer_model"_"$dataset_config"_"$q_bit"bit_batch"$batch_size"_"$desc
else
log_name=$transformer_model"_"$dataset"_"$q_bit"bit_batch"$batch_size"_"$desc
fi

python -u -m torch.distributed.launch --nproc_per_node=1 --master_port $port run_clm.py \
--model_name_or_path $transformer_model \
--dataset_name $dataset --dataset_config_name $dataset_config \
--output_dir checkpoints/$transformer_model \
--do_eval \
--mode=$q_mode --wbit=$q_bit --abit=$q_bit --a_low=75 --a_up=250 --w_low=75 --w_up=250 --layer_8bit_n=$n8 \
--eval_batch_size=$batch_size --train_batch_size=$batch_size --quantize_batch_size=$batch_size \
2>&1 | tee ./log/${log_name}.log
8 changes: 4 additions & 4 deletions olive_quantization/llm/scripts/clm_run.sh
@@ -3,14 +3,14 @@ dataset=${2:-"wikitext"}
dataset_config=${3:-"wikitext-103-raw-v1"}
q_mode=${4:-"ant-int-flint"}
q_bit=${5:-"4"}
batch_size=${6:-"8"}
batch_size=${6:-"4"}
port=${7:-46666}
desc=${8:-""}
n8=${9:-"0"}

mkdir -p ./log
mkdir -p ./log/bigscience
mkdir -p ./log/facebook
mkdir -p ./log/LLAMA
mkdir -p ./log/OPT

log_name=""
if [ "$dataset" = "wikitext" ] ; then
@@ -19,7 +19,7 @@ else
log_name=$transformer_model"_"$dataset"_"$q_bit"bit_batch"$batch_size"_"$desc
fi

python -u -m torch.distributed.launch --nproc_per_node=1 --master_port $port run_clm.py \
torchrun --nproc_per_node=1 --master_port $port run_clm.py \
--model_name_or_path $transformer_model \
--dataset_name $dataset --dataset_config_name $dataset_config \
--output_dir checkpoints/$transformer_model \