Add TF Example gpt-j-6B (#1446)
Signed-off-by: zehao-intel <zehao.huang@intel.com>
zehao-intel authored Dec 6, 2023
1 parent 5d33a53 commit 4f2c35d
Showing 9 changed files with 635 additions and 0 deletions.
7 changes: 7 additions & 0 deletions examples/.config/model_params_tensorflow.json
@@ -1787,6 +1787,13 @@
"main_script": "main.py",
"batch_size": 16
},
"gpt-j-6B": {
"model_src_dir": "nlp/large_language_models/quantization/ptq/gpt-j",
"dataset_location": "",
"input_model": "/tf_dataset2/models/tensorflow/gpt-j-6B",
"main_script": "main.py",
"batch_size": 1
},
"opt_125m_sq": {
"model_src_dir": "nlp/large_language_models/quantization/ptq/smoothquant",
"dataset_location": "",
6 changes: 6 additions & 0 deletions examples/README.md
@@ -289,6 +289,12 @@ Intel® Neural Compressor validated examples with multiple compression technique
<td>Post-Training Static Quantization</td>
<td><a href="./tensorflow/graph_networks/graphsage/">pb</a></td>
</tr>
<tr>
<td>EleutherAI/gpt-j-6B</td>
<td>Natural Language Processing</td>
<td>Post-Training Static Quantization</td>
<td><a href="./tensorflow/nlp/large_language_models/quantization/ptq/gpt-j">saved_model (smooth quant)</a></td>
</tr>
</tbody>
</table>

@@ -0,0 +1,157 @@
Step-by-Step
============

This document lists the steps to reproduce Intel® Neural Compressor smooth quantization of the gpt-j-6B language model with TensorFlow.

# Prerequisite

## 1. Environment

### Installation
```shell
# Install Intel® Neural Compressor
pip install neural-compressor
pip install -r requirements.txt
```

## 2. Prepare Pretrained model
Run the following script to download the gpt-j-6B saved_model to ```./gpt-j-6B```:
```
bash prepare_model.sh
```
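If you ever need to prepare the saved_model manually, a minimal sketch is shown below, assuming the model is exported from the Hugging Face `EleutherAI/gpt-j-6B` checkpoint with `transformers`; `prepare_model.sh` remains the supported path and may obtain the model differently.
```python
# Hypothetical manual export; prepare_model.sh is the supported path.
from transformers import TFGPTJForCausalLM

model = TFGPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
# saved_model=True also writes a TensorFlow SavedModel alongside the HF weights.
model.save_pretrained("./gpt-j-6B", saved_model=True)
```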

## 3. Install TensorFlow 2.11.dev202242
Build a TensorFlow pip package from the [intel-tensorflow spr_ww42 branch](https://github.com/Intel-tensorflow/tensorflow/tree/spr_ww42) and install it. For instructions on building a TensorFlow pip package from source, please refer to this [tutorial](https://www.tensorflow.org/install/source).

The performance of the int8 gpt-j-6B model is expected to improve once intel-tensorflow for GNR is released.

## 4. Prepare Dataset
The dataset will be automatically loaded.
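For reference, a rough sketch of how such data could be prepared with the Hugging Face `datasets` library is shown below; the dataset, split, and tokenization actually used by `main.py` may differ, and the variable name `mydata` is only chosen to match the dataloader shown later in this document.
```python
# Hypothetical dataset preparation; main.py loads the data automatically.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("lambada", split="validation")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
mydata = [tokenizer(sample["text"])["input_ids"] for sample in dataset]
```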

# Run

## Smooth Quantization

```shell
bash run_quant.sh --input_model=<FP32_MODEL_PATH> --output_model=<INT8_MODEL_PATH>
```
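For example, assuming the model was downloaded to `./gpt-j-6B` in the step above (the output path is illustrative):
```shell
bash run_quant.sh --input_model=./gpt-j-6B --output_model=./gpt-j-6B-int8
```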

## Benchmark

### Evaluate Performance

```shell
bash run_benchmark.sh --input_model=<MODEL_PATH> --mode=benchmark
```

### Evaluate Accuracy

```shell
bash run_benchmark.sh --input_model=<MODEL_PATH> --mode=accuracy
```
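For example, with a quantized model saved to `./gpt-j-6B-int8` (the path is illustrative):
```shell
bash run_benchmark.sh --input_model=./gpt-j-6B-int8 --mode=benchmark
bash run_benchmark.sh --input_model=./gpt-j-6B-int8 --mode=accuracy
```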


Details of enabling Intel® Neural Compressor on gpt-j-6B for TensorFlow
=========================

This is a tutorial on how to enable the gpt-j-6B model with Intel® Neural Compressor.
## User Code Analysis

The user specifies the fp32 *model*, a calibration dataloader *calib_dataloader*, and a custom *eval_func* that encapsulates the evaluation dataloader and metric.
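A minimal stub of such an *eval_func* is sketched below; it only illustrates the contract (the function receives a model and returns a single scalar metric), while the real `evaluate` in `main.py` performs the full accuracy computation.
```python
def evaluate(model):
    """Illustrative stub of a custom eval_func.

    Intel® Neural Compressor calls this with a candidate model and expects a
    single scalar metric back; the real implementation iterates over the
    evaluation data, runs inference, and accumulates the accuracy.
    """
    accuracy = 0.0
    # ... run the evaluation data through `model` and update `accuracy` here ...
    return accuracy
```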

### calib_dataloader Part Adaptation
The dataloader class below uses a generator function to provide the model with inputs.

```python
import math

import tensorflow as tf


class MyDataloader:
    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size
        self.length = math.ceil(len(dataset) / self.batch_size)

    def generate_data(self, data, pad_token_id=50256):
        # Build the input ids and attention mask for one calibration sample.
        input_ids = tf.convert_to_tensor([data[:-1]], dtype=tf.int32)
        cur_len = len(data) - 1
        input_ids_padding = tf.ones((self.batch_size, 1), dtype=tf.int32) * (pad_token_id or 0)
        generated = tf.concat([input_ids, input_ids_padding], axis=-1)
        # prepare_attention_mask_for_generation is a helper defined in this example.
        model_kwargs = {'attention_mask': prepare_attention_mask_for_generation(input_ids)}
        if model_kwargs.get("past_key_values") is None:
            input_ids = generated[:, :cur_len]
        else:
            input_ids = tf.expand_dims(generated[:, cur_len - 1], -1)
        return model_kwargs['attention_mask'], input_ids

    def __iter__(self):
        labels = None
        for _, data in enumerate(self.dataset):
            cur_input = self.generate_data(data)
            yield (cur_input, labels)

    def __len__(self):
        return self.length
```
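The helper `prepare_attention_mask_for_generation` used above is defined in the example's `main.py`. A minimal sketch of what such a helper typically does (mark non-padding tokens with 1) is shown below, assuming pad_token_id 50256; the actual implementation may differ in details.
```python
import tensorflow as tf


def prepare_attention_mask_for_generation(input_ids, pad_token_id=50256):
    # Hypothetical sketch: attention mask is 1 for real tokens, 0 for padding.
    if bool(tf.reduce_any(tf.equal(input_ids, pad_token_id))):
        return tf.cast(tf.math.not_equal(input_ids, pad_token_id), tf.int32)
    return tf.ones(tf.shape(input_ids), dtype=tf.int32)
```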

### Quantization Config
The quantization config class uses default parameter settings for running on Intel CPUs. To run this example on Intel GPUs, set the `backend` parameter to `itex` and the `device` parameter to `gpu`.

```python
from neural_compressor import PostTrainingQuantConfig

config = PostTrainingQuantConfig(
    device="gpu",
    backend="itex",
    # ... other parameters ...
)
```

### Code Update
After the preparation step is done, we add the code for quantization tuning to generate the quantized model.

First, load an INC built-in Model object from the path of the gpt-j-6B saved_model.
```python
from neural_compressor import Model
model = Model(run_args.input_model, modelType='llm_saved_model')
```

#### Tune

To apply quantization, a function that maps names from the AutoTrackable variables to graph nodes must be defined, so that node names in different formats can be matched.
```python
def weight_name_mapping(name):
    """Map a name from the AutoTrackable variables to the corresponding graph node."""
    name = name.replace('tfgptj_for_causal_lm', 'StatefulPartitionedCall')
    name = name.replace('kernel:0', 'Tensordot/ReadVariableOp')
    return name
```
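For example, a (hypothetical) GPT-J kernel variable name would be rewritten as follows; the exact variable path depends on the saved_model:
```python
print(weight_name_mapping("tfgptj_for_causal_lm/transformer/h_0/attn/q_proj/kernel:0"))
# -> StatefulPartitionedCall/transformer/h_0/attn/q_proj/Tensordot/ReadVariableOp
```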

Use the `recipes` argument to enable smooth quantization.
```python
from neural_compressor import quantization, PostTrainingQuantConfig

calib_dataloader = MyDataloader(mydata, batch_size=run_args.batch_size)
recipes = {"smooth_quant": True, "smooth_quant_args": {'alpha': 0.52705}}
conf = PostTrainingQuantConfig(
    quant_level=1,  # use basic tuning
    excluded_precisions=["bf16"],
    recipes=recipes,
    calibration_sampling_size=[1],
)

model.weight_name_mapping = weight_name_mapping
q_model = quantization.fit(
    model,
    conf,
    eval_func=evaluate,
    calib_dataloader=calib_dataloader,
)

q_model.save(run_args.output_model)
```
#### Benchmark
```python
if run_args.mode == "performance":
    from neural_compressor.benchmark import fit
    from neural_compressor.config import BenchmarkConfig

    conf = BenchmarkConfig(warmup=10, iteration=run_args.iteration, cores_per_instance=4, num_of_instance=1)
    fit(model, conf, b_func=evaluate)
elif run_args.mode == "accuracy":
    acc_result = evaluate(model.model)
    print("Batch size = %d" % run_args.batch_size)
    print("Accuracy: %.5f" % acc_result)
```

The Intel® Neural Compressor `quantization.fit()` function returns the best quantized model found within the time constraint.
