Merge with main (#1)
* Update beam_search_topk_kernels.cu

fix: fix a bug in beam search

* fix: change int to int64_t in some kernels to prevent overflow
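
  For a sense of scale, the quick Python check below (with hypothetical shapes, not taken from the repository) shows how the element count of a large attention buffer exceeds the 32-bit signed integer range, which is why widening the index type to int64_t matters:

  ```python
  # Illustrative only: why 32-bit index arithmetic can overflow for large
  # attention buffers holding batch * heads * seq_len * seq_len elements.
  INT32_MAX = 2**31 - 1

  batch_size, head_num, seq_len = 8, 32, 4096        # hypothetical shapes
  elements = batch_size * head_num * seq_len * seq_len

  print(f"elements        = {elements:,}")            # 4,294,967,296
  print(f"INT32_MAX       = {INT32_MAX:,}")           # 2,147,483,647
  print(f"overflows int32 = {elements > INT32_MAX}")  # True
  ```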

* fix: gpt tensor shapes inconsistency (NVIDIA#505)

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Update gpt_guide.md (NVIDIA#529)

* fix: fix a GPT buffer bug and a GPT GEMM overflow

* Update T5DecodingWeight.cc

fix: fix a loading bug in T5

* [Enhancement]add pytorch backend support for gptneox (NVIDIA#550)

* add pytorch backend support for gptneox

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* fix invalid early stopping

* 1) Removed some unused parameters and logic. 2) Reverted revisions that would affect pipeline parallelism. 3) Made the code capable of direct validation on TabbyML/NeoX-1.3B.

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Rename classes, removing 'parallel' from their names

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Format the code.

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Only print results when rank is 0.

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Add dist.init_process_group().

Signed-off-by: AkiyamaYummy <842720660@qq.com>
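
A minimal sketch of the pattern described by the two bullets above ("Only print results when rank is 0." and "Add dist.init_process_group()."), using the standard `torch.distributed` API: initialize the process group once, then gate printing on rank 0 so output is not duplicated across tensor/pipeline parallel ranks. The helper function and the backend choice below are illustrative assumptions, not the repository's example script.

```python
# Sketch of "init_process_group + print only on rank 0"; illustrative, not
# the actual gptneox_example.py.
import torch.distributed as dist


def run_inference():
    # Placeholder for the model forward pass in the real example script.
    return ["generated text"]


def main():
    # Initialize once per process. The "mpi" backend is an assumption and
    # requires an MPI-enabled PyTorch build; "nccl"/"gloo" also work with
    # the usual MASTER_ADDR/MASTER_PORT environment variables.
    if not dist.is_initialized():
        dist.init_process_group(backend="mpi")

    outputs = run_inference()

    # Every rank runs the model, but only rank 0 prints the results.
    if dist.get_rank() == 0:
        print(outputs)


if __name__ == "__main__":
    main()
```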

* update docs

Signed-off-by: AkiyamaYummy <842720660@qq.com>

---------

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Update cublasMMWrapper.cc

Fix the CUBLAS_VERSION check in cublasMMWrapper

* Update cublasMMWrapper.cc

* fix overflow in softmax_kernel when processing long seqlen and big batch_size (NVIDIA#524)

* Update unfused_attention_kernels.cu

fix a bug in the softmax kernel

* [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569)

* create huggingface_gptneox_convert.py

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* adjust handling of HF's multiple bin files

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* update gptneox_guide.md

Signed-off-by: AkiyamaYummy <842720660@qq.com>

---------

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* perf(bloom): improve performance of huggingface_bloom_convert.py, decreasing the time cost and memory usage (NVIDIA#568)

Co-authored-by: r.yang <r.yang@tianrang-inc.com>

* Fix/gpt early stop (NVIDIA#584)

* fix: fix a bug in GPT early stopping

* [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672)

FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results.
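
As a rough sketch of the algorithm described above (simulated with NumPy over in-memory buffers, not the FasterTransformer CUDA kernel): the first shot is a reduce-scatter, after which device d owns the reduced copy of chunk d, and the second shot is an all-gather in which every device collects each chunk from its owning device. A wrong offset in that gather step is the kind of bug that leaves only device 0 with complete results.

```python
# Two-shot all-reduce simulated over per-"device" NumPy buffers:
# reduce-scatter, then all-gather. Illustrative, not the FT kernel.
import numpy as np


def two_shot_all_reduce(buffers):
    """buffers: one equal-length 1-D array per simulated device."""
    n_dev = len(buffers)
    chunks = [np.array_split(b, n_dev) for b in buffers]

    # Shot 1 (reduce-scatter): device d ends up owning the reduced chunk d.
    reduced = [sum(chunks[src][d] for src in range(n_dev)) for d in range(n_dev)]

    # Shot 2 (all-gather): every device pulls chunk `owner` from device `owner`.
    # Indexing by the owning device here is the step the fix above concerns.
    return [np.concatenate([reduced[owner] for owner in range(n_dev)])
            for _ in range(n_dev)]


# Example: 4 devices, each starting with a different vector.
devices = [np.arange(8, dtype=np.float32) + r for r in range(4)]
result = two_shot_all_reduce(devices)
assert all(np.allclose(out, sum(devices)) for out in result)  # all ranks agree
```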

* fix: swap tensor bug (NVIDIA#683)

* Support size_per_head=112 (NVIDIA#660)

* fix multi-gpu build

* add support for size_per_head=112 for gpt decoder

* remove mpi_cxx from multi-gpu build for now (NVIDIA#705)

---------

Signed-off-by: AkiyamaYummy <842720660@qq.com>
Co-authored-by: byshiue <bhsueh@nvidia.com>
Co-authored-by: _yummy_ <842720660@qq.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Co-authored-by: 杨睿 <595403043@qq.com>
Co-authored-by: r.yang <r.yang@tianrang-inc.com>
Co-authored-by: Rahul Kindi <rkindi@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Daya Khudia <37562707+dskhudia@users.noreply.github.com>
Co-authored-by: Dean Wyatte <2512762+dwyatte@users.noreply.github.com>
Authored by 11 people on Jul 11, 2023
Parent: 303e052 · Commit: 743369a
Showing 44 changed files with 1,867 additions and 170 deletions.
README.md (4 additions, 0 deletions)

@@ -61,6 +61,7 @@ FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
@@ -212,6 +213,9 @@ In the experiments of decoding, we updated the following parameters:

### Changelog

May 2023
- Fix bugs of generation early stopping

January 2023
- Support GPT MoE
- Support FP8 for Bert and GPT (**Experimental**)
docs/gpt_guide.md (1 addition, 1 deletion)

@@ -458,7 +458,7 @@ python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o .
2. Run GPT on PyTorch
Basically, `gpt_example.py` includes the example how to declare a model, load a ckeckpoint, and forward context inputs and get generated outputs in Pytorch.
Basically, `gpt_example.py` includes the example how to declare a model, load a checkpoint, and forward context inputs and get generated outputs in Pytorch.
For generating outputs based on context inputs, create a text file including the context inputs (line by line) and set `--sample_file_input` to the text file path. (By default, the script will generate outputs without context inputs.) Set `--sample_file_output` to write the outputs to a file. Use `--data_type fp16/bf16` to run in FP16 or BF16.
docs/gptneox_guide.md (49 additions, 7 deletions)

@@ -36,6 +36,7 @@ We provide the environment variables to tune for specific usage.

* Checkpoint converter
* EleutherAI
* HuggingFace
* Data type
* FP32
* FP16
@@ -46,7 +47,7 @@ We provide the environment variables to tune for specific usage.
* Bad words list
* Beam search and sampling are both supported

## Setup
## Setup from EleutherAI checkpoint

### Requirements

@@ -72,6 +73,22 @@ You may download the tokenizer config [here](https://mystic.the-eye.eu/public/AI

To tokenize/detokenize files, use the script found in `examples/pytorch/gptneox/utils/hftokenizer.py`. You may need to pass the path to the tokenizer config with the `--tokenizer` flag.

## Setup from HuggingFace checkpoint

> Please checkout https://huggingface.co/docs to learn more about the usage of the huggingface models and tokenizers.
First download a huggingface checkpoint:

```bash
git lfs clone https://huggingface.co/<MODEL_GROUP>/<MODEL_NAME>
```

Then use the script provided by FasterTransformer to convert the checkpoint to raw weights, understood by FT. You can change `-i_g` to specify the tensor parallelism size.

```bash
python ../examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py -i ../path/to/your/model -o ../../path/to/fastertransformer/model -i_g 1 -m_n gptneox
```

### Run GPT-NeoX

* Generate the `gemm_config.in` file.\
@@ -89,14 +106,39 @@ To tokenize/detokenize files, use the script found in `examples/pytorch/gptneox/
mpirun -n 2 --allow-run-as-root ./bin/gptneox_example
```

E.g. by setting the `data_type` of `gptneox_config.ini` to `fp16`, users can run gpt model under fp16.
E.g. by setting the `data_type` of `gptneox_config.ini` to `fp16`, users can run gpt model under fp16.

You can then decode the `out` file with the tokenizer:

You can then decode the `out` file with the tokenizer:
```bash
wget https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json
../examples/pytorch/gptneox/utils/hftokenizer.py out --tokenizer 20B_tokenizer.json
```

* Run GPT on PyTorch

Basically, `gptneox_example.py` includes the example how to declare a model, load a checkpoint, and forward context inputs and get generated outputs in Pytorch.

For generating outputs based on context inputs, create a text file including the context inputs (line by line) and set `--sample_input_file` to the text file path. (By default, the script will generate outputs without context inputs.)

Run with `-h` to see more settings.

Run GPT with TP and PP on single node. Note that the number of processes must equal to `tensor_para_size * pipeline_para_size`.

```bash
# No parallelism (tensor_para_size=1, pipeline_para_size=1)
python ../examples/pytorch/gptneox/gptneox_example.py
# TP (tensor_para_size=2, pipeline_para_size=1)
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=2 --pipeline_para_size=1 --ckpt_path="/path/to/your/model/2-gpu"
# LP (tensor_para_size=1, pipeline_para_size=2)
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=1 --pipeline_para_size=2 --ckpt_path="/path/to/your/model/1-gpu"
# TP and LP (tensor_para_size=2, pipeline_para_size=2)
mpirun -n 4 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=2 --pipeline_para_size=2 --ckpt_path="/path/to/your/model/2-gpu"
```

```bash
wget https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json
../examples/pytorch/gptneox/utils/hftokenizer.py out --tokenizer 20B_tokenizer.json
```
<!-- This converter only works for customed checkpoint -->
<!-- ### Run GPT-NeoX with prompts

examples/cpp/multi_gpu_gpt/gpt_example_utils.cc (1 addition, 1 deletion)

@@ -430,7 +430,7 @@ void populate_request(std::unordered_map<std::string, Tensor>& input_tensors,
}

if (request_config.is_return_context_embeddings) {
deviceMalloc(&output_context_embeddings, request_batch_size * model_config.hidden_units);
deviceMalloc(&output_context_embeddings, request_batch_size * beam_width * model_config.hidden_units);
output_tensors.insert({"context_embeddings",
{MEMORY_GPU,
TYPE_FP32,
examples/pytorch/gpt/utils/huggingface_bloom_convert.py (166 additions, 26 deletions)

@@ -20,11 +20,12 @@
import configparser
import logging
import multiprocessing
import os
import re
import time

from pathlib import Path
from typing import Optional, Union
from typing import Dict, List, Optional, Union

import numpy as np
import torch
@@ -77,6 +78,9 @@ def get_args():
parser.add_argument(
'-v', '--verbose', action='store_true',
help='Enable verbose logging')
parser.add_argument(
'-s', '--by-shard', action='store_true',
help='Process shard by shard, enable when converting big model like bloom 175B')
_args = parser.parse_args()

set_logger(_args.verbose)
@@ -301,40 +305,176 @@ def save_bloom_config(model_config: BloomConfig, save_dir: PathLike):
config.write(f, space_around_delimiters=False)


def load_state_dict(file_path: Path, dtype: torch.dtype) -> Dict[str, torch.Tensor]:
""" Load weights from model file
`safetensors` or `pytorch binary` is supported
# Args.
file_path: model file path, ends with .bin or .safetensors.
dtype: torch.dtype, data type.
# Returns.
Dict[str, torch.Tensor]
"""

state_dict = {}
if file_path.suffix == ".safetensors":
# load from safetensors file
from safetensors import safe_open
with safe_open(file_path, framework="pt", device="cpu") as f:
for k in f.keys():
state_dict[k] = f.get_tensor(k).type(dtype)
else:
# load from pytorch bin file
state_dict = torch.load(file_path, map_location="cpu")
for k in state_dict:
state_dict[k] = state_dict[k].type(dtype)
return state_dict


def get_model_files(model_name: str) -> List[Path]:
""" List all model files that you want to load and convert
# Args.
model_name: name(like `bigscience/bloom`) or local directory of the model
# Returns.
List[Path] model file paths
"""

import glob
from huggingface_hub import try_to_load_from_cache

model_dir = model_name

# get the local model directory
try:
config_file = "config.json"
# will fall back to HUGGINGFACE_HUB_CACHE
config_path = try_to_load_from_cache(
model_name, config_file, cache_dir=os.getenv("TRANSFORMERS_CACHE")
)

if config_path is not None:
# treat the model name as an huggingface model path
model_dir = os.path.dirname(config_path)
except:
# treat the model name as an explicit model path
pass

model_files = glob.glob(model_dir + "/*.bin")
try:
from safetensors import safe_open as _

st_files = glob.glob(model_dir + "/*.safetensors")
if st_files:
model_files = st_files
logger.info("loading from safetensors format")
except ImportError:
logger.info("loading from pytorch bin format")

if not model_files:
raise FileNotFoundError('model files not found')

logger.info(f"model file num: {len(model_files)}")
return [Path(i) for i in model_files]


def process_by_model_param(model_id: str, dtype: torch.dtype, tp_size: int, save_dir: Path, nproc: int):
""" Process conversion parameter by parameter.
"""

# init bloom config
model_config = BloomConfig.from_pretrained(model_id)
# list all model files
model_files = get_model_files(model_id)
# save bloom config to output dir
save_bloom_config(model_config, save_dir)

if nproc > 1:
pool = multiprocessing.Pool(nproc)
star_args = []
for model_file in model_files:
state_dict = load_state_dict(model_file, dtype)
for name in state_dict:
param = state_dict[name]
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model_config, param_name, param)
star_args.append((param_name, param.detach().cpu().numpy(), tp_size, save_dir))
pool.starmap_async(convert_and_save_parameter, star_args)
pool.close()
pool.join()
else:
for model_file in model_files:
state_dict = load_state_dict(model_file, dtype)
for name in state_dict:
param = state_dict[name]
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model_config, param_name, param)
convert_and_save_parameter(param_name, param.detach().cpu().numpy(), tp_size, save_dir)


def _process_by_model_shard(model_config, model_file, dtype: torch.dtype, tp_size: int, save_dir: Path):
state_dict = load_state_dict(model_file, dtype)
for name in state_dict:
param = state_dict[name]
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model_config, param_name, param)
convert_and_save_parameter(param_name, param.detach().cpu().numpy(), tp_size, save_dir)


def process_by_model_shard(model_id: str, dtype: torch.dtype, tp_size: int, save_dir: Path, nproc: int):
""" Process conversion shard by shard.
Benchmarks @ 64C(Intel Xeon 6326 2.90GH) x 756G:
| model | format | by-shard | nproc | elapsed(s) | mem |
|------------|------------------|----------|-------|------------|------|
| bloom-175b | safetensors x 72 | NO | 8 | 1516.66 | 350G |
| bloom-175b | safetensors x 72 | YES | 8 | 1165.03 | 50G |
| bloom-175b | safetensors x 72 | YES | 24 | 494.81 | 150G |
"""

# init bloom config
model_config = BloomConfig.from_pretrained(model_id)
# list all model files
model_files = get_model_files(model_id)
# save bloom config to output dir
save_bloom_config(model_config, save_dir)

if nproc > 1:
pool = multiprocessing.Pool(nproc)
star_args = []
for model_file in model_files:
star_args.append((model_config, model_file, dtype, tp_size, save_dir))
pool.starmap_async(_process_by_model_shard, star_args)
pool.close()
pool.join()
else:
for model_file in model_files:
_process_by_model_shard(model_config, model_file, dtype, tp_size, save_dir)


def main():
start_time = time.time()
args = get_args()
tp_size = args.tensor_para_size

dtype = DATATYPE_MAP[args.data_type]
model = AutoModel.from_pretrained(args.input_dir).cpu().type(dtype)
assert isinstance(model, torch.nn.Module)

save_dir = Path(args.output_dir) / f'{tp_size}-gpu'
save_dir.mkdir(exist_ok=True, parents=True)
save_bloom_config(model.config, save_dir)

start_time = time.time()
logger.info(f'Start the checkpoint conversion: '
f'{len(list(model.parameters()))} params')
if args.processes > 1:
pool = multiprocessing.Pool(args.processes)
star_args = []
for name, param in model.named_parameters():
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model.config, param_name, param)
star_args.append((param_name, param.detach().cpu().numpy(), tp_size, save_dir))
pool.starmap_async(convert_and_save_parameter, star_args)
pool.close()
pool.join()
if args.by_shard:
process_by_model_shard(args.input_dir, dtype, tp_size, save_dir, args.processes)
else:
for name, param in model.named_parameters():
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model.config, param_name, param)
convert_and_save_parameter(param_name, param.detach().cpu().numpy(), tp_size, save_dir)
process_by_model_param(args.input_dir, dtype, tp_size, save_dir, args.processes)

elapsed_time = time.time() - start_time
logger.info(f'Checkpoint conversion (HF >> FT) has done '
f'(elapsed time: {elapsed_time:.2f} sec)')