Merge with main (#1)
* Update beam_search_topk_kernels.cu

fix: fix a bug in beam search

* fix: change int to int64_t in some kernels to prevent overflow
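
  For a sense of scale, the quick Python check below (with hypothetical shapes, not taken from the repository) shows how the element count of a large attention buffer exceeds the 32-bit signed integer range, which is why widening the index type to int64_t matters:

  ```python
  # Illustrative only: why 32-bit index arithmetic can overflow for large
  # attention buffers holding batch * heads * seq_len * seq_len elements.
  INT32_MAX = 2**31 - 1

  batch_size, head_num, seq_len = 8, 32, 4096        # hypothetical shapes
  elements = batch_size * head_num * seq_len * seq_len

  print(f"elements        = {elements:,}")            # 4,294,967,296
  print(f"INT32_MAX       = {INT32_MAX:,}")           # 2,147,483,647
  print(f"overflows int32 = {elements > INT32_MAX}")  # True
  ```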

* fix: gpt tensor shapes inconsistency (NVIDIA#505)

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Update gpt_guide.md (NVIDIA#529)

* fix: fix a GPT buffer bug and a GPT GEMM overflow

* Update T5DecodingWeight.cc

fix: fix a loading bug in T5

* [Enhancement]add pytorch backend support for gptneox (NVIDIA#550)

* add pytorch backend support for gptneox

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* fix invalid early stopping

* 1) Removed some unused parameters and logic. 2) Reverted revisions that would affect pipeline parallelism. 3) Made the code capable of direct validation on TabbyML/NeoX-1.3B.

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Rename classes, removing 'parallel' from their names

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Format the code.

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Only print results when rank is 0.

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Add dist.init_process_group().

Signed-off-by: AkiyamaYummy <842720660@qq.com>
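
A minimal sketch of the pattern described by the two bullets above ("Only print results when rank is 0." and "Add dist.init_process_group()."), using the standard `torch.distributed` API: initialize the process group once, then gate printing on rank 0 so output is not duplicated across tensor/pipeline parallel ranks. The helper function and the backend choice below are illustrative assumptions, not the repository's example script.

```python
# Sketch of "init_process_group + print only on rank 0"; illustrative, not
# the actual gptneox_example.py.
import torch.distributed as dist


def run_inference():
    # Placeholder for the model forward pass in the real example script.
    return ["generated text"]


def main():
    # Initialize once per process. The "mpi" backend is an assumption and
    # requires an MPI-enabled PyTorch build; "nccl"/"gloo" also work with
    # the usual MASTER_ADDR/MASTER_PORT environment variables.
    if not dist.is_initialized():
        dist.init_process_group(backend="mpi")

    outputs = run_inference()

    # Every rank runs the model, but only rank 0 prints the results.
    if dist.get_rank() == 0:
        print(outputs)


if __name__ == "__main__":
    main()
```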

* update docs

Signed-off-by: AkiyamaYummy <842720660@qq.com>

---------

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* Update cublasMMWrapper.cc

Fix the CUBLAS_VERSION check in cublasMMWrapper

* Update cublasMMWrapper.cc

* fix overflow in softmax_kernel when processing long seqlen and big batch_size (NVIDIA#524)

* Update unfused_attention_kernels.cu

fix a bug in the softmax kernel

* [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569)

* create huggingface_gptneox_convert.py

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* adjust handling of HF's multiple bin files

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* update gptneox_guide.md

Signed-off-by: AkiyamaYummy <842720660@qq.com>

---------

Signed-off-by: AkiyamaYummy <842720660@qq.com>

* perf(bloom): improve performance of huggingface_bloom_convert.py, decreasing the time cost and memory usage (NVIDIA#568)

Co-authored-by: r.yang <r.yang@tianrang-inc.com>

* Fix/gpt early stop (NVIDIA#584)

* fix: fix a bug in GPT early stopping

* [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672)

FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results.
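
As a rough sketch of the algorithm described above (simulated with NumPy over in-memory buffers, not the FasterTransformer CUDA kernel): the first shot is a reduce-scatter, after which device d owns the reduced copy of chunk d, and the second shot is an all-gather in which every device collects each chunk from its owning device. A wrong offset in that gather step is the kind of bug that leaves only device 0 with complete results.

```python
# Two-shot all-reduce simulated over per-"device" NumPy buffers:
# reduce-scatter, then all-gather. Illustrative, not the FT kernel.
import numpy as np


def two_shot_all_reduce(buffers):
    """buffers: one equal-length 1-D array per simulated device."""
    n_dev = len(buffers)
    chunks = [np.array_split(b, n_dev) for b in buffers]

    # Shot 1 (reduce-scatter): device d ends up owning the reduced chunk d.
    reduced = [sum(chunks[src][d] for src in range(n_dev)) for d in range(n_dev)]

    # Shot 2 (all-gather): every device pulls chunk `owner` from device `owner`.
    # Indexing by the owning device here is the step the fix above concerns.
    return [np.concatenate([reduced[owner] for owner in range(n_dev)])
            for _ in range(n_dev)]


# Example: 4 devices, each starting with a different vector.
devices = [np.arange(8, dtype=np.float32) + r for r in range(4)]
result = two_shot_all_reduce(devices)
assert all(np.allclose(out, sum(devices)) for out in result)  # all ranks agree
```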

* fix: swap tensor bug (NVIDIA#683)

* Support size_per_head=112 (NVIDIA#660)

* fix multi-gpu build

* add support for size_per_head=112 for gpt decoder

* remove mpi_cxx from multi-gpu build for now (NVIDIA#705)

---------

Signed-off-by: AkiyamaYummy <842720660@qq.com>
Co-authored-by: byshiue <bhsueh@nvidia.com>
Co-authored-by: _yummy_ <842720660@qq.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Co-authored-by: 杨睿 <595403043@qq.com>
Co-authored-by: r.yang <r.yang@tianrang-inc.com>
Co-authored-by: Rahul Kindi <rkindi@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Daya Khudia <37562707+dskhudia@users.noreply.github.com>
Co-authored-by: Dean Wyatte <2512762+dwyatte@users.noreply.github.com>
Authored by 11 people on Jul 11, 2023
Parent: 303e052 · Commit: 743369a
Showing 44 changed files with 1,867 additions and 170 deletions.
README.md (4 additions, 0 deletions)

@@ -61,6 +61,7 @@ FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
@@ -212,6 +213,9 @@ In the experiments of decoding, we updated the following parameters:

### Changelog

May 2023
- Fix bugs of generation early stopping

January 2023
- Support GPT MoE
- Support FP8 for Bert and GPT (**Experimental**)
docs/gpt_guide.md (1 addition, 1 deletion)

@@ -458,7 +458,7 @@ python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o .
2. Run GPT on PyTorch
Basically, `gpt_example.py` includes the example how to declare a model, load a ckeckpoint, and forward context inputs and get generated outputs in Pytorch.
Basically, `gpt_example.py` includes the example how to declare a model, load a checkpoint, and forward context inputs and get generated outputs in Pytorch.
For generating outputs based on context inputs, create a text file including the context inputs (line by line) and set `--sample_file_input` to the text file path. (By default, the script will generate outputs without context inputs.) Set `--sample_file_output` to write the outputs to a file. Use `--data_type fp16/bf16` to run in FP16 or BF16.
docs/gptneox_guide.md (49 additions, 7 deletions)

@@ -36,6 +36,7 @@ We provide the environment variables to tune for specific usage.

* Checkpoint converter
* EleutherAI
* HuggingFace
* Data type
* FP32
* FP16
@@ -46,7 +47,7 @@ We provide the environment variables to tune for specific usage.
* Bad words list
* Beam search and sampling are both supported

## Setup
## Setup from EleutherAI checkpoint

### Requirements

@@ -72,6 +73,22 @@ You may download the tokenizer config [here](https://mystic.the-eye.eu/public/AI

To tokenize/detokenize files, use the script found in `examples/pytorch/gptneox/utils/hftokenizer.py`. You may need to pass the path to the tokenizer config with the `--tokenizer` flag.

## Setup from HuggingFace checkpoint

> Please checkout https://huggingface.co/docs to learn more about the usage of the huggingface models and tokenizers.
First download a huggingface checkpoint:

```bash
git lfs clone https://huggingface.co/<MODEL_GROUP>/<MODEL_NAME>
```

Then use the script provided by FasterTransformer to convert the checkpoint to raw weights, understood by FT. You can change `-i_g` to specify the tensor parallelism size.

```bash
python ../examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py -i ../path/to/your/model -o ../../path/to/fastertransformer/model -i_g 1 -m_n gptneox
```

### Run GPT-NeoX

* Generate the `gemm_config.in` file.\
@@ -89,14 +106,39 @@ To tokenize/detokenize files, use the script found in `examples/pytorch/gptneox/
mpirun -n 2 --allow-run-as-root ./bin/gptneox_example
```

E.g. by setting the `data_type` of `gptneox_config.ini` to `fp16`, users can run gpt model under fp16.
E.g. by setting the `data_type` of `gptneox_config.ini` to `fp16`, users can run gpt model under fp16.

You can then decode the `out` file with the tokenizer:

You can then decode the `out` file with the tokenizer:
```bash
wget https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json
../examples/pytorch/gptneox/utils/hftokenizer.py out --tokenizer 20B_tokenizer.json
```

* Run GPT on PyTorch

Basically, `gptneox_example.py` includes the example how to declare a model, load a checkpoint, and forward context inputs and get generated outputs in Pytorch.

For generating outputs based on context inputs, create a text file including the context inputs (line by line) and set `--sample_input_file` to the text file path. (By default, the script will generate outputs without context inputs.)

Run with `-h` to see more settings.

Run GPT with TP and PP on single node. Note that the number of processes must equal to `tensor_para_size * pipeline_para_size`.

```bash
# No parallelism (tensor_para_size=1, pipeline_para_size=1)
python ../examples/pytorch/gptneox/gptneox_example.py
# TP (tensor_para_size=2, pipeline_para_size=1)
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=2 --pipeline_para_size=1 --ckpt_path="/path/to/your/model/2-gpu"
# LP (tensor_para_size=1, pipeline_para_size=2)
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=1 --pipeline_para_size=2 --ckpt_path="/path/to/your/model/1-gpu"
# TP and LP (tensor_para_size=2, pipeline_para_size=2)
mpirun -n 4 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=2 --pipeline_para_size=2 --ckpt_path="/path/to/your/model/2-gpu"
```

```bash
wget https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json
../examples/pytorch/gptneox/utils/hftokenizer.py out --tokenizer 20B_tokenizer.json
```
<!-- This converter only works for customed checkpoint -->
<!-- ### Run GPT-NeoX with prompts

examples/cpp/multi_gpu_gpt/gpt_example_utils.cc (1 addition, 1 deletion)

@@ -430,7 +430,7 @@ void populate_request(std::unordered_map<std::string, Tensor>& input_tensors,
}

if (request_config.is_return_context_embeddings) {
deviceMalloc(&output_context_embeddings, request_batch_size * model_config.hidden_units);
deviceMalloc(&output_context_embeddings, request_batch_size * beam_width * model_config.hidden_units);
output_tensors.insert({"context_embeddings",
{MEMORY_GPU,
TYPE_FP32,
examples/pytorch/gpt/utils/huggingface_bloom_convert.py (166 additions, 26 deletions)

@@ -20,11 +20,12 @@
import configparser
import logging
import multiprocessing
import os
import re
import time

from pathlib import Path
from typing import Optional, Union
from typing import Dict, List, Optional, Union

import numpy as np
import torch
@@ -77,6 +78,9 @@ def get_args():
parser.add_argument(
'-v', '--verbose', action='store_true',
help='Enable verbose logging')
parser.add_argument(
'-s', '--by-shard', action='store_true',
help='Process shard by shard, enable when converting big model like bloom 175B')
_args = parser.parse_args()

set_logger(_args.verbose)
@@ -301,40 +305,176 @@ def save_bloom_config(model_config: BloomConfig, save_dir: PathLike):
config.write(f, space_around_delimiters=False)


def load_state_dict(file_path: Path, dtype: torch.dtype) -> Dict[str, torch.Tensor]:
""" Load weights from model file
`safetensors` or `pytorch binary` is supported
# Args.
file_path: model file path, ends with .bin or .safetensors.
dtype: torch.dtype, data type.
# Returns.
Dict[str, torch.Tensor]
"""

state_dict = {}
if file_path.suffix == ".safetensors":
# load from safetensors file
from safetensors import safe_open
with safe_open(file_path, framework="pt", device="cpu") as f:
for k in f.keys():
state_dict[k] = f.get_tensor(k).type(dtype)
else:
# load from pytorch bin file
state_dict = torch.load(file_path, map_location="cpu")
for k in state_dict:
state_dict[k] = state_dict[k].type(dtype)
return state_dict


def get_model_files(model_name: str) -> List[Path]:
""" List all model files that you want to load and convert
# Args.
model_name: name(like `bigscience/bloom`) or local directory of the model
# Returns.
List[Path] model file paths
"""

import glob
from huggingface_hub import try_to_load_from_cache

model_dir = model_name

# get the local model directory
try:
config_file = "config.json"
# will fall back to HUGGINGFACE_HUB_CACHE
config_path = try_to_load_from_cache(
model_name, config_file, cache_dir=os.getenv("TRANSFORMERS_CACHE")
)

if config_path is not None:
# treat the model name as an huggingface model path
model_dir = os.path.dirname(config_path)
except:
# treat the model name as an explicit model path
pass

model_files = glob.glob(model_dir + "/*.bin")
try:
from safetensors import safe_open as _

st_files = glob.glob(model_dir + "/*.safetensors")
if st_files:
model_files = st_files
logger.info("loading from safetensors format")
except ImportError:
logger.info("loading from pytorch bin format")

if not model_files:
raise FileNotFoundError('model files not found')

logger.info(f"model file num: {len(model_files)}")
return [Path(i) for i in model_files]


def process_by_model_param(model_id: str, dtype: torch.dtype, tp_size: int, save_dir: Path, nproc: int):
""" Process conversion parameter by parameter.
"""

# init bloom config
model_config = BloomConfig.from_pretrained(model_id)
# list all model files
model_files = get_model_files(model_id)
# save bloom config to output dir
save_bloom_config(model_config, save_dir)

if nproc > 1:
pool = multiprocessing.Pool(nproc)
star_args = []
for model_file in model_files:
state_dict = load_state_dict(model_file, dtype)
for name in state_dict:
param = state_dict[name]
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model_config, param_name, param)
star_args.append((param_name, param.detach().cpu().numpy(), tp_size, save_dir))
pool.starmap_async(convert_and_save_parameter, star_args)
pool.close()
pool.join()
else:
for model_file in model_files:
state_dict = load_state_dict(model_file, dtype)
for name in state_dict:
param = state_dict[name]
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model_config, param_name, param)
convert_and_save_parameter(param_name, param.detach().cpu().numpy(), tp_size, save_dir)


def _process_by_model_shard(model_config, model_file, dtype: torch.dtype, tp_size: int, save_dir: Path):
state_dict = load_state_dict(model_file, dtype)
for name in state_dict:
param = state_dict[name]
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model_config, param_name, param)
convert_and_save_parameter(param_name, param.detach().cpu().numpy(), tp_size, save_dir)


def process_by_model_shard(model_id: str, dtype: torch.dtype, tp_size: int, save_dir: Path, nproc: int):
""" Process conversion shard by shard.
Benchmarks @ 64C(Intel Xeon 6326 2.90GH) x 756G:
| model | format | by-shard | nproc | elapsed(s) | mem |
|------------|------------------|----------|-------|------------|------|
| bloom-175b | safetensors x 72 | NO | 8 | 1516.66 | 350G |
| bloom-175b | safetensors x 72 | YES | 8 | 1165.03 | 50G |
| bloom-175b | safetensors x 72 | YES | 24 | 494.81 | 150G |
"""

# init bloom config
model_config = BloomConfig.from_pretrained(model_id)
# list all model files
model_files = get_model_files(model_id)
# save bloom config to output dir
save_bloom_config(model_config, save_dir)

if nproc > 1:
pool = multiprocessing.Pool(nproc)
star_args = []
for model_file in model_files:
star_args.append((model_config, model_file, dtype, tp_size, save_dir))
pool.starmap_async(_process_by_model_shard, star_args)
pool.close()
pool.join()
else:
for model_file in model_files:
_process_by_model_shard(model_config, model_file, dtype, tp_size, save_dir)


def main():
start_time = time.time()
args = get_args()
tp_size = args.tensor_para_size

dtype = DATATYPE_MAP[args.data_type]
model = AutoModel.from_pretrained(args.input_dir).cpu().type(dtype)
assert isinstance(model, torch.nn.Module)

save_dir = Path(args.output_dir) / f'{tp_size}-gpu'
save_dir.mkdir(exist_ok=True, parents=True)
save_bloom_config(model.config, save_dir)

start_time = time.time()
logger.info(f'Start the checkpoint conversion: '
f'{len(list(model.parameters()))} params')
if args.processes > 1:
pool = multiprocessing.Pool(args.processes)
star_args = []
for name, param in model.named_parameters():
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model.config, param_name, param)
star_args.append((param_name, param.detach().cpu().numpy(), tp_size, save_dir))
pool.starmap_async(convert_and_save_parameter, star_args)
pool.close()
pool.join()
if args.by_shard:
process_by_model_shard(args.input_dir, dtype, tp_size, save_dir, args.processes)
else:
for name, param in model.named_parameters():
# Preprocess
param_name = convert_parameter_name(name)
param = safe_transpose(param)
param = handle_exceptions(model.config, param_name, param)
convert_and_save_parameter(param_name, param.detach().cpu().numpy(), tp_size, save_dir)
process_by_model_param(args.input_dir, dtype, tp_size, save_dir, args.processes)

elapsed_time = time.time() - start_time
logger.info(f'Checkpoint conversion (HF >> FT) has done '
f'(elapsed time: {elapsed_time:.2f} sec)')