
smooth-quant with tp=2, and build llama 7b with tp=2, pp=2 failed #267

@forrestjgq

Description


I want to build the llama-7b-hf model with TP size 2 and PP size 2, with smooth-quant. Here is the procedure:

python3 hf_llama_convert.py -i /data/jgq/lmmodels/7b/ \
    -o /data/jgq/lmmodels/trt_engines/7b/sq0.8/2-gpu \
    -sq 0.8 \
    --tensor-parallelism 2 \
    --storage-type fp16

Then I build it with:

python build.py --ft_model_dir /data/jgq/lmmodels/trt_engines/7b/sq0.8/2-gpu \
    --dtype float16 \
    --remove_input_padding \
    --enable_context_fmha \
    --output_dir /data/jgq/lmmodels/trt_engines/7b/sq/b32_i1024_o2048/4-gpu \
    --max_batch_size 32 \
    --max_input_len 1024 \
    --max_output_len 2048 \
    --world_size 4 \
    --tp_size 2 \
    --pp_size 2 \
    --use_smooth_quant \
    --per_token \
    --per_channel
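
For reference, this is the parallel layout I expect these flags to produce (a minimal sketch assuming a tp-major rank grouping, i.e. pp_rank = rank // tp_size; this grouping is an assumption for illustration, not taken from the TensorRT-LLM code):

tp_size, pp_size = 2, 2
world_size = tp_size * pp_size              # matches --world_size 4
num_layers = 32                             # llama-7b-hf has 32 decoder layers
layers_per_pp_rank = num_layers // pp_size  # 16 layers built per pipeline stage

for rank in range(world_size):
    pp_rank, tp_rank = rank // tp_size, rank % tp_size  # assumed grouping
    first = pp_rank * layers_per_pp_rank
    print(f"rank {rank}: tp_rank={tp_rank}, pp_rank={pp_rank}, "
          f"layers {first}..{first + layers_per_pp_rank - 1}")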

This leads to the following failure:

[11/03/2023-07:15:08] [TRT-LLM] [I] Loading weights from FT...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/jgq/trtllm/examples/llama/build.py:718 in <module>                                         │
│                                                                                                  │
│   715 │   else:                                                                                  │
│   716 │   │   args.parallel_build = False                                                        │
│   717 │   │   logger.info('Serially build TensorRT engines.')                                    │
│ ❱ 718 │   │   build(0, args)                                                                     │
│   719 │                                                                                          │
│   720 │   tok = time.time()                                                                      │
│   721 │   t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))                                  │
│                                                                                                  │
│ /data/jgq/trtllm/examples/llama/build.py:689 in build                                            │
│                                                                                                  │
│   686 │   │   │   opt_level=args.builder_opt)                                                    │
│   687 │   │   engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size,                │
│   688 │   │   │   │   │   │   │   │   │     args.pp_size, cur_rank)                              │
│ ❱ 689 │   │   engine = build_rank_engine(builder, builder_config, engine_name,                   │
│   690 │   │   │   │   │   │   │   │      cur_rank, args)                                         │
│   691 │   │   assert engine is not None, f'Failed to build engine for rank {cur_rank}'           │
│   692                                                                                            │
│                                                                                                  │
│ /data/jgq/trtllm/examples/llama/build.py:569 in build_rank_engine                                │
│                                                                                                  │
│   566 │   │   │   │   │   │      dtype=args.dtype)                                               │
│   567 │   │   del hf_llama                                                                       │
│   568 │   elif args.ft_model_dir is not None:                                                    │
│ ❱ 569 │   │   load_from_binary(tensorrt_llm_llama,                                               │
│   570 │   │   │   │   │   │    args.ft_model_dir,                                                │
│   571 │   │   │   │   │   │    mapping,                                                          │
│   572 │   │   │   │   │   │    fp16=(args.dtype == 'float16'),                                   │
│                                                                                                  │
│ /data/jgq/trtllm/examples/llama/weight.py:671 in load_from_binary                                │
│                                                                                                  │
│    668 │   │   │   │   n_embd // mapping.tp_size +                                               │
│    669 │   │   │   │   (n_embd // n_head * n_groups) // mapping.tp_size * 2)                     │
│    670 │   │   idx = i - mapping.pp_rank * tensorrt_llm_llama.num_layers                         │
│ ❱  671 │   │   tensorrt_llm_llama.layers[idx].input_layernorm.weight.value = (fromfile(          │
│    672 │   │   │   dir_path, 'model.layers.' + str(i) + '.input_layernorm.weight.bin'))          │
│    673 │   │   t = fromfile(                                                                     │
│    674 │   │   │   dir_path, 'model.layers.' + str(i) +                                          │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py:171 in __getitem__                │
│                                                                                                  │
│   168 │   │   if isinstance(idx, slice):                                                         │
│   169 │   │   │   return self.__class__(list(self._modules.values())[idx])                       │
│   170 │   │   else:                                                                              │
│ ❱ 171 │   │   │   return self._modules[self._get_abs_string_index(idx)]                          │
│   172 │                                                                                          │
│   173 │   def __setitem__(self, idx, module) -> None:                                            │
│   174 │   │   idx = self._get_abs_string_index(idx)                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py:162 in _get_abs_string_index      │
│                                                                                                  │
│   159 │   │   """Get the absolute index for the list of modules"""                               │
│   160 │   │   idx = operator.index(idx)                                                          │
│   161 │   │   if not (-len(self) <= idx < len(self)):                                            │
│ ❱ 162 │   │   │   raise IndexError('index {} is out of range'.format(idx))                       │
│   163 │   │   if idx < 0:                                                                        │
│   164 │   │   │   idx += len(self)                                                               │
│   165 │   │   return str(idx)                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: index 16 is out of range

Note that if pipeline parallelism is not applied (pp_size=1), the build succeeds.
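
The out-of-range index looks consistent with load_from_binary iterating over all 32 global layer indices while each pipeline stage only instantiates num_layers // pp_size = 16 layer modules. A minimal sketch of the arithmetic (variable names are illustrative, not the actual weight.py code):

num_layers_total = 32                                # llama-7b-hf
pp_size, pp_rank = 2, 0                              # first pipeline stage
layers_on_this_rank = num_layers_total // pp_size    # 16 entries in layers[]

for i in range(num_layers_total):                    # loop over all 32 global layers
    idx = i - pp_rank * layers_on_this_rank
    if not (0 <= idx < layers_on_this_rank):
        # Without a bounds check like this, i == 16 on pp_rank 0 gives
        # idx == 16, i.e. the "index 16 is out of range" error above.
        continue
    # ... load model.layers.{i}.* into layers[idx] ...

With pp_size=1 the subtraction is a no-op and idx always stays within range, which matches the observation above.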

Labels

Low Precision, triaged
