Build AWQ format codellama-13b-base engine failed because of Padding length too large #373

@gesanqiu

I'm working on the main branch and building the quantized model and the engine with the following commands:

root@dell:/workdir/TensorRT-LLM/examples/llama# CUDA_VISIBLE_DEVICES=0 python quantize.py --model_dir /workdir/hf_models/codellama-13b-instruct/ --dtype float16 --qformat int4_awq --export_path ./codellama-13b-instruct-awq-w4a16g128.pt --calib_size 32
root@dell:/workdir/TensorRT-LLM/examples/llama# python build.py --model_dir /workdir/hf_models/codellama-13b-instruct/ --quant_ckpt_path ./codellama-13b-instruct-awq-w4a16g128.pt/llama_tp1_rank0.npz --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --use_inflight_batching --output_dir /workdir/trt_llm_models/codellama-13b-instruct/int4_awq_inflight/1-gpu --max_batch_size 16 --max_input_len 2048 --max_output_len 512 --rotary_base 1000000 --vocab_size 32016
[11/14/2023-04:10:16] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
[11/14/2023-04:10:16] [TRT-LLM] [I] Serially build TensorRT engines.
[11/14/2023-04:10:16] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 132, GPU 272 (MiB)
[11/14/2023-04:10:19] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2067, GPU 584 (MiB)
[11/14/2023-04:10:19] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[11/14/2023-04:10:37] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workdir/TensorRT-LLM/examples/llama/build.py:757 in <module>                                    │
│                                                                                                  │
│   754 │   else:                                                                                  │
│   755 │   │   args.parallel_build = False                                                        │
│   756 │   │   logger.info('Serially build TensorRT engines.')                                    │
│ ❱ 757 │   │   build(0, args)                                                                     │
│   758 │                                                                                          │
│   759 │   tok = time.time()                                                                      │
│   760 │   t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))                                  │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/build.py:707 in build                                       │
│                                                                                                  │
│   704 │   │   )                                                                                  │
│   705 │   │   engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size,                │
│   706 │   │   │   │   │   │   │   │   │     args.pp_size, cur_rank)                              │
│ ❱ 707 │   │   engine = build_rank_engine(builder, builder_config, engine_name,                   │
│   708 │   │   │   │   │   │   │   │      cur_rank, args)                                         │
│   709 │   │   assert engine is not None, f'Failed to build engine for rank {cur_rank}'           │
│   710                                                                                            │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/build.py:545 in build_rank_engine                           │
│                                                                                                  │
│   542 │   │   │   │   │   │   │   │   │   │   **quantize_kwargs)                                 │
│   543 │   if args.per_group:                                                                     │
│   544 │   │   load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else   │
│ ❱ 545 │   │   load_func(tensorrt_llm_llama=tensorrt_llm_llama,                                   │
│   546 │   │   │   │     quant_ckpt_path=args.quant_ckpt_path,                                    │
│   547 │   │   │   │     mapping=mapping,                                                         │
│   548 │   │   │   │     dtype=args.dtype,                                                        │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/weight.py:1173 in load_from_awq_llama                       │
│                                                                                                  │
│   1170 │   v = [load(awq_key_list[1] + suf) for suf in awq_suffix_list]                          │
│   1171 │   if v[0].shape[0] % 64 != 0:                                                           │
│   1172 │   │   v[0] = torch.nn.functional.pad(v[0], [0, 0, 0, 64 - v[0].shape[0] % 64])          │
│ ❱ 1173 │   │   v[1] = torch.nn.functional.pad(v[1], [0, 0, 0, 64 - v[1].shape[0] % 64],          │
│   1174 │   │   │   │   │   │   │   │   │      value=1)                                           │
│   1175 │   if mapping.is_last_pp_rank():                                                         │
│   1176 │   │   process_and_assign_weight(tensorrt_llm_llama.lm_head, v, 1)                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Padding length too large
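
A possible reading of this traceback: line 1172 pads `v[0]` without error, while the same four-element pad list fails on `v[1]` at line 1173. As far as I understand it, `torch.nn.functional.pad` rejects a pad list that has more than two entries per tensor dimension, so this would be consistent with `v[1]` being 1-D in the checkpoint. The sketch below only mimics that rule in plain Python; `pad_fits` is a hypothetical helper, not a PyTorch function, and the 1-D shape of `v[1]` is an assumption I have not verified.

```python
# Hypothetical helper: mirrors the dimensionality rule I believe F.pad applies.
# It is NOT PyTorch code, only an illustration of the check.
def pad_fits(ndim: int, pad: list) -> bool:
    """Return True if a flat pad list is usable on a tensor with `ndim` dimensions."""
    return len(pad) % 2 == 0 and len(pad) <= 2 * ndim

rows = 32032  # vocab size stated for this model; the real v[0].shape[0] may differ
pad = [0, 0, 0, 64 - rows % 64]  # same pattern as weight.py lines 1172-1173

weight_ok = pad_fits(2, pad)  # v[0]: 2-D weight, line 1172 ran without error
scales_ok = pad_fits(1, pad)  # v[1]: ASSUMED 1-D, line 1173 raised

print(pad, weight_ok, scales_ok)
```

If the assumption is right, checking `v[1].dim()` just before line 1173 should confirm it.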

When I build the engine on the release/0.5.0 branch, I hit a different issue (this model is an SFT codellama-13b-base model with a vocab size of 32032):

root@dell:/workdir/TensorRT-LLM/examples/llama# python build.py --model_dir /workdir/hf_models/codellama-13b-instruct/ --quant_ckpt_path ./codellama-13b-instruct-awq-w4a16g128.pt/llama_tp1_rank0.npz --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --use_inflight_batching --output_dir /workdir/trt_llm_models/codellama-13b-instruct/int4_awq_inflight/1-gpu --max_batch_size 16 --max_input_len 2048 --max_output_len 512 --rotary_base 1000000 --vocab_size 32032
[11/14/2023-03:06:33] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
[11/14/2023-03:06:33] [TRT-LLM] [I] Serially build TensorRT engines.
[11/14/2023-03:06:33] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 133, GPU 272 (MiB)
[11/14/2023-03:06:38] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1800, GPU +312, now: CPU 2068, GPU 584 (MiB)
[11/14/2023-03:06:38] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[11/14/2023-03:07:02] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workdir/TensorRT-LLM/examples/llama/build.py:757 in <module>                                    │
│                                                                                                  │
│   754 │   else:                                                                                  │
│   755 │   │   args.parallel_build = False                                                        │
│   756 │   │   logger.info('Serially build TensorRT engines.')                                    │
│ ❱ 757 │   │   build(0, args)                                                                     │
│   758 │                                                                                          │
│   759 │   tok = time.time()                                                                      │
│   760 │   t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))                                  │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/build.py:707 in build                                       │
│                                                                                                  │
│   704 │   │   )                                                                                  │
│   705 │   │   engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size,                │
│   706 │   │   │   │   │   │   │   │   │     args.pp_size, cur_rank)                              │
│ ❱ 707 │   │   engine = build_rank_engine(builder, builder_config, engine_name,                   │
│   708 │   │   │   │   │   │   │   │      cur_rank, args)                                         │
│   709 │   │   assert engine is not None, f'Failed to build engine for rank {cur_rank}'           │
│   710                                                                                            │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/build.py:545 in build_rank_engine                           │
│                                                                                                  │
│   542 │   │   │   │   │   │   │   │   │   │   **quantize_kwargs)                                 │
│   543 │   if args.per_group:                                                                     │
│   544 │   │   load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else   │
│ ❱ 545 │   │   load_func(tensorrt_llm_llama=tensorrt_llm_llama,                                   │
│   546 │   │   │   │     quant_ckpt_path=args.quant_ckpt_path,                                    │
│   547 │   │   │   │     mapping=mapping,                                                         │
│   548 │   │   │   │     dtype=args.dtype,                                                        │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/weight.py:1176 in load_from_awq_llama                       │
│                                                                                                  │
│   1173 │   │   v[1] = torch.nn.functional.pad(v[1], [0, 0, 0, 64 - v[1].shape[0] % 64],          │
│   1174 │   │   │   │   │   │   │   │   │      value=1)                                           │
│   1175 │   if mapping.is_last_pp_rank():                                                         │
│ ❱ 1176 │   │   process_and_assign_weight(tensorrt_llm_llama.lm_head, v, 1)                       │
│   1177 │                                                                                         │
│   1178 │   # 3. ln_f                                                                             │
│   1179 │   v = load(awq_key_list[2])                                                             │
│                                                                                                  │
│ /workdir/TensorRT-LLM/examples/llama/weight.py:1096 in process_and_assign_weight                 │
│                                                                                                  │
│   1093 │   │   weight = v[0].T.contiguous()                                                      │
│   1094 │   │   [k, n] = weight.shape                                                             │
│   1095 │   │   weight = torch_split(weight, tp_dim)                                              │
│ ❱ 1096 │   │   amax = v[1].reshape((n, k // group_size)).T.contiguous()                          │
│   1097 │   │   amax = torch_split(amax, tp_dim)                                                  │
│   1098 │   │   pre_quant_scale = v[2].reshape((1, k))                                            │
│   1099 │   │   if tp_dim == 0:                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[32064, 40]' is invalid for input of size 1281344
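
The numbers in this error can be reproduced with a little arithmetic, which may help narrow down the cause. The sketch below uses only values visible above (`--vocab_size 32032`, the target shape `[32064, 40]`, and the `64 - shape[0] % 64` padding expression from `weight.py`). The one assumption is that `v[1]` is stored flattened, with `vocab_size * 40` rows; I inferred that from the reported input size and have not checked it against the checkpoint. `pad_amount` is a hypothetical helper name.

```python
# Values taken from the command line and traceback above.
vocab_size = 32032      # --vocab_size passed to build.py
groups_per_row = 40     # k // group_size, from the target shape [32064, 40]
pad_multiple = 64       # padding granularity used in load_from_awq_llama

def pad_amount(rows: int, multiple: int = pad_multiple) -> int:
    """Mirror the expression `64 - rows % 64` from weight.py."""
    return multiple - rows % multiple

# v[0] (weight): one row per vocab entry, padded up to a multiple of 64.
padded_vocab = vocab_size + pad_amount(vocab_size)

# v[1] (scales): ASSUMED flattened to vocab_size * groups_per_row rows.
scale_rows = vocab_size * groups_per_row
padded_scale_rows = scale_rows + pad_amount(scale_rows)

# Element count that reshape((n, k // group_size)) needs.
expected_elems = padded_vocab * groups_per_row

print(padded_vocab)       # 32064
print(padded_scale_rows)  # 1281344
print(expected_elems)     # 1282560
```

Under that assumption, the flattened row count is already a multiple of 64, so `64 - rows % 64` adds a full 64 rows and gives exactly the 1281344 elements from the error, while the reshape to `[32064, 40]` needs 1282560. That would suggest the scales are being padded along the wrong dimension rather than by 32 vocab rows of 40 groups each.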
