Closed
Labels
Low Precision: Lower-precision formats (INT8/INT4/FP8) for TRTLLM quantization (AWQ, GPTQ).
triaged: Issue has been triaged by maintainers
Description
I'm developing on the main branch, and I built the quantized model and the engine with the following commands:
root@dell:/workdir/TensorRT-LLM/examples/llama# CUDA_VISIBLE_DEVICES=0 python quantize.py --model_dir /workdir/hf_models/odellama-13b-instruct/ --dtype float16 --qformat int4_awq --export_path ./codellama-13b-instruct-awq-w4a16g128.pt --calib_size 32
root@dell:/workdir/TensorRT-LLM/examples/llama# python build.py --model_dir /workdir/hf_models/codellama-13b-instruct/ --quant_ckpt_path ./codellama-13b-instruct-awq-w4a16g128.pt/llama_tp1_rank0.npz --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --use_inflight_batching --output_dir /workdir/trt_llm_models/codellama-13b-instruct/int4_awq_inflight/1-gpu --max_batch_size 16 --max_input_len 2048 --max_output_len 512 --rotary_base 1000000 --vocab_size 32016
[11/14/2023-04:10:16] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
[11/14/2023-04:10:16] [TRT-LLM] [I] Serially build TensorRT engines.
[11/14/2023-04:10:16] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 132, GPU 272 (MiB)
[11/14/2023-04:10:19] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2067, GPU 584 (MiB)
[11/14/2023-04:10:19] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[11/14/2023-04:10:37] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workdir/TensorRT-LLM/examples/llama/build.py:757 in <module> │
│ │
│ 754 │ else: │
│ 755 │ │ args.parallel_build = False │
│ 756 │ │ logger.info('Serially build TensorRT engines.') │
│ ❱ 757 │ │ build(0, args) │
│ 758 │ │
│ 759 │ tok = time.time() │
│ 760 │ t = time.strftime('%H:%M:%S', time.gmtime(tok - tik)) │
│ │
│ /workdir/TensorRT-LLM/examples/llama/build.py:707 in build │
│ │
│ 704 │ │ ) │
│ 705 │ │ engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size, │
│ 706 │ │ │ │ │ │ │ │ │ args.pp_size, cur_rank) │
│ ❱ 707 │ │ engine = build_rank_engine(builder, builder_config, engine_name, │
│ 708 │ │ │ │ │ │ │ │ cur_rank, args) │
│ 709 │ │ assert engine is not None, f'Failed to build engine for rank {cur_rank}' │
│ 710 │
│ │
│ /workdir/TensorRT-LLM/examples/llama/build.py:545 in build_rank_engine │
│ │
│ 542 │ │ │ │ │ │ │ │ │ │ **quantize_kwargs) │
│ 543 │ if args.per_group: │
│ 544 │ │ load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else │
│ ❱ 545 │ │ load_func(tensorrt_llm_llama=tensorrt_llm_llama, │
│ 546 │ │ │ │ quant_ckpt_path=args.quant_ckpt_path, │
│ 547 │ │ │ │ mapping=mapping, │
│ 548 │ │ │ │ dtype=args.dtype, │
│ │
│ /workdir/TensorRT-LLM/examples/llama/weight.py:1173 in load_from_awq_llama │
│ │
│ 1170 │ v = [load(awq_key_list[1] + suf) for suf in awq_suffix_list] │
│ 1171 │ if v[0].shape[0] % 64 != 0: │
│ 1172 │ │ v[0] = torch.nn.functional.pad(v[0], [0, 0, 0, 64 - v[0].shape[0] % 64]) │
│ ❱ 1173 │ │ v[1] = torch.nn.functional.pad(v[1], [0, 0, 0, 64 - v[1].shape[0] % 64], │
│ 1174 │ │ │ │ │ │ │ │ │ value=1) │
│ 1175 │ if mapping.is_last_pp_rank(): │
│ 1176 │ │ process_and_assign_weight(tensorrt_llm_llama.lm_head, v, 1) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Padding length too large

When I build the engine with the release/0.5.0 branch, I hit a different issue (this model is an SFT codellama-13b-base model with a vocabulary size of 32032):
root@dell:/workdir/TensorRT-LLM/examples/llama# python build.py --model_dir /workdir/hf_models/codellama-13b-instruct/ --quant_ckpt_path ./codellama-13b-instruct-awq-w4a16g128.pt/llama_tp1_rank0.npz --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --use_inflight_batching --output_dir /workdir/trt_llm_models/codellama-13b-instruct/int4_awq_inflight/1-gpu --max_batch_size 16 --max_input_len 2048 --max_output_len 512 --rotary_base 1000000 --vocab_size 32032
[11/14/2023-03:06:33] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
[11/14/2023-03:06:33] [TRT-LLM] [I] Serially build TensorRT engines.
[11/14/2023-03:06:33] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 133, GPU 272 (MiB)
[11/14/2023-03:06:38] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1800, GPU +312, now: CPU 2068, GPU 584 (MiB)
[11/14/2023-03:06:38] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[11/14/2023-03:07:02] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workdir/TensorRT-LLM/examples/llama/build.py:757 in <module> │
│ │
│ 754 │ else: │
│ 755 │ │ args.parallel_build = False │
│ 756 │ │ logger.info('Serially build TensorRT engines.') │
│ ❱ 757 │ │ build(0, args) │
│ 758 │ │
│ 759 │ tok = time.time() │
│ 760 │ t = time.strftime('%H:%M:%S', time.gmtime(tok - tik)) │
│ │
│ /workdir/TensorRT-LLM/examples/llama/build.py:707 in build │
│ │
│ 704 │ │ ) │
│ 705 │ │ engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size, │
│ 706 │ │ │ │ │ │ │ │ │ args.pp_size, cur_rank) │
│ ❱ 707 │ │ engine = build_rank_engine(builder, builder_config, engine_name, │
│ 708 │ │ │ │ │ │ │ │ cur_rank, args) │
│ 709 │ │ assert engine is not None, f'Failed to build engine for rank {cur_rank}' │
│ 710 │
│ │
│ /workdir/TensorRT-LLM/examples/llama/build.py:545 in build_rank_engine │
│ │
│ 542 │ │ │ │ │ │ │ │ │ │ **quantize_kwargs) │
│ 543 │ if args.per_group: │
│ 544 │ │ load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else │
│ ❱ 545 │ │ load_func(tensorrt_llm_llama=tensorrt_llm_llama, │
│ 546 │ │ │ │ quant_ckpt_path=args.quant_ckpt_path, │
│ 547 │ │ │ │ mapping=mapping, │
│ 548 │ │ │ │ dtype=args.dtype, │
│ │
│ /workdir/TensorRT-LLM/examples/llama/weight.py:1176 in load_from_awq_llama │
│ │
│ 1173 │ │ v[1] = torch.nn.functional.pad(v[1], [0, 0, 0, 64 - v[1].shape[0] % 64], │
│ 1174 │ │ │ │ │ │ │ │ │ value=1) │
│ 1175 │ if mapping.is_last_pp_rank(): │
│ ❱ 1176 │ │ process_and_assign_weight(tensorrt_llm_llama.lm_head, v, 1) │
│ 1177 │ │
│ 1178 │ # 3. ln_f │
│ 1179 │ v = load(awq_key_list[2]) │
│ │
│ /workdir/TensorRT-LLM/examples/llama/weight.py:1096 in process_and_assign_weight │
│ │
│ 1093 │ │ weight = v[0].T.contiguous() │
│ 1094 │ │ [k, n] = weight.shape │
│ 1095 │ │ weight = torch_split(weight, tp_dim) │
│ ❱ 1096 │ │ amax = v[1].reshape((n, k // group_size)).T.contiguous() │
│ 1097 │ │ amax = torch_split(amax, tp_dim) │
│ 1098 │ │ pre_quant_scale = v[2].reshape((1, k)) │
│ 1099 │ │ if tp_dim == 0: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[32064, 40]' is invalid for input of size 1281344
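
The numbers in this second error can be checked by hand. The sketch below is only one reading of them, not a confirmed diagnosis: it reuses the `64 - length % 64` expression visible in the traceback and takes 40 from the failing shape. The assumption that the scale tensor `v[1]` was padded along a flattened axis of length 32032 * 40 (rather than along 32032 rows) is mine; it is labeled in the code, and the helper name `pad_amount` is hypothetical.

```python
# Values copied from the second build command and its error message.
VOCAB_SIZE = 32032         # --vocab_size passed to build.py
ALIGN = 64                 # padding multiple used in load_from_awq_llama
GROUPS_PER_ROW = 40        # second entry of the failing shape '[32064, 40]'
OBSERVED_NUMEL = 1281344   # "input of size 1281344"


def pad_amount(length: int, align: int = ALIGN) -> int:
    """Mirror the expression `64 - length % 64` from the traceback.

    Note that this yields `align` (not 0) when `length` is already a multiple.
    """
    return align - length % align


# Weight rows after padding: this matches the 32064 in the failing shape.
padded_vocab = VOCAB_SIZE + pad_amount(VOCAB_SIZE)

# Element count the reshape to (padded_vocab, 40) would need.
expected_numel = padded_vocab * GROUPS_PER_ROW

# ASSUMPTION: the scales were padded along a flattened axis instead of per row.
unpadded_numel = VOCAB_SIZE * GROUPS_PER_ROW
flat_padded_numel = unpadded_numel + pad_amount(unpadded_numel)

print("padded vocab:", padded_vocab)
print("elements needed by reshape:", expected_numel)
print("elements if padded on a flattened axis:", flat_padded_numel)
print("matches the error message:", flat_padded_numel == OBSERVED_NUMEL)
```

Under that assumption the padded scale tensor has exactly the 1281344 elements reported, while the reshape needs 32064 * 40 = 1282560, which would explain why the shape is rejected.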