Environment

CPU architecture: x86_64
GPU name: NVIDIA A10
TensorRT branch: 9.0.0
TensorRT-LLM: 0.1.3
CUDA: 12.1.66
cuDNN: 8.9.0
Container: registry.cn-hangzhou.aliyuncs.com/trt-hackathon/trt-hackathon:final_v1
NVIDIA driver version: 525.105.17
OS: Ubuntu 22.04.3 LTS x86_64
Kernel: 5.15.0-73-generic
Brief problem description

I pulled the https://huggingface.co/bigcode/starcoderbase-7b model and ran inference directly with PyTorch, then converted the model to TensorRT-LLM and ran inference again. There is no significant performance difference between the two.
Reproduction code

PyTorch inference code

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./"  # local directory holding the starcoderbase-7b weights
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().cuda()
end_token = "<fim_suffix>"

# First inference (cold: includes CUDA context and kernel warmup).
t1 = time.time()
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").cuda()
outputs = model.generate(inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t2 = time.time()

# Second inference (warm).
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").cuda()
outputs = model.generate(inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t3 = time.time()
print(f"cost: 1st infer: {t2 - t1}, 2nd infer: {t3 - t2}")
```
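One caveat with this script: `time.time()` around `generate()` measures a single run, CUDA work is launched asynchronously, and the first call is dominated by warmup. A more robust measurement would synchronize the device around the timed region and average several warmed-up runs. A minimal sketch, reusing the `model`, `tokenizer`, and `end_token` objects from the script above (`timed_generate`, `n_runs`, and `n_warmup` are illustrative names, not an existing API):

```python
import time

import torch


def timed_generate(model, tokenizer, prompt, end_token, n_runs=5, n_warmup=2):
    """Average warmed-up latency of model.generate over n_runs runs."""
    inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
    eos_id = tokenizer.convert_tokens_to_ids(end_token)
    for _ in range(n_warmup):  # untimed warmup runs
        model.generate(inputs, max_new_tokens=20,
                       pad_token_id=tokenizer.pad_token_id, eos_token_id=eos_id)
    torch.cuda.synchronize()  # drain warmup work before starting the clock
    t0 = time.time()
    for _ in range(n_runs):
        model.generate(inputs, max_new_tokens=20,
                       pad_token_id=tokenizer.pad_token_id, eos_token_id=eos_id)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.time() - t0) / n_runs


print(f"avg warm latency: {timed_generate(model, tokenizer, 'def print_hello_world():', end_token):.3f}s")
```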
PyTorch performance

(Timing output not preserved in this text.)

TensorRT-LLM model conversion and inference code

```bash
python3 hf_gpt_convert.py -p 1 --model starcoder -i ../../starcoderbase-7b \
    -o ./c-model/starcoder --tensor-parallelism 1 --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/1-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp1 \
    --world_size 1

mpirun -np 1 --allow-run-as-root python3 run.py \
    --engine_dir starcoder_outputs_tp1 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 20
```
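One thing worth checking before drawing conclusions (an observation, not from the original report): a 20-token generation is short enough that one-time costs such as engine setup and MPI launch can swamp per-token decode speed, on both the PyTorch and the TensorRT-LLM side. Rerunning with a longer generation would expose steady-state throughput; for the TensorRT-LLM side that is the same `run.py` invocation with a larger `--max_output_len` (the value 200 below is arbitrary):

```bash
# Same engine and flags as above; only --max_output_len is raised so that
# steady-state decode speed dominates one-time setup cost.
mpirun -np 1 --allow-run-as-root python3 run.py \
    --engine_dir starcoder_outputs_tp1 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 200
```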
TensorRT-LLM performance

(Timing output not preserved in this text.)

From the results above, on the second (warmed-up) inference there is no significant difference between the PyTorch version and the TensorRT-LLM version.
See https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/. TensorRT-LLM's gains may come mainly from multi-GPU tensor parallelism.
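If tensor parallelism is where the gains are, the same pipeline could be rerun split across two GPUs. A sketch under stated assumptions: it reuses only flags that already appear in the commands above; the TP degree of 2 and the `2-gpu` / `starcoder_outputs_tp2` names are illustrative; and it presumes a second GPU is available (the environment above lists a single A10):

```bash
# Hypothetical 2-GPU tensor-parallel variant of the commands above.
python3 hf_gpt_convert.py -p 1 --model starcoder -i ../../starcoderbase-7b \
    -o ./c-model/starcoder --tensor-parallelism 2 --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/2-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp2 \
    --world_size 2

# One MPI rank per GPU.
mpirun -np 2 --allow-run-as-root python3 run.py \
    --engine_dir starcoder_outputs_tp2 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 20
```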