Environment

CPU architecture: x86_64
GPU name: NVIDIA A10
TensorRT branch: 9.0.0
TensorRT-LLM: 0.1.3
CUDA: 12.1.66
cuDNN: 8.9.0
Container: registry.cn-hangzhou.aliyuncs.com/trt-hackathon/trt-hackathon:final_v1
NVIDIA driver version: 525.105.17
OS: Ubuntu 22.04.3 LTS x86_64
Kernel: 5.15.0-73-generic
Brief problem description

I pulled the https://huggingface.co/bigcode/starcoderbase-7b model and ran inference directly with PyTorch, then converted the model to TensorRT-LLM and ran inference again. There is no significant performance difference between the two.
Reproduction code

PyTorch inference code

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./"  # local directory holding the starcoderbase-7b weights
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().cuda()
end_token = "<fim_suffix>"

# First inference (cold: includes CUDA context and kernel warmup).
t1 = time.time()
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").cuda()
outputs = model.generate(inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t2 = time.time()

# Second inference (warm).
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").cuda()
outputs = model.generate(inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t3 = time.time()
print(f"cost: 1st infer: {t2 - t1}, 2nd infer: {t3 - t2}")
```
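One caveat with this script: `time.time()` around `generate()` measures a single run, CUDA work is launched asynchronously, and the first call is dominated by warmup. A more robust measurement would synchronize the device around the timed region and average several warmed-up runs. A minimal sketch, reusing the `model`, `tokenizer`, and `end_token` objects from the script above (`timed_generate`, `n_runs`, and `n_warmup` are illustrative names, not an existing API):

```python
import time

import torch


def timed_generate(model, tokenizer, prompt, end_token, n_runs=5, n_warmup=2):
    """Average warmed-up latency of model.generate over n_runs runs."""
    inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
    eos_id = tokenizer.convert_tokens_to_ids(end_token)
    for _ in range(n_warmup):  # untimed warmup runs
        model.generate(inputs, max_new_tokens=20,
                       pad_token_id=tokenizer.pad_token_id, eos_token_id=eos_id)
    torch.cuda.synchronize()  # drain warmup work before starting the clock
    t0 = time.time()
    for _ in range(n_runs):
        model.generate(inputs, max_new_tokens=20,
                       pad_token_id=tokenizer.pad_token_id, eos_token_id=eos_id)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.time() - t0) / n_runs


print(f"avg warm latency: {timed_generate(model, tokenizer, 'def print_hello_world():', end_token):.3f}s")
```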
PyTorch performance

(Timing output not preserved in this text.)

TensorRT-LLM model conversion and inference code

```bash
python3 hf_gpt_convert.py -p 1 --model starcoder -i ../../starcoderbase-7b \
    -o ./c-model/starcoder --tensor-parallelism 1 --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/1-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp1 \
    --world_size 1

mpirun -np 1 --allow-run-as-root python3 run.py \
    --engine_dir starcoder_outputs_tp1 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 20
```
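One thing worth checking before drawing conclusions (an observation, not from the original report): a 20-token generation is short enough that one-time costs such as engine setup and MPI launch can swamp per-token decode speed, on both the PyTorch and the TensorRT-LLM side. Rerunning with a longer generation would expose steady-state throughput; for the TensorRT-LLM side that is the same `run.py` invocation with a larger `--max_output_len` (the value 200 below is arbitrary):

```bash
# Same engine and flags as above; only --max_output_len is raised so that
# steady-state decode speed dominates one-time setup cost.
mpirun -np 1 --allow-run-as-root python3 run.py \
    --engine_dir starcoder_outputs_tp1 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 200
```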
TensorRT-LLM performance

(Timing output not preserved in this text.)

From the results above, on the second (warmed-up) inference there is no significant difference between the PyTorch version and the TensorRT-LLM version.
See https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/. TensorRT-LLM's gains may come mainly from multi-GPU tensor parallelism.
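If tensor parallelism is where the gains are, the same pipeline could be rerun split across two GPUs. A sketch under stated assumptions: it reuses only flags that already appear in the commands above; the TP degree of 2 and the `2-gpu` / `starcoder_outputs_tp2` names are illustrative; and it presumes a second GPU is available (the environment above lists a single A10):

```bash
# Hypothetical 2-GPU tensor-parallel variant of the commands above.
python3 hf_gpt_convert.py -p 1 --model starcoder -i ../../starcoderbase-7b \
    -o ./c-model/starcoder --tensor-parallelism 2 --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/2-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp2 \
    --world_size 2

# One MPI rank per GPU.
mpirun -np 2 --allow-run-as-root python3 run.py \
    --engine_dir starcoder_outputs_tp2 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 20
```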