Update TensorRT-LLM #667

Merged
merged 2 commits into from Dec 15, 2023

Conversation

@kaiyux (Member) commented Dec 15, 2023:

  • Model Support
    • BART and mBART support in encoder-decoder models
    • Support FairSeq Neural Machine Translation (NMT) family
    • Mixtral-8x7B model support
      • Support weight loading for HuggingFace Mixtral model
  • Features
    • MPT - Int4 AWQ / SmoothQuant support
    • Support speculative decoding with prefilled KV cache
    • Support AWQ and GPTQ for QWEN
    • Support ReduceScatter plugin
  • Bug fixes
  • Performance
    • Optimize Hopper warp specialized kernels
    • Optimize AllReduce for parallel attention on Falcon and GPT-J
    • Enable split-k for weight-only cutlass kernel when SM>=75

// Validate the parallelism configuration:
TLLM_CHECK(mTensorParallelism > 0);
TLLM_CHECK(mPipelineParallelism > 0);

// numDevices here is the per-node GPU count (mGpusPerNode):
TLLM_CHECK_WITH_INFO(static_cast<SizeType>(numDevices) >= tensorParallelism * pipelineParallelism,

There seems to be a mistake here. In a multi-node setup, it is the product of mGpusPerNode and the number of nodes that should be at least TP multiplied by PP; this check, however, compares only the per-node GPU count, which can legitimately be smaller than the product of TP and PP.
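
For illustration, a minimal sketch of the multi-node-aware check this comment suggests. It is not the actual TensorRT-LLM implementation; the name worldSizeFits and the numNodes parameter are hypothetical, since the original check only sees the per-node device count:

#include <cassert>
#include <cstdio>

// Hypothetical sketch: validate that the cluster-wide GPU count covers
// the TP * PP world size, instead of comparing against one node only.
bool worldSizeFits(int gpusPerNode, int numNodes, int tensorParallelism, int pipelineParallelism)
{
    assert(gpusPerNode > 0 && numNodes > 0);
    assert(tensorParallelism > 0 && pipelineParallelism > 0);
    return gpusPerNode * numNodes >= tensorParallelism * pipelineParallelism;
}

int main()
{
    // Single node: 8 GPUs, TP=8, PP=1 -> fits (prints 1).
    std::printf("%d\n", worldSizeFits(8, 1, 8, 1));
    // Two nodes of 8 GPUs each, TP=8, PP=2 -> fits (prints 1), even though
    // a per-node-only check (8 >= 16) would reject this configuration.
    std::printf("%d\n", worldSizeFits(8, 2, 8, 2));
    return 0;
}
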
@kaiyux (Collaborator) replied:

Thanks for the sharp catch. I will discuss this with the TensorRT-LLM engineers working on the C++ runtime and get back to you later.

June commented:

Hello, is there any progress on this?

@kaiyux (Member, Author) replied:

Hi @leavelet, we have not fully tested multi-node support in TensorRT-LLM; it might work, but there is no guarantee. If you need to try it, we suggest removing the check locally so that it does not block you. Thanks very much for your interest in our work!
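
For readers who want the workaround, a hypothetical local patch sketch, assuming the check appears exactly as quoted above (the surrounding file and the elided message argument are not shown in this thread):

-TLLM_CHECK_WITH_INFO(static_cast<SizeType>(numDevices) >= tensorParallelism * pipelineParallelism,
-    /* info message */);
+// World-size check disabled locally to allow untested multi-node runs;
+// TP * PP configurations larger than one node are then unvalidated.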
