Update TensorRT-LLM #2297

kaiyux · 2024-10-08T09:58:39Z

Features
- ReDrafter beam search logic is updated to match Apple's ReDrafter v1.1.
- Draft-Target speculative decoding can now be done natively with TensorRT-LLM alone. The driver code is in examples/run.py and the documentation is in examples/draft_target_model/README.md.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
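
For readers new to the Draft-Target scheme mentioned in the list above, the following is a minimal, library-free sketch of the general draft-then-verify idea with greedy acceptance. It is not the code in examples/run.py: the function names (`speculative_step`, `generate`) and the toy models are hypothetical and exist only for illustration.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     num_draft_tokens: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The draft model proposes num_draft_tokens tokens autoregressively.
    proposed: List[int] = []
    for _ in range(num_draft_tokens):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposal. A real engine scores all
    #    positions in a single forward pass; this loop only mimics the result.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             max_new_tokens: int, num_draft_tokens: int = 4) -> List[int]:
    """Generate max_new_tokens tokens with repeated speculative steps."""
    output = list(prompt)
    while len(output) - len(prompt) < max_new_tokens:
        output += speculative_step(output, draft, target, num_draft_tokens)
    return output[:len(prompt) + max_new_tokens]


# Toy models: the target counts upward modulo 10; the draft agrees except
# when the next token would be 5, where it deliberately guesses wrong.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 5 else nxt
```

With greedy acceptance the output is identical to what the target model would produce on its own; the draft model only changes how many target passes are needed.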
API
- Added logits processor support to the ModelRunnerCpp class.
- Added the isParticipant method to the C++ Executor API to check whether the current process is a participant in the executor instance.
- [BREAKING CHANGE] Removed builder_opt from build_config and the trtllm-build command.
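
As background for the logits processor item above, here is a minimal sketch of what a logits processor does in general: it rewrites the next-token scores before a token is chosen. The callback signature below is an assumption made for illustration and is not the signature that ModelRunnerCpp expects; `ban_tokens_processor` and `greedy_pick` are hypothetical helpers.

```python
import math
from typing import Callable, List, Sequence

# Assumed callback shape for this sketch: (logits, generated tokens) -> logits.
Processor = Callable[[List[float], List[int]], List[float]]


def ban_tokens_processor(banned: Sequence[int]) -> Processor:
    """Build a processor that makes the given token ids unselectable."""
    banned_set = set(banned)

    def process(logits: List[float], generated: List[int]) -> List[float]:
        # `generated` is unused here; other processors may consult it,
        # for example to penalize repetition.
        return [-math.inf if i in banned_set else x
                for i, x in enumerate(logits)]

    return process


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.1, 2.0, 1.5]
processor = ban_tokens_processor([1])
unconstrained = greedy_pick(logits)                 # token 1 has the top score
constrained = greedy_pick(processor(logits, []))    # token 1 is masked out
```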
Bug fixes
- Explicitly specify strongly_typed=False when building the fp16 vision engine for the multimodal example. TensorRT 10 changed the default to strongly_typed=True, so fp32 vision engines were built even when the input ONNX files were fp16. This issue is now fixed.
- Fixed an issue with SmoothQuant calibration on custom datasets. Many thanks to @Bhuvanesh09 for the contribution in "fix: add support for passing calib sequence length, and num samples + fixing use of custom calibration dataset for smoothquant in llama" #2243.
- Fixed an issue with trtllm-build --fast-build when using fake or random weights. Thanks to @ZJLi2013 for flagging it in "trtllm-build with --fast-build ignore transformer layers" #2135.
- Fixed an accuracy issue in speculative decoding. Also changed the internal handling of speculative decoding logits to be similar to HuggingFace's assistant_model.
Performance
- Improved customAllReduce performance by using Lamport-style AllReduce + Norm fusion.
- Set static input tensors once at the beginning instead of on each iteration. (This should be especially noticeable for RNN-based models because the RNN state pointers are currently separate for each layer.)
- The draft model can now trigger a device memcpy over MPI to the target model's process in orchestrator mode. This reduces the latency between the end of draft model generation and the beginning of target inference.
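
The static input tensor item above can be pictured with a small sketch. It assumes a hypothetical `ToyContext` that simply counts binding calls; it is not the TensorRT-LLM runtime. The point is only that tensors whose buffers never change are bound once before the loop, so the per-iteration cost no longer grows with the number of layers.

```python
from typing import Dict, List


class ToyContext:
    """Hypothetical stand-in for an execution context that records bindings."""

    def __init__(self) -> None:
        self.bindings: Dict[str, object] = {}
        self.bind_calls = 0

    def set_tensor(self, name: str, buffer: object) -> None:
        self.bindings[name] = buffer
        self.bind_calls += 1


def run_steps(context: ToyContext, static: Dict[str, object],
              per_step: List[Dict[str, object]]) -> int:
    """Bind static tensors once, then rebind only per-step tensors."""
    # Buffers that stay the same for the whole run are bound before the loop.
    for name, buffer in static.items():
        context.set_tensor(name, buffer)
    # Inside the loop, rebind only what changes between iterations.
    for step_inputs in per_step:
        for name, buffer in step_inputs.items():
            context.set_tensor(name, buffer)
    return context.bind_calls


# Four layers, each with its own state buffer, and three decoding steps.
static_inputs = {f"rnn_state_layer_{i}": bytearray(8) for i in range(4)}
step_inputs = [{"input_ids": [step]} for step in range(3)]

calls_bound_once = run_steps(ToyContext(), static_inputs, step_inputs)
# Rebinding everything on every step would instead cost this many calls:
calls_rebound_every_step = len(step_inputs) * (len(static_inputs) + 1)
```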

Remove cu

DanBlanaru

lgtm

open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58

8809c33

DanBlanaru approved these changes Oct 8, 2024

DanBlanaru merged commit 8681b3a into main Oct 8, 2024

DanBlanaru deleted the preview/main branch October 8, 2024 10:19
