
Update TensorRT-LLM #1793

Merged · 1 commit · Jun 18, 2024

Conversation

Shixiaowei02
Collaborator

  • Model Support
    • Support Qwen1.5 MoE A2.7B
    • Support the Phi-3 vision multimodal model
  • Features
    • Encoder-Decoder C++ Runtime TP Support
    • In-flight batching for explicit draft tokens
    • Support local file for calibration
    • Add batched logits post processor
    • Add Hopper qgmma kernel to XQA JIT codepath
    • Enable TP+EP for MoE
    • Add lookahead decoding layer
  • API
    • [BREAKING CHANGE] Setup buffers for explicit draft tokens decoding
    • [BREAKING CHANGE] Replace all occurrences of max_output_len with max_seq_len
      • This affects trtllm-build and benchmark-related parameters
    • [BREAKING CHANGE] Remove GptSession Python bindings
    • [BREAKING CHANGE] Add runtime max batch size to gptManagerBenchmark
    • Support remaining executor API options in HLAPI
    • Support get_stats and aget_stats in the HL Executor when using multiple GPUs
    • Add iterLatencyMilliSec to stats and iteration log
  • Bug fixes
  • Memory optimization
    • Support stream reader to reduce peak memory when using weight streaming
  • Benchmark
  • Performance
    • Optimize the build time when XQA JIT is enabled
    • Reduce the number of streams when using the fused decoder
  • Infra
  • Documentation
    • Update documents about GEMM plugins
    • Polish enc-dec readme to reflect recent changes
    • Update Mixtral example docs to include Mixtral-8x22B instructions
    • Simplify the RecurrentGemma README
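The max_output_len → max_seq_len rename listed under API is the change most likely to break existing build scripts: instead of bounding generated tokens separately, a single sequence-length budget now covers input plus output. A minimal before/after sketch for trtllm-build (checkpoint and output paths are placeholders; adjust to your setup):

```shell
# Before this release: output length was capped by its own flag.
trtllm-build --checkpoint_dir ./ckpt \
             --output_dir ./engine \
             --max_input_len 1024 \
             --max_output_len 1024   # removed by this release

# After this release: a single max_seq_len bounds input + generated tokens.
trtllm-build --checkpoint_dir ./ckpt \
             --output_dir ./engine \
             --max_input_len 1024 \
             --max_seq_len 2048
```

The same rename applies to benchmark parameters (e.g. gptManagerBenchmark), so scripts passing the old name should be updated in both places.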
