Update TensorRT-LLM #2413

kaiyux · 2024-11-05T07:38:55Z

Model Support
- Added support for InternVL2, see examples/multimodal/README.md.
- Added support for Qwen2-0.5B model. (Qwen2-1.5B-Instruct convert_checkpoint.py failed #2388)
Features
- Added support for per-token per-channel FP8 (namely row-wise FP8) on Ada.
- The maximum supported beam_width is extended to 256.
- Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only), see examples/multimodal/README.md.
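
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "per-token per-channel" (row-wise) FP8 scaling means: each activation row (token) and each weight row (output channel) gets its own scale. It is not TensorRT-LLM's implementation. The function names are hypothetical, and FP8 is simulated by assuming the E4M3 format (3 mantissa bits, maximum magnitude 448).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the assumed E4M3 format


def round_to_e4m3(x):
    """Round a Python float to the nearest simulated FP8 E4M3 value."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)       # saturate instead of overflowing
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)                  # subnormals share the minimum exponent
    step = 2.0 ** (exp - 3)               # spacing with 3 mantissa bits
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rowwise(matrix):
    """Quantize every row with its own scale; returns (quantized_rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def fp8_rowwise_matmul(x, w):
    """Compute x @ w^T with per-token scales for x and per-channel scales for w.

    x has shape [tokens][k]; w has shape [out_channels][k].
    """
    qx, sx = quantize_rowwise(x)  # one scale per token (row of x)
    qw, sw = quantize_rowwise(w)  # one scale per output channel (row of w)
    return [
        [sx[t] * sw[c] * sum(a * b for a, b in zip(qx[t], qw[c]))
         for c in range(len(w))]
        for t in range(len(x))
    ]


if __name__ == "__main__":
    x = [[1.0, -2.0], [100.0, 50.0]]   # two tokens with very different ranges
    w = [[0.5, 0.25], [-3.0, 1.0]]     # two output channels
    print(fp8_rowwise_matmul(x, w))
```

Because the second token is about 50 times larger than the first, a single per-tensor scale would spend most of the FP8 range on it. Separate row scales let both tokens use the full range.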
API
- [BREAKING CHANGE] Removed Python bindings of GptManager.
- Exposed --trust_remote_code argument to the OpenAI API server. (openai_server error #2357)
Bug fixes
- Fixed an issue that appears when building BERT. (Bug in build bert #2373)
- Fixed an issue where the model is not loaded when building BERT. (#2379)
- Fixed the broken executor examples. (Succeeded in Python runtime, but failed in C++ runtime #2294)
Infra
- The dependent ModelOpt version is updated to 0.19.0.

kaiyux added 2 commits November 5, 2024 07:27

open source 92c307ad86369ee668e2a6eb9d8d5e7ce549f4bb

04a2cac

Remove unexpected file.

49e150d

Shixiaowei02 approved these changes Nov 5, 2024

kaiyux merged commit b7868dd into main Nov 5, 2024

kaiyux deleted the preview/main branch November 5, 2024 08:27

aikitoria mentioned this pull request Nov 12, 2024

FP8 rowwise support possible for SM89? #2229

Closed
