examples/run.py and documentation is in examples/draft_target_model/README.md.
ModelRunnerCpp class.
isParticipant method to the C++ Executor API to check if the current process is a participant in the executor instance.
trtllm-build command.
strongly_typed=False to build the fp16 vision engine for the multimodal example. TensorRT 10 made the default strongly_typed=True so fp32 vision engines are built, even if input ONNX files are fp16. This issue is now fixed.
trtllm-build --fast-build with fake or random weights. Thanks to @ZJLi2013 for flagging it in trtllm-build with --fast-build ignore transformer layers #2135.
assistant_model.
customAllReduce performance by using Lamport-style AllReduce + Norm fusion.
memcpy over MPI to the target model's process in orchestrator mode. This reduces the latency between the end of the draft model generation and beginning of target inference.