
Commit 3515141

docs: add usage instructions on logits processors

Signed-off-by: Bhuvan Agrawal <11240550+bhuvan002@users.noreply.github.com>

1 parent 4e4b9a2 commit 3515141

File tree

1 file changed: +58 −0 lines


components/backends/trtllm/README.md

Lines changed: 58 additions & 0 deletions
@@ -43,6 +43,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

- [Client](#client)
- [Benchmarking](#benchmarking)
- [Multimodal Support](#multimodal-support)
- [Logits Processing](#logits-processing)
- [Performance Sweep](#performance-sweep)

## Feature Support Matrix
@@ -242,6 +243,63 @@ To benchmark your deployment with GenAI-Perf, see this utility script, configuri

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).

## Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

### How it works

- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor`, which defines `__call__(input_ids, logits)` and modifies `logits` in place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See the example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
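To make the interface concrete, here is a minimal sketch of a processor that forbids a fixed set of token IDs by masking their logits to `-inf` in place. `BannedTokensProcessor` is a hypothetical name, not part of Dynamo, and real processors receive a `torch.Tensor`; a plain list is used here so the sketch runs without torch, but the in-place contract it illustrates is the same.

```python
from typing import List, Sequence


class BannedTokensProcessor:
    """Hypothetical processor: forbids a fixed set of token IDs.

    Mirrors the BaseLogitsProcessor shape: __call__ receives the token
    IDs generated so far and the next-token logits, mutates the logits
    in place, and returns nothing.
    """

    def __init__(self, banned_token_ids: Sequence[int]):
        self.banned = set(banned_token_ids)

    def __call__(self, input_ids: Sequence[int], logits: List[float]) -> None:
        # Mask each banned token so it can never be sampled.
        for token_id in self.banned:
            if 0 <= token_id < len(logits):
                logits[token_id] = float("-inf")


logits = [0.5, 1.2, -0.3, 2.0]
BannedTokensProcessor([1, 3])([101, 7], logits)
print(logits)  # [0.5, -inf, -0.3, -inf]
```

Note that `__call__` returns `None`: the engine keeps using the buffer it passed in, so all changes must happen in place.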
### Quick test: HelloWorld processor

You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful for verifying the wiring without modifying your model or engine code.

```bash
cd $DYNAMO_HOME/components/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- The expected chat response contains "Hello world".
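Once the server is up, you can verify end to end with a chat request. This fragment assumes the frontend's OpenAI-compatible endpoint is listening on `localhost:8000`; the port and model name depend on your deployment, so substitute your own values.

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model>",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "max_tokens": 16
  }'
# Regardless of the prompt, the returned content should contain "Hello world".
```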
### Bring your own processor

Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in place. For example, temperature scaling:

```python
from typing import Sequence

import torch

from dynamo.logits_processing import BaseLogitsProcessor


class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```
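To see why dividing by the temperature changes sampling behavior, here is a standalone numeric check (plain Python, no torch): temperatures below 1 sharpen the softmax distribution and temperatures above 1 flatten it.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
cool = softmax([x / 0.5 for x in logits])  # temperature 0.5: sharper
base = softmax(logits)                     # temperature 1.0: unchanged
hot = softmax([x / 2.0 for x in logits])   # temperature 2.0: flatter

# The top token's probability shrinks as temperature grows.
print(round(max(cool), 3), round(max(base), 3), round(max(hot), 3))
```

The same effect is what `logits.div_(self.temperature)` produces on the real tensor before the engine samples from it.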
Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:

```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```
### Current limitations

- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in place rather than returning a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
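The in-place requirement above is easy to violate by accident. A standalone sketch (plain lists standing in for tensors, with hypothetical function names) of the difference:

```python
def scale_in_place(logits, temperature):
    # Correct: mutates the buffer the engine handed us.
    for i in range(len(logits)):
        logits[i] /= temperature


def scale_by_rebinding(logits, temperature):
    # Incorrect: rebinding the local name builds a new list;
    # the caller's buffer is left untouched.
    logits = [x / temperature for x in logits]


buf = [2.0, 4.0]
scale_by_rebinding(buf, 2.0)
print(buf)  # [2.0, 4.0] -- the engine never sees the change
scale_in_place(buf, 2.0)
print(buf)  # [1.0, 2.0]
```

With torch tensors the same rule applies: use mutating ops such as `div_` or indexed assignment, not expressions that produce a new tensor.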
## Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
