@@ -242,6 +243,63 @@ To benchmark your deployment with GenAI-Perf, see this utility script, configuri
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).

## Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
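
As a rough illustration of the interface described above, the sketch below defines a processor with the `__call__(input_ids, logits)` shape that suppresses a set of token IDs by writing into `logits` in-place. It is a minimal sketch, not Dynamo's implementation: `BanTokensProcessor` is a hypothetical name, the class does not import or subclass `BaseLogitsProcessor`, and it assumes `logits` is a 1-D, index-assignable vector. A plain Python list stands in for the tensor so the snippet runs without extra dependencies.

```python
from typing import MutableSequence, Sequence


class BanTokensProcessor:
    """Hypothetical processor: makes the given token IDs unselectable."""

    def __init__(self, banned_token_ids: Sequence[int]) -> None:
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: Sequence[int], logits: MutableSequence[float]) -> None:
        # Write into the existing buffer; nothing is returned.
        for token_id in self.banned_token_ids:
            logits[token_id] = float("-inf")


logits = [0.5, 2.0, -1.0, 3.0]  # stand-in for one step of next-token logits
BanTokensProcessor([1, 3])(input_ids=[7, 8], logits=logits)
print(logits)  # [0.5, -inf, -1.0, -inf]
```

In a real deployment the class would conform to `BaseLogitsProcessor` and be wrapped with `create_trtllm_adapters(...)`; the arguments of that function are not shown in this section, so they are left out here.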

### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.

```bash
cd $DYNAMO_HOME/components/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- The chat response is expected to contain "Hello world".
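
Conceptually, a processor can force a fixed reply by leaving only one token selectable at each decoding step. The sketch below illustrates that idea; it is not the `hello_world.py` example from the repository. `ForcedSequenceProcessor` and the token IDs are made up for illustration (a real processor would obtain the IDs from the tokenizer), and a plain list stands in for the logits tensor.

```python
from typing import MutableSequence, Sequence


class ForcedSequenceProcessor:
    """Hypothetical processor that forces a fixed sequence of token IDs."""

    def __init__(self, target_token_ids: Sequence[int]) -> None:
        self.target_token_ids = list(target_token_ids)
        self.step = 0  # number of forced tokens emitted so far

    def __call__(self, input_ids: Sequence[int], logits: MutableSequence[float]) -> None:
        if self.step >= len(self.target_token_ids):
            return  # sequence finished; leave the logits untouched
        target = self.target_token_ids[self.step]
        # Mask every token except the target, in place.
        for token_id in range(len(logits)):
            logits[token_id] = 0.0 if token_id == target else float("-inf")
        self.step += 1


processor = ForcedSequenceProcessor([2, 0])  # made-up IDs for a two-token reply
generated = []
for _ in range(2):
    logits = [0.1, 0.9, 0.3, 0.7]  # whatever the model produced this step
    processor(generated, logits)
    # Greedy pick: index of the largest logit.
    generated.append(max(range(len(logits)), key=logits.__getitem__))
print(generated)  # [2, 0]
```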

### Bring your own processor
To implement a processor, conform to `BaseLogitsProcessor` and modify the logits in-place. For example, temperature scaling:

```python
from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor
# ... (remaining lines of the example are not shown in this diff view)
```

- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
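
The in-place requirement matters because rebinding the local name `logits` inside a processor leaves the caller's buffer unchanged. The following sketch contrasts the two cases; both functions are hypothetical, and a plain list stands in for the tensor. With a `torch.Tensor`, the in-place form would be an operation such as `logits.div_(2.0)` or `logits.copy_(new_values)`.

```python
from typing import MutableSequence, Sequence


def scale_rebinding(input_ids: Sequence[int], logits: MutableSequence[float]) -> None:
    # Wrong: builds a new list and rebinds the local name only.
    logits = [value / 2.0 for value in logits]


def scale_in_place(input_ids: Sequence[int], logits: MutableSequence[float]) -> None:
    # Right: overwrites the contents of the caller's buffer.
    for i, value in enumerate(logits):
        logits[i] = value / 2.0


buffer_a = [2.0, 4.0]
scale_rebinding([], buffer_a)
print(buffer_a)  # [2.0, 4.0] (unchanged)

buffer_b = [2.0, 4.0]
scale_in_place([], buffer_b)
print(buffer_b)  # [1.0, 2.0]
```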
## Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.