docs/guides/dynamo_run.md (22 additions, 5 deletions)
```diff
@@ -12,7 +12,7 @@
 * [llama.cpp](#llamacpp)
 * [Sglang](#sglang)
 * [Vllm](#vllm)
-* [TensorRT-LLM](#tensorrt-llm-engine)
+* [TensorRT-LLM](#trtllm)
 * [Echo Engines](#echo-engines)
 * [Writing your own engine in Python](#writing-your-own-engine-in-python)
 * [Batch mode](#batch-mode)
```
```diff
@@ -437,10 +437,13 @@ Startup can be slow so you may want to `export DYN_LOG=debug` to see progress.
 
 Shutdown: `ray stop`
 
-#### TensorRT-LLM engine
+#### trtllm
 
-To run a TRT-LLM model with dynamo-run we have included a Python-based [async engine](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/engines/agg_engine.py).
-To configure the TensorRT-LLM async engine, please see [llm_api_config.yaml](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/configs/llm_api_config.yaml). The file defines the options that need to be passed to the LLM engine. Follow the steps below to serve trtllm on dynamo run.
+Uses [TensorRT-LLM's LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/), a high-level Python API.
+
+You can use `--extra-engine-args` to pass extra arguments to the LLM API engine.
+
+The trtllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with JetStream (`nats-server -js`) to be running.
 
 ##### Step 1: Build the environment
 
```
```diff
@@ -454,7 +457,7 @@ See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/t
 
 Execute the following to load the TensorRT-LLM model specified in the configuration.
```
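For a concrete picture of what the new doc text describes, here is a minimal sketch of how values from an `--extra-engine-args` YAML could be forwarded to TensorRT-LLM's LLM API. This is illustrative only, not the shipped `agg_engine.py`: the `model_path` key, the defaults, and the exact `BuildConfig` fields are assumptions that vary by TensorRT-LLM version.

```python
# Illustrative sketch: load a hypothetical extra-engine-args YAML and pass
# its values to TensorRT-LLM's high-level LLM API.
import yaml
from tensorrt_llm import LLM, BuildConfig

with open("llm_api_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Sequence-length limits (max_input_len, max_seq_len) are engine-build
# settings, which is why the CLI's --context-length flag is ignored.
build_config = BuildConfig(
    max_input_len=cfg.get("max_input_len", 1024),
    max_seq_len=cfg.get("max_seq_len", 2048),
)

llm = LLM(model=cfg["model_path"], build_config=build_config)  # assumed key
print(llm.generate("Hello, my name is"))
```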
"--kv-block-size", type=int, default=32, help="Size of a KV cache block."
230
236
)
237
+
parser.add_argument(
238
+
"--context-length",
239
+
type=int,
240
+
default=None,
241
+
help="This argument is not used by TRTLLM. Please provide max_input_len, max_seq_len and max_output_len in yaml file and point --extra-engine-args to the yaml file.",
242
+
)
231
243
parser.add_argument(
232
244
"--extra-engine-args",
233
245
type=str,
```diff
@@ -241,6 +253,12 @@ def cmd_line_args():
     )
     args = parser.parse_args()
 
+    if args.context_length is not None:
+        warnings.warn(
+            "--context-length is accepted for compatibility but will be ignored for TensorRT-LLM. Please provide max_input_len, max_seq_len and max_output_len in a YAML file and point --extra-engine-args to that file.",
```