4 changes: 2 additions & 2 deletions README.md
@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
# Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
-python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
+python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem]

# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
#### Send a Request

```bash
-curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [
{

2 changes: 1 addition & 1 deletion components/backends/mocker/README.md
@@ -37,7 +37,7 @@ python -m dynamo.mocker \
--enable-prefix-caching

# Start frontend server
-python -m dynamo.frontend --http-port 8080
+python -m dynamo.frontend --http-port 8000
```

### Legacy JSON file support:

2 changes: 1 addition & 1 deletion components/backends/vllm/deepseek-r1.md
@@ -26,7 +26,7 @@ node 1
On node 0 (where the frontend was started) send a test request to verify your deployment:

```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",

2 changes: 1 addition & 1 deletion components/backends/vllm/deploy/README.md
@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se
Send a test request to verify your deployment:

```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",

2 changes: 1 addition & 1 deletion components/frontend/README.md
@@ -1,6 +1,6 @@
# Dynamo frontend node.

-Usage: `python -m dynamo.frontend [--http-port 8080]`.
+Usage: `python -m dynamo.frontend [--http-port 8000]`.

This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`.
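
A minimal sketch of that flow, assuming an SGLang worker as in the repository README; the `/v1/models` check is an assumption about the OpenAI-compatible surface, not something this diff shows:

```bash
# Start the frontend on the new default port and one worker for it to discover.
python -m dynamo.frontend --http-port 8000 &
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B &

# After the worker calls register_llm, it should show up in the model list.
curl localhost:8000/v1/models
```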


2 changes: 1 addition & 1 deletion container/launch_message.txt
@@ -48,7 +48,7 @@ tools.

Try the following to begin interacting with a model:
> dynamo --help
-> python -m dynamo.frontend [--http-port 8080]
+> python -m dynamo.frontend [--http-port 8000]
> python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct

To run more complete deployment examples, instances of etcd and nats need to be

2 changes: 1 addition & 1 deletion deploy/metrics/README.md
@@ -23,7 +23,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS

4 changes: 2 additions & 2 deletions docs/_includes/quick_start_local.rst
@@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands:

.. code-block:: bash

-# Start the OpenAI compatible frontend (default port is 8080)
+# Start the OpenAI compatible frontend (default port is 8000)
python -m dynamo.frontend

# In another terminal, start an SGLang worker
@@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands:

.. code-block:: bash

-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],

4 changes: 2 additions & 2 deletions docs/architecture/dynamo_flow.md
@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement
The primary user journey through the system:

1. **Discovery (S1)**: Client discovers the service endpoint
-2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080)
+2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker

@@ -84,7 +84,7 @@ graph TD
%% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8080</i>"]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]]

%% Processing Layer

4 changes: 2 additions & 2 deletions docs/components/router/README.md
@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati
To launch the Dynamo frontend with the KV Router:

```bash
-python -m dynamo.frontend --router-mode kv --http-port 8080
+python -m dynamo.frontend --router-mode kv --http-port 8000
```

This command:
- Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8080 (configurable)
+- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
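
As a hedged illustration of the multi-worker case, reusing the SGLang worker command from the repository README; running two copies of the same model is an assumption for demonstration, and each copy needs its own GPU:

```bash
# KV-routed frontend on the default port.
python -m dynamo.frontend --router-mode kv --http-port 8000 &

# Two workers serving the same model; the router chooses between them per request.
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B &
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B &
```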

Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:

2 changes: 1 addition & 1 deletion docs/guides/dynamo_deploy/create_deployment.md
@@ -88,7 +88,7 @@ Here's a template structure based on the examples:
Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the
`extraPodSpec: -> mainContainer: -> args:`

-The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]"
+The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../dynamo_run.md) for details on how to run this command.


2 changes: 1 addition & 1 deletion docs/guides/metrics.md
@@ -79,7 +79,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS

4 changes: 2 additions & 2 deletions docs/guides/planner_benchmark/README.md
@@ -46,7 +46,7 @@ genai-perf profile \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--endpoint-type chat \
---url http://localhost:8080 \
+--url http://localhost:8000 \
--streaming \
--input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
@@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
# TODO

# in terminal 2
-genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8080 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
+genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```

## Results

2 changes: 1 addition & 1 deletion docs/support_matrix.md
@@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo


> [!Caution]
-> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8080 for frontend).
+> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
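
A hedged sketch of the port-mapping alternative described in the caution above; `DYNAMO_IMAGE` is a placeholder for whatever TensorRT-LLM container you built or pulled:

```bash
# Map only the ports the deployment needs instead of using --network host:
# 4222 (NATS), 2379/2380 (etcd), 8000 (frontend).
docker run -it --rm \
  -p 4222:4222 \
  -p 2379:2379 \
  -p 2380:2380 \
  -p 8000:8000 \
  "$DYNAMO_IMAGE"
```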


## Build Support

12 changes: 6 additions & 6 deletions examples/multimodal/README.md
@@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct

In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
@@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf

In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
@@ -223,7 +223,7 @@ bash launch/agg_llama.sh

In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
@@ -295,7 +295,7 @@ bash launch/disagg_llama.sh

In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
@@ -366,7 +366,7 @@ bash launch/video_agg.sh

In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
@@ -455,7 +455,7 @@ bash launch/video_disagg.sh

In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",

4 changes: 2 additions & 2 deletions lib/runtime/examples/system_metrics/README.md
@@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server
The server will start a system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`.


-To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8080.
+To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8000.
```
python -m dynamo.frontend &

@@ -202,5 +202,5 @@ Once running, you can query the metrics:
curl http://localhost:8081/metrics | grep -E "dynamo_component"

# Get all frontend metrics
-curl http://localhost:8080/metrics | grep -E "dynamo_frontend"
+curl http://localhost:8000/metrics | grep -E "dynamo_frontend"
```
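
Putting the pieces of this example together, a hedged end-to-end sketch. The vLLM worker command is borrowed from the container launch message earlier in this diff, and it is an assumption that the `DYN_SYSTEM_*` variables enable the worker-side status server the same way they do for `system_server`:

```bash
# Frontend on its default port (8000) plus a worker with the system status server on 8081.
python -m dynamo.frontend &
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct &

# Scrape both Prometheus endpoints once the processes are up.
curl -s http://localhost:8000/metrics | grep -E "dynamo_frontend"
curl -s http://localhost:8081/metrics | grep -E "dynamo_component"
```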

6 changes: 3 additions & 3 deletions tests/lmcache/README.md
@@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py

### Baseline Architecture (deploy-baseline-dynamo.sh)
```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference
Environment: ENABLE_LMCACHE=0
```

### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference
Environment: ENABLE_LMCACHE=1
LMCACHE_CHUNK_SIZE=256
LMCACHE_LOCAL_CPU=True
@@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1
Test scripts use Dynamo's Chat Completions API:

```bash
-curl -X POST http://localhost:8080/v1/chat/completions \
+curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": Qwen/Qwen3-0.6B,

2 changes: 1 addition & 1 deletion tests/lmcache/mmlu-baseline-dynamo.py
@@ -18,7 +18,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py

# ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) without LMCache
+# 1. dynamo is running (default: localhost:8000) without LMCache
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)

2 changes: 1 addition & 1 deletion tests/lmcache/mmlu-lmcache_enabled-dynamo.py
@@ -17,7 +17,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py

# ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) with LMCache enabled
+# 1. dynamo is running (default: localhost:8000) with LMCache enabled
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)